Optimizing Coding LLM Performance through Human Data Generation and Feedback

Industry: GenAI

Technology: Python, Jupyter Notebook, PyTorch, Label Studio

Location: USA

Client since: 2024

Client Overview

Our client is a top tech innovator known worldwide for mission-critical software. They wanted to upgrade their internally developed coding LLM, which handles automated code reviews and bug detection, to be more precise and capable of managing the complexities of a vast legacy codebase.

Business Challenge

The client's main challenge was that their existing coding LLM struggled to catch subtle issues and quirks buried within legacy code. The root cause was unreliable training data, mostly due to outdated or inconsistent annotations. Standard automatic labeling tools fell short because they couldn't grasp the intricate context unique to older systems, which led to several key problems.

  • The varying quality of annotations across different code samples led to inconsistent model training results.
  • The existing datasets didn’t sufficiently capture common issues or established best practices relevant to legacy software environments.
  • The lack of an effective, iterative human review process hindered the rapid identification and correction of annotation errors, delaying improvement efforts.

Solution

First, we assembled a dedicated group of experienced developers to comb through the legacy codebase. The team hand-picked over 10,000 relevant snippets, carefully labeling each piece to identify specific inefficiencies, common bugs, and potential optimizations.
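The case study does not describe how the snippets reached the annotators, but a common setup with Label Studio is to wrap each snippet in the tool's JSON task format and pair it with a labeling config. The sketch below is illustrative only; the snippets/ directory, file names, and issue categories are assumptions, not the client's actual taxonomy.

```python
"""Illustrative sketch: packaging legacy-code snippets as Label Studio tasks.

Assumes a hypothetical `snippets/` directory of extracted code fragments;
the label choices below are examples rather than the real project schema.
"""
import json
from pathlib import Path

# Example Label Studio labeling config: show the snippet as text and let
# annotators tag issue types plus leave a free-form review comment.
LABEL_CONFIG = """
<View>
  <Text name="code" value="$code"/>
  <Choices name="issue_type" toName="code" choice="multiple">
    <Choice value="inefficiency"/>
    <Choice value="bug"/>
    <Choice value="optimization_opportunity"/>
  </Choices>
  <TextArea name="comment" toName="code" placeholder="Describe the issue"/>
</View>
"""

def build_tasks(snippet_dir: str) -> list[dict]:
    """Wrap every snippet file in Label Studio's task import format."""
    tasks = []
    for path in sorted(Path(snippet_dir).glob("*.py")):
        tasks.append({
            "data": {
                "code": path.read_text(encoding="utf-8"),
                "source_file": path.name,  # kept for traceability
            }
        })
    return tasks

if __name__ == "__main__":
    tasks = build_tasks("snippets")
    # The resulting JSON file can be imported into a Label Studio project
    # that uses LABEL_CONFIG as its labeling interface.
    Path("tasks.json").write_text(json.dumps(tasks, indent=2))
    print(f"Prepared {len(tasks)} annotation tasks")
```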

Then, annotations went through several rounds of peer checking and refinement. Our software developers reviewed each other’s annotations, improving accuracy and consistency. We also introduced automated checks that quickly flagged questionable labels, speeding up the human reviewers’ work and optimizing overall costs.
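The write-up does not specify which automated checks were applied; one common approach is to flag tasks where independent reviewers disagree or where an issue is tagged without a rationale. The sketch below shows that idea on simplified records; the field names and flagging rules are assumptions.

```python
"""Illustrative consistency check: flag annotations that need a second look.

Assumes each record carries the issue types chosen by two independent
reviewers; the schema and the flagging rules are placeholders.
"""
from dataclasses import dataclass

@dataclass
class Annotation:
    task_id: int
    reviewer_a: frozenset[str]  # issue types chosen by the first reviewer
    reviewer_b: frozenset[str]  # issue types chosen by the second reviewer
    comment: str                # free-form rationale

def needs_review(ann: Annotation) -> bool:
    """Flag a task when reviewers disagree or the rationale is missing."""
    if ann.reviewer_a != ann.reviewer_b:
        return True                      # label disagreement
    if ann.reviewer_a and not ann.comment.strip():
        return True                      # issue tagged but no explanation
    return False

def flag_questionable(annotations: list[Annotation]) -> list[int]:
    """Return task IDs that should go back to a human reviewer."""
    return [a.task_id for a in annotations if needs_review(a)]

# Example usage with made-up data:
sample = [
    Annotation(1, frozenset({"bug"}), frozenset({"bug"}), "Off-by-one loop bound"),
    Annotation(2, frozenset({"bug"}), frozenset({"inefficiency"}), "Nested O(n^2) scan"),
    Annotation(3, frozenset({"optimization_opportunity"}), frozenset({"optimization_opportunity"}), ""),
]
print(flag_questionable(sample))  # -> [2, 3]
```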

Once refined, the high-quality, curated data went directly into the client’s existing model-training setup. Continuous human feedback loops allowed immediate tweaks during training, helping the model rapidly adapt to real-world legacy code complexities.
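The client's training pipeline is not described in detail. A minimal way to expose curated annotations to a PyTorch setup is a Dataset that yields (code, label) pairs for the existing fine-tuning loop to consume; the file name, record layout, and label set below are assumptions for illustration.

```python
"""Illustrative sketch: exposing curated annotations to a PyTorch trainer.

Assumes a hypothetical `curated.jsonl` file with {"code": ..., "labels": [...]}
records produced by the review pipeline; names and the label set are placeholders.
"""
import json
import torch
from torch.utils.data import Dataset, DataLoader

ISSUE_TYPES = ["inefficiency", "bug", "optimization_opportunity"]  # example taxonomy

class CuratedCodeDataset(Dataset):
    """Yields (code string, multi-hot label tensor) pairs for fine-tuning."""

    def __init__(self, path: str):
        with open(path, encoding="utf-8") as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        labels = torch.zeros(len(ISSUE_TYPES))
        for issue in rec["labels"]:
            labels[ISSUE_TYPES.index(issue)] = 1.0
        return rec["code"], labels

# A DataLoader like this would plug into the existing training loop, where the
# model's tokenizer converts the raw code strings into input tensors:
# loader = DataLoader(CuratedCodeDataset("curated.jsonl"), batch_size=8, shuffle=True)
```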

Results

  • Error detection improved by 35%. With higher-quality training data, the model became far more effective at spotting subtle coding issues.

  • Annotation inconsistencies dropped by 40%. Thorough peer reviews significantly cut down on labeling mistakes and variation between annotators.

  • Review time shortened by 25%. Combining automated checks with expert review sped up the annotation cycle, enabling faster model improvements.

  • Built a sustainable, repeatable framework. The methods formed a solid foundation for consistently improving future datasets and can be readily adapted to other legacy code projects.
