Optimizing Coding LLM Performance through Human Data Generation and Feedback

Industry: GenAI

Technology: Python, Jupyter Notebook, PyTorch, Label Studio

Location: USA

Client since: 2024

Client Overview

Our client is a top tech innovator known worldwide for mission-critical software. They wanted to upgrade their internally developed coding LLM, which handles automated code reviews and bug detection, to be more precise and capable of managing the complexities of a vast legacy codebase.

Business Challenge

The client's main challenge was that their existing coding LLM struggled to catch subtle issues and quirks buried within legacy code. The root cause was unreliable training data, mostly due to outdated or inconsistent annotations. Standard automatic labeling tools fell short because they couldn't grasp the intricate context unique to older systems, which led to several key problems.

  • The varying quality of annotations across different code samples led to inconsistent model training results.
  • The existing datasets didn’t sufficiently capture common issues or established best practices relevant to legacy software environments.
  • The lack of an effective, iterative human review process hindered the rapid identification and correction of annotation errors, delaying improvement efforts.

Solution

First, we assembled a dedicated group of experienced developers to comb through the legacy codebase. The team hand-picked over 10,000 relevant snippets, carefully labeling each piece to identify specific inefficiencies, common bugs, and potential optimizations.
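The case study does not describe how the snippets reached the annotators, but a common setup with Label Studio is to wrap each snippet in the tool's JSON task format and pair it with a labeling config. The sketch below is illustrative only; the snippets/ directory, file names, and issue categories are assumptions, not the client's actual taxonomy.

```python
"""Illustrative sketch: packaging legacy-code snippets as Label Studio tasks.

Assumes a hypothetical `snippets/` directory of extracted code fragments;
the label choices below are examples rather than the real project schema.
"""
import json
from pathlib import Path

# Example Label Studio labeling config: show the snippet as text and let
# annotators tag issue types plus leave a free-form review comment.
LABEL_CONFIG = """
<View>
  <Text name="code" value="$code"/>
  <Choices name="issue_type" toName="code" choice="multiple">
    <Choice value="inefficiency"/>
    <Choice value="bug"/>
    <Choice value="optimization_opportunity"/>
  </Choices>
  <TextArea name="comment" toName="code" placeholder="Describe the issue"/>
</View>
"""

def build_tasks(snippet_dir: str) -> list[dict]:
    """Wrap every snippet file in Label Studio's task import format."""
    tasks = []
    for path in sorted(Path(snippet_dir).glob("*.py")):
        tasks.append({
            "data": {
                "code": path.read_text(encoding="utf-8"),
                "source_file": path.name,  # kept for traceability
            }
        })
    return tasks

if __name__ == "__main__":
    tasks = build_tasks("snippets")
    # The resulting JSON file can be imported into a Label Studio project
    # that uses LABEL_CONFIG as its labeling interface.
    Path("tasks.json").write_text(json.dumps(tasks, indent=2))
    print(f"Prepared {len(tasks)} annotation tasks")
```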

Then, annotations went through several rounds of peer checking and refinement. Our software developers reviewed each other’s annotations, improving accuracy and consistency. We also introduced automated checks that quickly flagged questionable labels, speeding up the human reviewers’ work and optimizing overall costs.
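The write-up does not specify which automated checks were applied; one common approach is to flag tasks where independent reviewers disagree or where an issue is tagged without a rationale. The sketch below shows that idea on simplified records; the field names and flagging rules are assumptions.

```python
"""Illustrative consistency check: flag annotations that need a second look.

Assumes each record carries the issue types chosen by two independent
reviewers; the schema and the flagging rules are placeholders.
"""
from dataclasses import dataclass

@dataclass
class Annotation:
    task_id: int
    reviewer_a: frozenset[str]  # issue types chosen by the first reviewer
    reviewer_b: frozenset[str]  # issue types chosen by the second reviewer
    comment: str                # free-form rationale

def needs_review(ann: Annotation) -> bool:
    """Flag a task when reviewers disagree or the rationale is missing."""
    if ann.reviewer_a != ann.reviewer_b:
        return True                      # label disagreement
    if ann.reviewer_a and not ann.comment.strip():
        return True                      # issue tagged but no explanation
    return False

def flag_questionable(annotations: list[Annotation]) -> list[int]:
    """Return task IDs that should go back to a human reviewer."""
    return [a.task_id for a in annotations if needs_review(a)]

# Example usage with made-up data:
sample = [
    Annotation(1, frozenset({"bug"}), frozenset({"bug"}), "Off-by-one loop bound"),
    Annotation(2, frozenset({"bug"}), frozenset({"inefficiency"}), "Nested O(n^2) scan"),
    Annotation(3, frozenset({"optimization_opportunity"}), frozenset({"optimization_opportunity"}), ""),
]
print(flag_questionable(sample))  # -> [2, 3]
```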

Once refined, the high-quality, curated data went directly into the client’s existing model-training setup. Continuous human feedback loops allowed immediate tweaks during training, helping the model rapidly adapt to real-world legacy code complexities.
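The client's training pipeline is not described in detail. A minimal way to expose curated annotations to a PyTorch setup is a Dataset that yields (code, label) pairs for the existing fine-tuning loop to consume; the file name, record layout, and label set below are assumptions for illustration.

```python
"""Illustrative sketch: exposing curated annotations to a PyTorch trainer.

Assumes a hypothetical `curated.jsonl` file with {"code": ..., "labels": [...]}
records produced by the review pipeline; names and the label set are placeholders.
"""
import json
import torch
from torch.utils.data import Dataset, DataLoader

ISSUE_TYPES = ["inefficiency", "bug", "optimization_opportunity"]  # example taxonomy

class CuratedCodeDataset(Dataset):
    """Yields (code string, multi-hot label tensor) pairs for fine-tuning."""

    def __init__(self, path: str):
        with open(path, encoding="utf-8") as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        labels = torch.zeros(len(ISSUE_TYPES))
        for issue in rec["labels"]:
            labels[ISSUE_TYPES.index(issue)] = 1.0
        return rec["code"], labels

# A DataLoader like this would plug into the existing training loop, where the
# model's tokenizer converts the raw code strings into input tensors:
# loader = DataLoader(CuratedCodeDataset("curated.jsonl"), batch_size=8, shuffle=True)
```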

Results

  • Error detection improved by 35%. With higher-quality training data, the model became far more effective at spotting subtle coding issues.

  • Annotation inconsistencies dropped by 40%. Thorough peer reviews significantly cut down on labeling mistakes and variation between annotators.

  • Review time shortened by 25%. Combining automated checks with expert review sped up the annotation cycle, enabling faster model improvements.

  • Built a sustainable, repeatable framework. The methods formed a solid foundation for consistently improving future datasets and can be readily adapted to other legacy code projects.
