
A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples


Can carefully selected, tool-integrated demonstrations produce more effective software agents than vast collections of generic instructional data? Researchers from Shanghai Jiao Tong University and the SII Generative AI Research Lab (GAIR) introduce LIMI (“Less Is More for Agency”), a supervised fine-tuning technique that turns a base model into a proficient software and research agent using only 78 meticulously curated examples. LIMI achieves an average score of 73.5% on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), outperforming strong competitors such as GLM-4.5 (45.1), Qwen3-235B-A22B (27.5), Kimi-K2 (24.1), and DeepSeek-V3.1 (11.9). Remarkably, LIMI surpasses models trained on 10,000 samples while using 128 times less data.

Introducing the Agency Efficiency Paradigm

LIMI is grounded in the principle that agentic capability is more influenced by the quality and structure of training data than by sheer volume. The team fine-tuned GLM-4.5 and GLM-4.5-Air models on 78 extensive, tool-driven task trajectories, demonstrating significant improvements on AgencyBench and other generalization benchmarks such as TAU2-bench, EvalPlus-HE/MBPP, DS-1000, and SciCode.

Focused Supervision with Rich Trajectories

Each training trajectory, ranging from approximately 13,000 to 152,000 tokens (averaging 42,400 tokens), encapsulates comprehensive multi-turn workflows. These include the model’s reasoning processes, tool invocations, and environmental feedback, all recorded within the SII-CLI execution environment. The tasks cover interactive software development (termed “vibe coding”) and complex research workflows involving search, data analysis, and experimental design.

Methodology Overview

  • Base Models: The experiments utilize GLM-4.5 (355 billion parameters) and GLM-4.5-Air (106 billion parameters), fine-tuned using the slime supervised fine-tuning framework with consistent configurations to isolate the impact of data quality.
  • Data Collection: The dataset comprises 60 authentic queries from industry practitioners and 18 synthesized queries derived from highly rated GitHub pull requests, all rigorously validated by PhD-level annotators. For each query, the full agent trajectory leading to task completion is logged within the SII-CLI environment.
  • Evaluation Metrics: Performance is assessed on AgencyBench with three rounds of evaluation, measuring FTFC, SR@3, and RC@3. Additional testing includes generalization benchmarks such as TAU2-airline/retail Pass^4, EvalPlus HE/MBPP, DS-1000, and SciCode.
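AgencyBench’s precise metric definitions are not spelled out in this article. As a rough illustration of what round-based metrics like FTFC and SR@3 plausibly measure (first-attempt success and success within three rounds, respectively), the helpers below compute both from per-task outcome lists; the function names and exact logic are assumptions, not the benchmark’s official scoring code:

```python
def first_try_rate(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks solved on the first round (an FTFC-style metric)."""
    return sum(rounds[0] for rounds in outcomes) / len(outcomes)

def success_at_k(outcomes: list[list[bool]], k: int = 3) -> float:
    """Fraction of tasks solved within the first k rounds (an SR@k-style metric)."""
    return sum(any(rounds[:k]) for rounds in outcomes) / len(outcomes)

# Each inner list holds per-round pass/fail flags for one task, up to three rounds.
outcomes = [
    [True, True, True],     # solved immediately
    [False, True, True],    # solved on round 2
    [False, False, False],  # never solved
    [False, False, True],   # solved on round 3
]
print(first_try_rate(outcomes))   # → 0.25
print(success_at_k(outcomes, 3))  # → 0.75
```

The gap between the two numbers captures how much a model recovers through iterative refinement, which is why the paper reports first-turn and within-three-rounds scores separately.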

Performance Highlights

  • AgencyBench Results: LIMI achieves a 73.5% average score, marking a +28.4 point improvement over the base GLM-4.5. Specifically, FTFC rises to 71.7% from 37.8%, and SR@3 climbs to 74.6% from 47.4%.
  • Data Efficiency: Despite using only 78 samples, LIMI outperforms GLM-4.5 models trained on the AFM-CodeAgent SFT dataset of 10,000 samples (73.5% vs. 47.8%), a 25.7-point absolute gain (53.7% relative improvement) with 128 times less data. Similar advantages hold against AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples).
  • Robust Generalization: LIMI maintains strong performance across diverse domains including tool use, coding, and scientific computing, averaging around 57%. Even without access to external tools, LIMI slightly outperforms GLM-4.5 (50.0% vs. 48.7%), indicating inherent improvements beyond environment interaction.

Essential Insights

  1. Prioritizing Data Quality Over Quantity: LIMI’s curated, long-horizon trajectories enable it to achieve superior results with dramatically fewer samples compared to traditional large-scale fine-tuning approaches.
  2. Comprehensive Workflow Representation: The training data’s depth (capturing multi-turn interactions, tool orchestration, and environment feedback) facilitates nuanced agent behavior in complex software and research tasks.
  3. Consistent Gains Across Metrics: LIMI demonstrates substantial improvements in FTFC, SR@3, and RC@3 on AgencyBench, alongside strong generalization to external benchmarks.
  4. Scalability Across Model Sizes: The approach proves effective for both large (355B) and medium (106B) parameter models, underscoring its adaptability.

Final Thoughts

This study showcases how a small set of expertly curated, tool-grounded trajectories can dramatically enhance the capabilities of AI agents in software engineering and research domains. By focusing on rich, multi-turn workflows within a command-line interface environment, LIMI achieves state-of-the-art performance with a fraction of the data typically required. These findings suggest a promising direction for developing efficient, high-performing AI agents that leverage quality over quantity in training data.
