
Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-based Training of LLMs for Any AI Agent


How can real agent execution logs be transformed into reinforcement learning (RL) transitions to improve policy large language models (LLMs) without overhauling your current agent infrastructure? The Microsoft AI team introduces Agent Lightning, an open-source framework designed to make RL training work with any AI agent. By decoupling training from runtime, standardizing trace formats, and implementing LightningRL (a hierarchical approach that converts intricate agent behaviors into RL-compatible transitions), Agent Lightning enables optimization using conventional single-turn RL trainers without rewriting existing agents.

Transforming Agent Behavior into Reinforcement Learning Transitions

Agent Lightning conceptualizes an AI agent as a decision-making process modeled by a partially observable Markov decision process (POMDP). Here, the agent’s observation corresponds to the current input fed into the policy LLM, the action represents the model invocation, and rewards can be either intermediate or terminal. During each agent run, the framework isolates only the policy model’s calls, capturing their inputs, outputs, and associated rewards. This filtration removes extraneous framework noise, producing clean, high-quality transitions ideal for RL training.
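The filtering step can be sketched as follows. This is an illustrative sketch, not the Agent Lightning API: the trace format, field names, and `extract_transitions` helper are assumptions, but they capture the idea that only the policy model's calls survive as (observation, action, reward) transitions.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    observation: str  # input fed to the policy LLM at this step
    action: str       # the model's generated output
    reward: float     # intermediate or terminal reward

def extract_transitions(trace, policy_model="policy-llm"):
    """Keep only the policy model's calls; drop tool/framework spans."""
    return [
        Transition(span["input"], span["output"], span.get("reward", 0.0))
        for span in trace
        if span.get("model") == policy_model
    ]

trace = [
    {"model": "policy-llm", "input": "Write SQL for ...", "output": "SELECT ...", "reward": 0.0},
    {"tool": "sql_executor", "input": "SELECT ...", "output": "rows"},  # framework noise, filtered out
    {"model": "policy-llm", "input": "Fix the query ...", "output": "SELECT ... WHERE ...", "reward": 1.0},
]
transitions = extract_transitions(trace)  # two clean policy transitions
```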

LightningRL further enhances this by performing credit assignment over multi-step episodes, then optimizing the policy using single-turn RL objectives. This approach is compatible with popular RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), commonly implemented in trainers such as VeRL, facilitating straightforward integration into existing workflows.
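The credit-assignment idea can be illustrated in a few lines. The uniform discounting below is an assumption for illustration (LightningRL's actual scheme is hierarchical); the point is that an episode-level reward is pushed back onto each policy call, so each (prompt, response) pair becomes an independent single-turn training example.

```python
def assign_credit(steps, terminal_reward, gamma=1.0):
    """Spread a terminal reward back over an episode's policy calls.

    Each (prompt, response) pair receives a discounted return, yielding
    single-turn examples a standard PPO/GRPO trainer can consume.
    """
    examples = []
    ret = terminal_reward
    for prompt, response in reversed(steps):
        examples.append({"prompt": prompt, "response": response, "return": ret})
        ret *= gamma
    return list(reversed(examples))

episode = [("plan the query", "use a join"), ("write the SQL", "SELECT ...")]
batch = assign_credit(episode, terminal_reward=1.0, gamma=0.9)
# later steps sit closer to the terminal reward; earlier ones are discounted
```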

Architecture Designed for Scalable Multi-Agent Training

At its core, Agent Lightning employs a Training Agent Disaggregation architecture. The system splits responsibilities between a Lightning server, which handles training and model serving via an OpenAI-compatible API, and a Lightning client, which operates within the agent’s native runtime environment. The client captures detailed traces of prompts, tool invocations, and reward signals, streaming this telemetry back to the server. This design keeps critical dependencies (browsers, shells, and tools) close to production environments, while computationally intensive GPU training is centralized on the server side.
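From the agent's point of view, the disaggregation is nearly invisible: the runtime simply points its model calls at the server's OpenAI-compatible endpoint. The sketch below shows that shape; the URL, model name, and payload fields are assumptions for illustration, not the framework's actual configuration.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical Lightning server endpoint

def build_payload(messages, model="policy-llm"):
    """OpenAI-compatible chat-completions request body."""
    return {"model": model, "messages": messages}

def call_policy(messages, model="policy-llm"):
    """Send the agent's model call to the training server, which serves
    the latest policy weights and logs the resulting trace."""
    data = json.dumps(build_payload(messages, model)).encode()
    req = urllib.request.Request(
        SERVER_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the client speaks the standard chat-completions protocol, swapping a production model endpoint for the Lightning server is a one-line base-URL change in most agent frameworks.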

The runtime supports two tracing mechanisms: a default path leveraging OpenTelemetry spans for seamless integration with standard telemetry collectors, and a lightweight embedded tracer for teams preferring minimal deployment overhead. Both methods funnel data into a unified store, ensuring consistent training inputs regardless of tracing strategy.
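The lightweight embedded path can be pictured as an in-process span recorder. The class below is an assumption about its shape (not the actual Agent Lightning tracer): each model or tool call is captured as a span with attributes and timing, and appended to the same store the OpenTelemetry path would feed.

```python
import time
from contextlib import contextmanager

class EmbeddedTracer:
    """Minimal in-process tracer: records one span dict per call."""

    def __init__(self):
        self.spans = []  # unified store both tracing paths feed into

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": attributes, "start": time.time()}
        try:
            yield record
        finally:
            record["end"] = time.time()
            self.spans.append(record)

tracer = EmbeddedTracer()
with tracer.span("llm_call", model="policy-llm") as s:
    s["attributes"]["output"] = "SELECT ..."  # enrich the span mid-call
# tracer.spans now holds one completed span with attributes and timing
```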

Unified Data Interface for Multi-Agent Optimization

Agent Lightning records every model and tool call as a span enriched with inputs, outputs, and metadata. These spans are algorithmically transformed into ordered triplets consisting of prompt, response, and reward. This selective extraction enables targeted optimization of individual agents within complex multi-agent workflows, or simultaneous training of multiple agents, without modifying orchestration logic. Additionally, the same trace data can be repurposed for automated prompt tuning or supervised fine-tuning, broadening its utility.
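A sketch of the span-to-triplet conversion, with field names assumed for illustration: filtering on the agent name is what lets a single agent in a multi-agent workflow be optimized while the orchestration logic stays untouched.

```python
def spans_to_triplets(spans, agent="writer"):
    """Select one agent's spans and emit ordered (prompt, response, reward) triplets."""
    return [
        (s["prompt"], s["response"], s["reward"])
        for s in sorted(spans, key=lambda s: s["step"])
        if s["agent"] == agent
    ]

spans = [
    {"agent": "writer",  "step": 0, "prompt": "q1", "response": "a1", "reward": 0.0},
    {"agent": "checker", "step": 1, "prompt": "c1", "response": "ok", "reward": 0.0},
    {"agent": "writer",  "step": 2, "prompt": "q2", "response": "a2", "reward": 1.0},
]
triplets = spans_to_triplets(spans)  # only the writer's two calls, in order
```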

Empirical Validation Across Diverse Benchmarks

The framework’s efficacy has been demonstrated on three distinct tasks:

  • Text-to-SQL Generation: Utilizing the Spider benchmark, which includes over 10,000 questions spanning 200 databases across 138 domains, the team employed Llama 3.2 3B Instruct as the policy model. The implementation leveraged LangChain with a writer agent, a rewriter agent, and a checker agent. Both the writer and rewriter were optimized via RL, while the checker remained static. Training yielded consistent reward improvements and enhanced test-time performance.
  • Retrieval-Augmented Generation (RAG): Using the MuSiQue benchmark alongside a Wikipedia-scale index containing approximately 21 million documents, the retriever employed BGE embeddings with cosine similarity. The agent was built on the OpenAI Agents SDK. Rewards combined format adherence and F1 correctness scores, with training showing stable gains in both training and evaluation phases using the same base model.
  • Mathematical Question Answering with Tool Use: Implemented via AutoGen, this agent integrated a calculator tool to solve problems from the Calc-X dataset. Again, Llama 3.2 3B Instruct served as the base model. Training enhanced the agent’s proficiency in correctly invoking tools and incorporating their outputs into final answers.
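The RAG experiment's reward, combining format adherence with F1 correctness, can be sketched as below. The weights, the `Answer:` format rule, and the exact scoring are assumptions for illustration; only the combination of a format term and a token-level F1 term comes from the source.

```python
def token_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer."""
    p, g = prediction.split(), gold.split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def reward(output, gold, answer_tag="Answer:"):
    """Combined reward: format adherence plus answer correctness (weights assumed)."""
    fmt = 1.0 if output.strip().startswith(answer_tag) else 0.0
    answer = output.split(answer_tag, 1)[-1].strip()
    return 0.2 * fmt + 0.8 * token_f1(answer, gold)
```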

Core Advantages and Insights

  1. Seamless Integration: Thanks to Training Agent Disaggregation and a standardized trace interface, Agent Lightning connects effortlessly with existing agents built on platforms like LangChain, OpenAI Agents SDK, AutoGen, or CrewAI, requiring minimal to no code modifications.
  2. Efficient Transition Conversion: LightningRL translates complex agent trajectories into RL transitions, applying credit assignment over multi-step interactions before optimizing policies with single-turn RL algorithms such as PPO or GRPO.
  3. Automatic Intermediate Rewarding (AIR): AIR generates dense feedback by converting system signals, such as tool execution statuses, into intermediate rewards, effectively mitigating the sparse reward problem common in extended workflows.
  4. Robust Benchmark Performance: The framework’s validation on Spider, MuSiQue, and Calc-X datasets, all using Llama 3.2 3B Instruct as the foundation, underscores its versatility across diverse AI tasks.
  5. Scalable and Production-Friendly Runtime: By capturing traces through OpenTelemetry and streaming them to a centralized training server, Agent Lightning supports scalable model updates via an OpenAI-compatible endpoint without relocating tools or disrupting production environments.
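The AIR mechanism from point 3 can be illustrated with a small mapping from system events to rewards. The specific values and event fields are assumptions; the point is that tool-level signals give every step of a long episode some feedback instead of a single sparse terminal reward.

```python
def intermediate_reward(event):
    """Map a system signal (here, tool execution status) to a dense reward."""
    if event.get("type") != "tool_call":
        return 0.0
    return 0.1 if event.get("status") == "success" else -0.1

episode = [
    {"type": "tool_call", "status": "success"},
    {"type": "llm_call"},
    {"type": "tool_call", "status": "error"},
]
rewards = [intermediate_reward(e) for e in episode]  # [0.1, 0.0, -0.1]
```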

Final Thoughts

Agent Lightning offers a pragmatic solution bridging the gap between agent execution and reinforcement learning, eliminating the need for extensive framework rewrites. By formalizing agent interactions as partially observable Markov decision processes and introducing LightningRL for effective credit assignment, it enables existing single-turn RL trainers to optimize complex agent behaviors. The Training Agent Disaggregation model preserves existing infrastructure by separating runtime and training concerns, while Automatic Intermediate Rewarding enriches feedback signals to accelerate learning in lengthy workflows. Overall, Agent Lightning presents a streamlined, low-integration pathway for agents to learn directly from their operational traces, marking a significant step forward in scalable AI agent training.
