
How to Build a Model-Native Agent That Learns Internal Planning, Memory, and Multi-Tool Reasoning Through End-to-End Reinforcement Learning


Integrating Planning, Memory, and Tool Use in a Unified Neural Agent

This guide shows how an autonomous agent can internalize complex cognitive functions, such as planning, memory retention, and tool use, within a single neural framework, eliminating the need for external control mechanisms. We build a streamlined, model-native agent that masters arithmetic reasoning tasks through reinforcement learning. By combining a stage-aware actor-critic architecture with a curriculum of progressively harder problem environments, the agent learns to apply internal “tools” and short-term memory to solve problems end-to-end. This stepwise approach reveals the agent’s evolution from basic reasoning to multi-step compositional strategies.

Constructing a Symbolic Environment for Internal Tool Use

Our first step involves creating a synthetic environment where the agent interacts with symbolic tools representing arithmetic operations like multiplication, addition, and subtraction. Each action corresponds to an internal tool the agent can invoke. This setup simulates reasoning challenges requiring the agent to plan sequences of tool applications to derive correct answers. The environment supports three stages of increasing complexity, encouraging the agent to develop compositional reasoning skills.

import math, random, torch, torch.nn as nn, torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
random.seed(0)

V = 18  # Vocabulary size: digits 0-9 plus eight special tokens
CTX = 10
MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17

tok2str = {**{i: str(i) for i in range(10)},
           CTX: "[CTX]", MUL: "[MUL]", ADD: "[ADD]", SUB: "[SUB]",
           ANS: "[ANS]", STO: "[STO]", RCL: "[RCL]", EOS: "[EOS]"}

class ToolEnv:
    def __init__(self, max_steps=7):
        self.max_steps = max_steps

    def sample(self, stage):
        a, b, c, d, e = [random.randint(0, 9) for _ in range(5)]
        if stage == 0:
            ctx = [a, b, c]
            target = a * b + c
        elif stage == 1:
            ctx = [a, b, c, d]
            target = (a * b + c) - d
        else:
            ctx = [a, b, c, d, e]
            target = (a * b + c) - (d * e)
        return ctx, target, (a, b, c, d, e)

    def stepseq(self, actions, abc, stage):
        a, b, c, d, e = abc
        last = None   # running result register
        mem = None    # one-slot memory flag
        steps = 0
        shaped = 0.0

        # Intermediate milestones used for reward shaping.
        goal0 = a * b
        goal1 = goal0 + c
        goal2 = goal1 - d
        goal3 = d * e
        goal4 = goal1 - goal3

        for act in actions:
            steps += 1
            if act == MUL:
                last = a * b if last is None else last * (d if stage > 0 else 1)
            elif act == ADD and last is not None:
                last += c
            elif act == SUB and last is not None:
                last -= e if (stage == 2 and mem == "used") else (d if stage > 0 else 0)
            elif act == STO:
                mem = "used" if stage >= 1 else "ok"
            elif act == RCL and mem is not None:
                last = (d * e) if (stage == 2 and mem == "used") else (last if last else 0)
            elif act == ANS:
                target = [goal1, goal2, goal4][stage]  # stage-specific final target
                correct = (last == target)
                if stage == 0:
                    shaped += 0.25 * (last == goal0) + 0.5 * (last == goal1)
                elif stage == 1:
                    shaped += 0.25 * (last == goal0) + 0.5 * (last == goal1) + 0.75 * (last == goal2)
                elif stage == 2:
                    shaped += 0.2 * (last == goal0) + 0.4 * (last == goal1) + 0.6 * (last == goal4) + 0.6 * (last == goal3)
                return (1.0 if correct else 0.0) + 0.2 * shaped, steps
            if steps >= self.max_steps:
                break
        return 0.0, steps
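To make the reward scheme concrete, here is a standalone sketch of how partial credit accumulates for a stage-0 episode. It is independent of the environment class above, and `shaped_reward` is a hypothetical helper introduced only for illustration; the constants mirror the stage-0 shaping weights.

```python
# Simplified, self-contained sketch of stage-0 reward shaping:
# full reward for the final answer a*b + c, plus partial credit
# for reaching the intermediate milestones a*b and a*b + c.
def shaped_reward(result, a, b, c):
    goal0 = a * b        # milestone after [MUL]
    goal1 = goal0 + c    # milestone (and target) after [MUL][ADD]
    shaped = 0.25 * (result == goal0) + 0.5 * (result == goal1)
    return (1.0 if result == goal1 else 0.0) + 0.2 * shaped

# A correct [MUL][ADD][ANS] trajectory for a=3, b=4, c=5 produces 17.
print(shaped_reward(17, 3, 4, 5))   # full reward plus shaping bonus: 1.1
print(shaped_reward(12, 3, 4, 5))   # stopped at the first milestone: 0.05
```

The graded bonus is what lets a randomly exploring policy receive learning signal before it ever produces a fully correct answer.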

Designing a Stage-Aware Actor-Critic Model for Reasoning

Next, we implement a policy network based on an actor-critic paradigm, utilizing a gated recurrent unit (GRU) to capture temporal dependencies. The model embeds both input tokens and task stages, enabling it to adjust its reasoning depth dynamically according to the complexity of the current task. This architecture empowers the agent to learn when and how to invoke internal tools within a cohesive neural framework.

class ActorCritic(nn.Module):
    def __init__(self, V, d=96, nstage=3):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.stage_emb = nn.Embedding(nstage, d)
        self.rnn = nn.GRU(d, d, 1, batch_first=True)
        self.pi = nn.Linear(d, V)   # policy head
        self.v = nn.Linear(d, 1)    # value head

    def forward(self, ctx, stage, maxlen=6, greedy=False):
        B = ctx.shape[0]
        ce = self.emb(ctx) + self.stage_emb(stage).unsqueeze(1)  # (B, L, d)
        h = torch.tanh(ce.mean(1)).unsqueeze(0)  # pooled context as GRU initial state
        inp = self.emb(torch.full((B, 1), CTX, device=device))
        acts, logps, ents, vals = [], [], [], []

        for _ in range(maxlen):
            out, h = self.rnn(inp, h)
            val = self.v(out[:, -1])
            logits = self.pi(out[:, -1])
            pi = F.softmax(logits, dim=-1)
            ent = -(pi * torch.log(pi + 1e-9)).sum(1)  # per-step policy entropy
            a = torch.argmax(logits, 1) if greedy else torch.distributions.Categorical(pi).sample()
            logp = F.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
            inp = self.emb(a.unsqueeze(1))  # feed the chosen action back in
            acts.append(a)
            logps.append(logp)
            ents.append(ent)
            vals.append(val.squeeze(1))

        return torch.stack(acts, 1), torch.stack(logps, 1), torch.stack(ents, 1), torch.stack(vals, 1)
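The action-selection step inside the loop can be illustrated without PyTorch. The sketch below, using only the standard library and a hypothetical `policy_step` helper, shows how raw logits become a probability distribution, the entropy term used for exploration, and a greedy choice:

```python
import math

# Turn policy logits into probabilities (softmax), compute the entropy
# the training loss regularizes, and pick the greedy (argmax) action.
def policy_step(logits):
    m = max(logits)                                  # stabilize the exponentials
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p + 1e-9) for p in probs)
    greedy_action = probs.index(max(probs))
    return probs, entropy, greedy_action

probs, ent, act = policy_step([2.0, 0.5, 0.1])
print(act)   # index of the largest logit: 0
```

During training the model samples from `probs` instead of taking the argmax, which is what keeps exploration alive; greedy decoding is reserved for evaluation.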

Training the Agent with Advantage Actor-Critic Reinforcement Learning

We train the agent using an advantage actor-critic (A2C) algorithm, optimizing both policy and value networks simultaneously. The training loop processes batches of synthetic problems, applying entropy regularization to encourage exploration and avoid premature convergence. This end-to-end learning framework allows the agent to refine its internal planning and tool-use strategies progressively.

env = ToolEnv()
net = ActorCritic(V).to(device)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

def padbatch(ctxs):
    L = max(len(c) + 1 for c in ctxs)
    out = torch.full((len(ctxs), L), EOS, dtype=torch.long, device=device)
    for i, c in enumerate(ctxs):
        out[i, :len(c) + 1] = torch.tensor(c + [CTX], device=device)
    return out

def runbatch(stage, batch=128, train=True, greedy=False):
    ctxs = []
    metas = []
    for _ in range(batch):
        c, t, abc = env.sample(stage)
        ctxs.append(c)
        metas.append((t, abc))
    ctx = padbatch(ctxs)
    staget = torch.full((batch,), stage, device=device, dtype=torch.long)
    acts, logps, ents, vals = net(ctx, staget, maxlen=6, greedy=greedy)
    rewards = []
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r, _ = env.stepseq(traj, abc, stage)
        rewards.append(r)
    R = torch.tensor(rewards, device=device).float()
    adv = (R - vals.sum(1)).detach()  # advantage against the summed value baseline
    if not train:
        return R.mean().item(), 0.0
    pg = -(logps.sum(1) * adv).mean()   # policy-gradient term
    vloss = F.mse_loss(vals.sum(1), R)  # value-regression term
    ent = -ents.mean()                  # negative entropy (entropy bonus when minimized)
    loss = pg + 0.5 * vloss + 0.01 * ent
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    opt.step()
    return R.mean().item(), loss.item()
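For intuition, the loss combination can be traced by hand on a single scalar trajectory. This is a standalone arithmetic sketch, not a call into the network; the values are invented, but the weighting mirrors `pg + 0.5 * vloss + 0.01 * ent`:

```python
# Hand-computed A2C loss for one trajectory.
reward = 1.0    # terminal reward for a correct answer
value = 0.6     # critic's estimate for the trajectory
logp = -1.2     # total log-probability of the sampled actions
entropy = 1.5   # policy entropy over the steps

adv = reward - value                  # advantage: 0.4
pg = -(logp * adv)                    # policy-gradient term: 0.48
value_loss = (value - reward) ** 2    # squared error: 0.16
loss = pg + 0.5 * value_loss + 0.01 * (-entropy)
print(round(loss, 4))   # 0.545
```

Note the sign of the entropy term: subtracting entropy from the loss rewards higher-entropy policies, which is the exploration pressure the text refers to.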

Curriculum Learning: Gradually Increasing Task Complexity

To facilitate effective learning, we employ a curriculum that gradually escalates task difficulty. The agent begins with simpler arithmetic problems and progressively tackles more complex multi-step reasoning tasks. Periodic evaluations across all stages monitor the agent’s generalization capabilities and improvement in internal planning.

print("Starting training...")
stages = [0, 0, 0, 1, 1, 2]  # curriculum schedule: ten epochs per slot

for ep in range(1, 61):
    stage = stages[min((ep - 1) // 10, len(stages) - 1)]
    acc, loss = runbatch(stage, batch=192, train=True)
    if ep % 5 == 0:
        with torch.no_grad():
            evals = [runbatch(s, train=False, greedy=True)[0] for s in [0, 1, 2]]
        print(f"Epoch {ep:02d} | Stage {stage} | Training Accuracy: {acc:.3f} | "
              f"Eval T0: {evals[0]:.3f} | T1: {evals[1]:.3f} | T2: {evals[2]:.3f} | Loss: {loss:.3f}")
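The schedule indexing in the loop determines which stage each epoch trains on. A quick standalone check of the same formula (wrapped in a hypothetical `stage_for_epoch` helper) makes the curriculum explicit:

```python
# Reproduce the curriculum lookup: ten epochs per slot in the schedule,
# clamped to the last slot once the list is exhausted.
stages = [0, 0, 0, 1, 1, 2]

def stage_for_epoch(ep):
    return stages[min((ep - 1) // 10, len(stages) - 1)]

print(stage_for_epoch(1))    # epochs 1-30 train stage 0
print(stage_for_epoch(31))   # epochs 31-50 train stage 1
print(stage_for_epoch(51))   # epochs 51-60 train stage 2
```

Front-loading three slots of stage 0 gives the agent a stable base policy before the harder compositional stages are introduced.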

Analyzing Agent Behavior and Performance

After training, we examine the agent’s decision-making by generating example reasoning sequences. These sequences reveal the internal tool tokens selected by the model and confirm whether the agent arrives at the correct solution. We also report final evaluation accuracies, demonstrating the agent’s successful integration of planning, memory, and reasoning within a single neural system.

def explain(stage):
    c, t, abc = env.sample(stage)
    ctx = padbatch([c])
    staget = torch.tensor([stage], device=device)
    with torch.no_grad():
        a, _, _, _ = net(ctx, staget, greedy=True)
    seq = [tok2str[x] for x in a[0].tolist()]
    r, _ = env.stepseq(a[0].tolist(), abc, stage)
    return dict(stage=stage, context=c, target=t, actions=" ".join(seq), reward=round(float(r), 2))

with torch.no_grad():
    for s in [0, 1, 2]:
        print(f"Stage {s} sample reasoning:")
        for _ in range(5):
            print(explain(s))

with torch.no_grad():
    finals = [runbatch(s, train=False, greedy=True, batch=1000)[0] for s in [0, 1, 2]]
print(f"Final greedy accuracies → Stage 0: {finals[0]:.3f}, Stage 1: {finals[1]:.3f}, Stage 2: {finals[2]:.3f}")

Summary: Towards End-to-End Learned Reasoning Agents

This exploration demonstrates that neural networks can internalize complex cognitive functions such as planning, memory management, and tool use when guided by reinforcement learning signals. Moving beyond traditional modular pipelines that separate memory, planning, and execution, our model-native agent learns these components as intertwined dynamics within a single architecture. This paradigm shift in agent design highlights the potential for emergent reasoning and autonomous decision-making without handcrafted control logic, paving the way for more adaptive and intelligent AI systems.
