Integrating Planning, Memory, and Tool Use in a Unified Neural Agent
This guide examines how an autonomous agent can internalize complex cognitive functions (planning, memory retention, and tool use) within a single neural framework, eliminating the need for external control mechanisms. We develop a streamlined, model-native agent that masters arithmetic reasoning tasks through reinforcement learning. By combining a stage-aware actor-critic architecture with a curriculum of progressively harder problem environments, the agent learns to employ internal “tools” and short-term memory effectively to solve problems end-to-end. This stepwise approach reveals the agent’s evolution from basic reasoning to sophisticated multi-step compositional strategies.
Constructing a Symbolic Environment for Internal Tool Use
Our first step involves creating a synthetic environment where the agent interacts with symbolic tools representing arithmetic operations like multiplication, addition, and subtraction. Each action corresponds to an internal tool the agent can invoke. This setup simulates reasoning challenges requiring the agent to plan sequences of tool applications to derive correct answers. The environment supports three stages of increasing complexity, encouraging the agent to develop compositional reasoning skills.
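Before the full environment, the core mechanic — a sequence of tool tokens evaluated against an arithmetic target — can be sketched in plain Python. This is a simplified stand-in (the helper `run_tools` is hypothetical and covers only the stage-0 tools):

```python
MUL, ADD, ANS = "[MUL]", "[ADD]", "[ANS]"

def run_tools(actions, a, b, c):
    """Apply symbolic tools in order; return the answered value, or None."""
    last = None
    for act in actions:
        if act == MUL:
            last = a * b            # multiply tool seeds the accumulator
        elif act == ADD and last is not None:
            last += c               # add tool extends a prior result
        elif act == ANS:
            return last             # answer tool commits the current value
    return None

# The correct stage-0 plan is multiply, then add, then answer:
print(run_tools([MUL, ADD, ANS], 3, 4, 5))  # 17
```

Answering before multiplying returns `None`, which is exactly the kind of ordering constraint the agent must learn to plan around.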
```python
import math, random, torch, torch.nn as nn, torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
random.seed(0)

V = 18    # vocabulary size: digits 0-9 plus eight special tokens
CTX = 10  # separator marking the end of the context
MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17
tok2str = {**{i: str(i) for i in range(10)},
           CTX: "[CTX]", MUL: "[MUL]", ADD: "[ADD]", SUB: "[SUB]",
           ANS: "[ANS]", STO: "[STO]", RCL: "[RCL]", EOS: "[EOS]"}
```
```python
class ToolEnv:
    def __init__(self, maxsteps=7):
        self.maxsteps = maxsteps

    def sample(self, stage):
        a, b, c, d, e = [random.randint(0, 9) for _ in range(5)]
        if stage == 0:
            ctx = [a, b, c]
            target = a * b + c
        elif stage == 1:
            ctx = [a, b, c, d]
            target = (a * b + c) - d
        else:
            ctx = [a, b, c, d, e]
            target = (a * b + c) - (d * e)
        return ctx, target, (a, b, c, d, e)

    def stepseq(self, actions, abc, stage):
        a, b, c, d, e = abc
        last, mem, steps, shaped = None, None, 0, 0.0
        # Intermediate milestones used for reward shaping.
        goal0 = a * b
        goal1 = goal0 + c
        goal2 = goal1 - d
        goal3 = d * e
        goal4 = goal1 - goal3
        for act in actions:
            steps += 1
            if act == MUL:
                last = a * b if last is None else last * (d if stage > 0 else 1)
            elif act == ADD and last is not None:
                last += c
            elif act == SUB and last is not None:
                last -= e if (stage == 2 and mem == "used") else (d if stage > 0 else 0)
            elif act == STO:
                mem = "used" if stage >= 1 else "ok"
            elif act == RCL and mem is not None:
                last = (d * e) if (stage == 2 and mem == "used") else (last if last else 0)
            elif act == ANS:
                target = [goal1, goal2, goal4][stage]  # final target per stage, matching sample()
                correct = (last == target)
                if stage == 0:
                    shaped += 0.25 * (last == goal0) + 0.5 * (last == goal1)
                elif stage == 1:
                    shaped += 0.25 * (last == goal0) + 0.5 * (last == goal1) + 0.75 * (last == goal2)
                else:
                    shaped += 0.2 * (last == goal0) + 0.4 * (last == goal1) + 0.6 * (last == goal4) + 0.6 * (last == goal3)
                return (1.0 if correct else 0.0) + 0.2 * shaped, steps
            if steps >= self.maxsteps:
                break
        return 0.0, steps
```
Designing a Stage-Aware Actor-Critic Model for Reasoning
Next, we implement a policy network based on an actor-critic paradigm, utilizing a gated recurrent unit (GRU) to capture temporal dependencies. The model embeds both input tokens and task stages, enabling it to adjust its reasoning depth dynamically according to the complexity of the current task. This architecture empowers the agent to learn when and how to invoke internal tools within a cohesive neural framework.
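At each decoding step, the policy head turns logits into a categorical distribution, computes its entropy, and records the log-probability of the chosen action. A pure-Python sketch of that per-step math (stand-ins for the torch ops, with toy logits):

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Small epsilon guards log(0), mirroring the 1e-9 term in the model.
    return -sum(p * math.log(p + 1e-9) for p in probs)

logits = [0.0, 0.0, 2.0]                     # toy logits over three actions
probs = softmax(logits)
logp_a2 = math.log(probs[2])                 # log-prob if action 2 is sampled
```

Greedy decoding would pick the argmax (action 2 here); sampling draws from `probs`, and the entropy term keeps that distribution from collapsing too early.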
```python
class ActorCritic(nn.Module):
    def __init__(self, V, d=96, nstage=3):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.stageemb = nn.Embedding(nstage, d)
        self.rnn = nn.GRU(d, d, 1, batch_first=True)
        self.pi = nn.Linear(d, V)  # actor head: logits over the token vocabulary
        self.v = nn.Linear(d, 1)   # critic head: per-step value estimate

    def forward(self, ctx, stage, maxlen=6, greedy=False):
        B = ctx.shape[0]
        # Add the stage embedding to every context-token embedding,
        # then mean-pool over the sequence to initialize the GRU state.
        ce = self.emb(ctx) + self.stageemb(stage).unsqueeze(1)
        h = torch.tanh(ce.mean(1)).unsqueeze(0)
        inp = self.emb(torch.full((B, 1), CTX, device=device))
        acts, logps, ents, vals = [], [], [], []
        for _ in range(maxlen):
            out, h = self.rnn(inp, h)
            val = self.v(out[:, -1])
            logits = self.pi(out[:, -1])
            pi = F.log_softmax(logits, dim=-1).exp()
            ent = -(pi * torch.log(pi + 1e-9)).sum(1)
            a = torch.argmax(logits, 1) if greedy else torch.distributions.Categorical(pi).sample()
            logp = F.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
            inp = self.emb(a.unsqueeze(1))
            acts.append(a)
            logps.append(logp)
            ents.append(ent)
            vals.append(val.squeeze(1))
        return torch.stack(acts, 1), torch.stack(logps, 1), torch.stack(ents, 1), torch.stack(vals, 1)
```
Training the Agent with Advantage Actor-Critic Reinforcement Learning
We train the agent using an advantage actor-critic (A2C) algorithm, optimizing both policy and value networks simultaneously. The training loop processes batches of synthetic problems, applying entropy regularization to encourage exploration and avoid premature convergence. This end-to-end learning framework allows the agent to refine its internal planning and tool-use strategies progressively.
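The A2C objective combines three scalar terms: a policy-gradient term weighted by the advantage, a squared-error critic loss, and an entropy bonus. A pure-Python sketch with made-up numbers (`a2c_loss` is a hypothetical helper, not part of the training code below):

```python
def a2c_loss(logp_sum, value_sum, ret, ent, v_coef=0.5, ent_coef=0.01):
    adv = ret - value_sum               # advantage: return minus the critic's estimate
    pg = -logp_sum * adv                # policy-gradient surrogate (adv treated as a constant)
    v_loss = (value_sum - ret) ** 2     # squared-error critic loss
    return pg + v_coef * v_loss - ent_coef * ent  # entropy bonus is subtracted from the loss

# Positive advantage (return exceeds the value estimate) pushes the
# chosen actions' log-probabilities up.
loss = a2c_loss(logp_sum=-2.0, value_sum=0.3, ret=1.0, ent=1.1)
```

The 0.5 and 0.01 coefficients mirror the weights used in the training code that follows.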
```python
env = ToolEnv()
net = ActorCritic(V).to(device)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

def padbatch(ctxs):
    L = max(len(c) + 1 for c in ctxs)
    out = torch.full((len(ctxs), L), EOS, dtype=torch.long, device=device)
    for i, c in enumerate(ctxs):
        out[i, :len(c) + 1] = torch.tensor(c + [CTX], device=device)
    return out

def runbatch(stage, batch=128, train=True, greedy=False):
    ctxs, metas = [], []
    for _ in range(batch):
        c, t, abc = env.sample(stage)
        ctxs.append(c)
        metas.append((t, abc))
    ctx = padbatch(ctxs)
    staget = torch.full((batch,), stage, device=device, dtype=torch.long)
    acts, logps, ents, vals = net(ctx, staget, maxlen=6, greedy=greedy)
    rewards = []
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r, _ = env.stepseq(traj, abc, stage)
        rewards.append(r)
    R = torch.tensor(rewards, device=device).float()
    adv = (R - vals.sum(1)).detach()    # advantage against the summed value baseline
    if not train:
        return R.mean().item(), 0.0
    pg = -(logps.sum(1) * adv).mean()   # policy-gradient term
    vloss = F.mse_loss(vals.sum(1), R)  # critic regression to the episode return
    ent = -ents.mean()                  # negative mean entropy: minimizing it rewards exploration
    loss = pg + 0.5 * vloss + 0.01 * ent
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    opt.step()
    return R.mean().item(), loss.item()
```
Curriculum Learning: Gradually Increasing Task Complexity
To facilitate effective learning, we employ a curriculum that gradually escalates task difficulty. The agent begins with simpler arithmetic problems and progressively tackles more complex multi-step reasoning tasks. Periodic evaluations across all stages monitor the agent’s generalization capabilities and improvement in internal planning.
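The epoch-to-stage mapping in the loop below can be isolated as a small schedule function — a stdlib-only sketch (`stage_for_epoch` is a hypothetical helper extracted for illustration):

```python
STAGES = [0, 0, 0, 1, 1, 2]

def stage_for_epoch(ep, epochs_per_stage=10):
    # Each curriculum entry holds for epochs_per_stage epochs;
    # past the end of the list, the hardest stage persists.
    return STAGES[min((ep - 1) // epochs_per_stage, len(STAGES) - 1)]

print([stage_for_epoch(e) for e in (1, 10, 11, 31, 51, 60, 99)])
# [0, 0, 0, 1, 2, 2, 2]
```

With 60 training epochs, the agent spends 30 epochs on stage 0, 20 on stage 1, and 10 on stage 2.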
```python
print("Starting training...")
stages = [0, 0, 0, 1, 1, 2]  # curriculum: 10 epochs per entry
for ep in range(1, 61):
    stage = stages[min((ep - 1) // 10, len(stages) - 1)]
    acc, loss = runbatch(stage, batch=192, train=True)
    if ep % 5 == 0:
        with torch.no_grad():
            evals = [runbatch(s, train=False, greedy=True)[0] for s in [0, 1, 2]]
        print(f"Epoch {ep:02d} | Stage {stage} | Training Accuracy: {acc:.3f} | "
              f"Eval T0: {evals[0]:.3f} | T1: {evals[1]:.3f} | T2: {evals[2]:.3f} | Loss: {loss:.3f}")
```
Analyzing Agent Behavior and Performance
After training, we examine the agent’s decision-making by generating example reasoning sequences. These sequences reveal the internal tool tokens selected by the model and confirm whether the agent arrives at the correct solution. We also report final evaluation accuracies, demonstrating the agent’s successful integration of planning, memory, and reasoning within a single neural system.
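Rendering a trajectory amounts to mapping action ids back through the token table. A stand-alone sketch using a small subset of the `tok2str` mapping defined earlier:

```python
tok2str_demo = {11: "[MUL]", 12: "[ADD]", 14: "[ANS]"}  # subset of tok2str

def decode(ids, table):
    # Ids missing from the table (e.g. digit tokens) fall back to their string form.
    return " ".join(table.get(i, str(i)) for i in ids)

print(decode([11, 12, 14], tok2str_demo))  # [MUL] [ADD] [ANS]
```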
```python
def explain(stage):
    c, t, abc = env.sample(stage)
    ctx = padbatch([c])
    staget = torch.tensor([stage], device=device)
    with torch.no_grad():
        a, _, _, _ = net(ctx, staget, greedy=True)
    seq = [tok2str[x] for x in a[0].tolist()]
    r, _ = env.stepseq(a[0].tolist(), abc, stage)
    return dict(stage=stage, context=c, target=t, actions=" ".join(seq), reward=round(float(r), 2))

with torch.no_grad():
    for s in [0, 1, 2]:
        print(f"Stage {s} sample reasoning:")
        for _ in range(5):
            print(explain(s))

with torch.no_grad():
    finals = [runbatch(s, train=False, greedy=True, batch=1000)[0] for s in [0, 1, 2]]
print(f"Final greedy accuracies → Stage 0: {finals[0]:.3f}, Stage 1: {finals[1]:.3f}, Stage 2: {finals[2]:.3f}")
```
Summary: Towards End-to-End Learned Reasoning Agents
This exploration demonstrates that neural networks can internalize complex cognitive functions such as planning, memory management, and tool use when guided by reinforcement learning signals. Moving beyond traditional modular pipelines that separate memory, planning, and execution, our model-native agent learns these components as intertwined dynamics within a single architecture. This paradigm shift in agent design highlights the potential for emergent reasoning and autonomous decision-making without handcrafted control logic, paving the way for more adaptive and intelligent AI systems.
