How to Build an Advanced AI Agent with Summarized Short-Term and Vector-Based Long-Term Memory

Creating a Smart AI Agent with Persistent Memory

This guide demonstrates how to develop an intelligent AI assistant capable of not only engaging in conversations but also retaining and recalling information over time. Starting from the ground up, we integrate a compact language model, FAISS vector search, and a summarization system to implement both short-term and long-term memory functionalities. By leveraging embeddings and distilled knowledge, the agent dynamically adapts to user instructions, remembers critical details across sessions, and efficiently condenses context to maintain smooth interactions.

Setting Up the Environment and Dependencies

!pip install -q transformers accelerate bitsandbytes sentence-transformers faiss-cpu

import os
import json
import time
import uuid
import re
from datetime import datetime
import torch
import faiss
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

We begin by installing the necessary libraries and importing modules essential for our AI agent. The environment is configured to detect GPU availability, enabling optimized model execution whether on CUDA-enabled hardware or CPU.

Loading the Language Model with Hardware-Aware Optimization

def load_language_model(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    try:
        if DEVICE == "cuda":
            quant_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4"
            )
            tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                quantization_config=quant_config,
                device_map="auto"
            )
        else:
            tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.float32,  # full precision on CPU; fp16 is poorly supported there
                low_cpu_mem_usage=True
            )
        # With device_map="auto" the model is already placed by accelerate, so an
        # explicit device argument is only passed in the CPU case.
        device_arg = {} if DEVICE == "cuda" else {"device": -1}
        return pipeline("text-generation", model=model, tokenizer=tokenizer, do_sample=True, **device_arg)
    except Exception as error:
        raise RuntimeError(f"Error loading language model: {error}")

This function loads a lightweight language model optimized for the available hardware. On GPUs, it employs 4-bit quantization to reduce memory usage and speed up inference. On CPUs, it uses appropriate data types to balance performance and resource consumption.

Implementing Persistent Vector-Based Memory

class PersistentVectorMemory:
    def __init__(self, storage_path="agent_memory.json", embedding_dim=384):
        self.storage_path = storage_path
        self.embedding_dim = embedding_dim
        self.records = []
        self.embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
        self.index = faiss.IndexFlatIP(embedding_dim)
        if os.path.exists(storage_path):
            with open(storage_path) as f:
                data = json.load(f)
            self.records = data.get("items", [])
            if self.records:
                embeddings = torch.tensor([item["embedding"] for item in self.records], dtype=torch.float32).numpy()
                self.index.add(embeddings)

    def _embed_text(self, text):
        vector = self.embedder.encode([text], normalize_embeddings=True)[0]
        return vector.tolist()

    def add_memory(self, text, metadata=None):
        embedding = self._embed_text(text)
        self.index.add(torch.tensor([embedding]).numpy())
        record = {
            "id": str(uuid.uuid4()),
            "text": text,
            "metadata": metadata or {},
            "embedding": embedding
        }
        self.records.append(record)
        self._save_to_disk()
        return record["id"]

    def search_memory(self, query, top_k=5, similarity_threshold=0.25):
        if not self.records:
            return []
        query_vector = self.embedder.encode([query], normalize_embeddings=True)
        distances, indices = self.index.search(query_vector, min(top_k, len(self.records)))
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue
            if dist >= similarity_threshold:
                results.append((dist, self.records[idx]))
        return results

    def _save_to_disk(self):
        with open(self.storage_path, "w") as f:
            json.dump({"items": self.records}, f, indent=2)

This class manages the agent’s long-term memory by storing conversation snippets as vector embeddings. Using the MiniLM model for embedding and FAISS for efficient similarity search, it enables quick retrieval of relevant past information. Memory is persisted to disk, ensuring continuity across sessions.
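To see the retrieval mechanics in isolation, here is a toy sketch that swaps the MiniLM embedder for a trivial bag-of-words vectorizer over a shared vocabulary (illustrative only; the real class above uses SentenceTransformer embeddings with a FAISS `IndexFlatIP`). The key idea is identical: normalize the vectors so that an inner-product search ranks memories by cosine similarity.

```python
import numpy as np

MEMORIES = [
    "User's name is Alex",
    "CFA exam is in 2025",
    "User likes short answers",
]

def build_vocab(texts):
    # Shared vocabulary so every text maps into the same vector space
    return sorted({tok for t in texts for tok in t.lower().replace("?", "").split()})

def toy_embed(text, vocab):
    # Bag-of-words counts, L2-normalized so dot product == cosine similarity
    tokens = text.lower().replace("?", "").split()
    vec = np.array([float(tokens.count(w)) for w in vocab], dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

query = "when is the exam?"
vocab = build_vocab(MEMORIES + [query])
matrix = np.stack([toy_embed(m, vocab) for m in MEMORIES])

# Inner product on unit vectors, mirroring IndexFlatIP on normalized embeddings
scores = matrix @ toy_embed(query, vocab)
best = MEMORIES[int(np.argmax(scores))]
print(best)  # "CFA exam is in 2025"
```

The memory mentioning "exam" scores highest because it shares the most (normalized) token mass with the query, which is exactly the behavior the `similarity_threshold` in `search_memory` filters on.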

Utility Functions for Time and Text Handling

def current_timestamp_iso():
    return datetime.now().isoformat(timespec="seconds")

def truncate_text(text, max_length=1600):
    return text if len(text) <= max_length else text[:max_length] + " …"

def extract_json_from_text(text):
    match = re.search(r"{.*}", text, flags=re.S)
    return match.group(0) if match else None
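A quick sanity check of these helpers (the three definitions are repeated here so the snippet runs standalone):

```python
import re
from datetime import datetime

def current_timestamp_iso():
    return datetime.now().isoformat(timespec="seconds")

def truncate_text(text, max_length=1600):
    return text if len(text) <= max_length else text[:max_length] + " …"

def extract_json_from_text(text):
    # Greedily grab the outermost {...} span, tolerating surrounding chatter
    match = re.search(r"{.*}", text, flags=re.S)
    return match.group(0) if match else None

print(truncate_text("abcdef", 3))                          # "abc …"
print(extract_json_from_text('noise {"save": true} noise'))  # '{"save": true}'
print(extract_json_from_text("no json here"))              # None
```

`extract_json_from_text` is deliberately forgiving: small models often wrap their JSON in extra prose, and this pulls out the braces-delimited span before `json.loads` is attempted.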

Defining System Instructions and Prompts

We establish clear guidelines and prompts to steer the agent's behavior and memory management:

  • System Directive: The assistant should be concise, helpful, and prioritize factual information from memory over speculation.
  • Conversation Summarization Prompt: Summarizes recent dialogue into bullet points emphasizing stable facts and tasks.
  • Memory Distillation Prompt: Determines if user input contains valuable long-term information and returns a compact JSON indicating whether to save it.
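One way to pin these three pieces down as module-level constants is sketched below. The names and exact wording are illustrative (the agent class later inlines its prompts), but the structure matches the bullets above: a behavioral directive, a summarization template, and a distillation template that demands compact JSON.

```python
SYSTEM_DIRECTIVE = (
    "You are a concise, helpful assistant. Prefer facts retrieved from "
    "memory over speculation, and say so when you are unsure."
)

SUMMARIZE_PROMPT = (
    "Summarize the conversation below in 4-6 bullet points, focusing on "
    "stable facts and open tasks:\n\n{conversation}\n\nSummary:"
)

# Doubled braces survive .format() as literal JSON braces
DISTILL_PROMPT = (
    "Decide if the USER text contains durable info worth long-term memory "
    "(preferences, identity, projects, deadlines, facts). Return compact "
    'JSON only: {{"save": true/false, "memory": "one-sentence memory"}}.\n'
    "USER: {user_input}"
)

print(DISTILL_PROMPT.format(user_input="My name is Alex"))
```

Keeping the prompts as templates makes them easy to tweak and test independently of the model-calling code.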

Building the Memory-Enabled Conversational Agent

class ConversationalMemoryAgent:
    def __init__(self):
        self.llm = load_language_model()
        self.memory = PersistentVectorMemory()
        self.dialogue_history = []
        self.conversation_summary = ""
        self.max_history_length = 10

    def _generate_response(self, prompt, max_tokens=256, temperature=0.7):
        output = self.llm(
            prompt,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=0.95,
            num_return_sequences=1,
            pad_token_id=self.llm.tokenizer.eos_token_id
        )[0]["generated_text"]
        return output[len(prompt):].strip() if output.startswith(prompt) else output.strip()

    def _compose_prompt(self, user_input, memory_context):
        recent_dialogue = "\n".join([f"{role.upper()}: {text}" for role, text in self.dialogue_history[-8:]])
        system_message = f"System: You are a concise assistant with memory capabilities.\nTime: {current_timestamp_iso()}\n\n"
        memory_section = f"MEMORY (relevant excerpts):\n{memory_context}\n\n" if memory_context else ""
        summary_section = f"CONTEXT SUMMARY:\n{self.conversation_summary}\n\n" if self.conversation_summary else ""
        return system_message + memory_section + summary_section + recent_dialogue + f"\nUSER: {user_input}\nASSISTANT:"

    def _distill_and_store_memory(self, user_input):
        try:
            raw_output = self._generate_response(
                f"""Decide if the USER text contains durable info worth long-term memory (preferences, identity, projects, deadlines, facts).
Return compact JSON only: {{"save": true/false, "memory": "one-sentence memory"}}.
USER: {user_input}""",
                max_tokens=120,
                temperature=0.1
            )
            json_str = extract_json_from_text(raw_output)
            if json_str:
                data = json.loads(json_str)
                if data.get("save") and data.get("memory"):
                    self.memory.add_memory(data["memory"], {"timestamp": current_timestamp_iso(), "source": "distilled"})
                    return True, data["memory"]
        except Exception:
            pass

        # Heuristic fallback for common personal info patterns
        if re.search(r"\b(my name is|call me|I like|deadline|due|email|phone|working on|prefer|timezone|birthday|goal|exam)\b", user_input, flags=re.I):
            snippet = f"User said: {truncate_text(user_input, 120)}"
            self.memory.add_memory(snippet, {"timestamp": current_timestamp_iso(), "source": "heuristic"})
            return True, snippet

        return False, ""

    def _update_summary_if_needed(self):
        if len(self.dialogue_history) > self.max_history_length:
            full_conversation = "\n".join([f"{role}: {text}" for role, text in self.dialogue_history])
            summary = self._generate_response(
                f"Summarize the conversation below in 4-6 bullet points focusing on stable facts and tasks:\n\n{truncate_text(full_conversation, 3500)}\n\nSummary:",
                max_tokens=180,
                temperature=0.2
            )
            self.conversation_summary = summary
            self.dialogue_history = self.dialogue_history[-4:]

    def recall_relevant_memory(self, query, top_k=5):
        results = self.memory.search_memory(query, top_k=top_k)
        return "\n".join([f"- ({score:.2f}) {item['text']} [meta={item['metadata']}]" for score, item in results])

    def interact(self, user_input):
        self.dialogue_history.append(("user", user_input))
        saved, memory_note = self._distill_and_store_memory(user_input)
        memory_context = self.recall_relevant_memory(user_input, top_k=6)
        prompt = self._compose_prompt(user_input, memory_context)
        response = self._generate_response(prompt)
        self.dialogue_history.append(("assistant", response))
        self._update_summary_if_needed()
        status_message = f"💾 memory_saved: {saved}; " + (f"note: {memory_note}" if saved else "note: -")
        print(f"\nUSER: {user_input}\nASSISTANT: {response}\n{status_message}")
        return response

This class encapsulates the entire conversational agent, combining language generation, memory storage, and context summarization. It distills important user inputs into long-term memory, recalls relevant past information to enrich responses, and summarizes dialogue history to keep short-term context manageable.

Testing the Memory-Enabled Agent

agent = ConversationalMemoryAgent()

print("✅ Agent initialized. Sample interactions:\n")
agent.interact("Hello! My name is Alexandra, but you can call me Alex. I'm studying for the CFA exam in 2025.")
agent.interact("I work in marketing at a tech startup and appreciate brief, clear answers.")
agent.interact("Can you remind me of my exam year and how you should address me?")
agent.interact("By the way, I enjoy tutorials that combine retrieval-augmented generation with simple deployment.")
agent.interact("Based on my preferences, what should I focus on this week in my studies?")

We instantiate the agent and engage it with several example inputs to seed its memory and verify its ability to recall and adapt. The agent remembers the preferred name, exam timeline, and communication style, tailoring its responses accordingly.

Summary and Future Directions

Empowering an AI assistant with memory capabilities significantly enhances its usefulness and personalization. By storing key facts, recalling them when relevant, and summarizing ongoing conversations, the agent maintains coherent and context-aware interactions. This foundation opens avenues for expanding memory schemas, integrating richer knowledge bases, and experimenting with more sophisticated memory-augmented AI architectures.
