Build an Autonomous Wet-Lab Protocol Planner and Validator Using Salesforce CodeGen for Agentic Experiment Design and Safety Optimization

Developing an Intelligent Wet-Lab Protocol Planner and Validator

This guide walks you through creating a smart Wet-Lab Protocol Planner & Validator designed to streamline experimental design and execution. Built with Python, the system leverages advanced natural language processing to interpret and optimize laboratory protocols. The architecture is divided into distinct modules: ProtocolParser extracts structured information such as procedural steps, durations, and temperature conditions from textual protocols; InventoryManager verifies reagent availability and expiration; SchedulePlanner constructs efficient timelines and identifies opportunities for parallel task execution; and SafetyValidator flags potential biosafety or chemical hazards. An integrated large language model (LLM) then provides intelligent recommendations to enhance protocol efficiency, closing the loop between analysis, planning, validation, and refinement.

Setting Up the Environment and Loading the Model

We start by importing necessary Python libraries and loading the Salesforce CodeGen-350M-mono model locally. This approach enables lightweight inference without relying on external APIs. The tokenizer and model are initialized with 16-bit floating-point precision and automatic device mapping to maximize performance on GPU-enabled environments such as Google Colab.

import re
import json
import pandas as pd
from datetime import datetime, timedelta
from collections import defaultdict
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODELNAME = "Salesforce/codegen-350M-mono"
print("Loading CodeGen model (approx. 30 seconds)...")
tokenizer = AutoTokenizer.from_pretrained(MODELNAME)
tokenizer.pad_token = tokenizer.eos_token  # CodeGen has no pad token; reuse EOS
model = AutoModelForCausalLM.from_pretrained(
    MODELNAME, torch_dtype=torch.float16, device_map="auto"
)
print("✔ Model successfully loaded!")

Parsing Protocols: Extracting Key Experimental Details

The ProtocolParser class processes raw protocol text to identify individual steps, their durations, temperature requirements, and safety considerations. It uses a regular expression to detect numbered step headings, then scans up to four lines following each heading to extract timing and temperature data. Steps with no recognizable duration default to 30 minutes, and steps with no recognizable temperature default to room temperature. Safety flags such as biosafety levels, chemical hazards, and light sensitivity are collected from the same context.

class ProtocolParser:
    def readprotocol(self, text):
        steps = []
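        lines = text.split('\n')
        for i, line in enumerate(lines, 1):
            # A step heading looks like "1. Coating (Day 1, 4°C overnight)"
            stepmatch = re.search(r'^(\d+)\.\s+(.+)', line.strip())
            if stepmatch:
                num, name = stepmatch.groups()
                # i is 1-based, so this slice is the (up to) four lines after the heading
                context = '\n'.join(lines[i:min(i+4, len(lines))])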
                duration = self.extractduration(context)
                temp = self.extracttemp(context)
                safety = self.checksafety(context)
                steps.append({
                    'step': int(num),
                    'name': name,
                    'durationmin': duration,
                    'temp': temp,
                    'safety': safety,
                    'line': i,
                    'details': context[:200]
                })
        return steps

    def extractduration(self, text):
        text = text.lower()
        if 'overnight' in text:
            return 720  # 12 hours
        match = re.search(r'(\d+)\s*(?:hour|hr|h)s?(?!\w)', text)
        if match:
            return int(match.group(1)) * 60
        # Check ranges before single values; otherwise "10-15 min" matches as "15 min"
        match = re.search(r'(\d+)-(\d+)\s*(?:min|minute)', text)
        if match:
            return (int(match.group(1)) + int(match.group(2))) // 2
        match = re.search(r'(\d+)\s*(?:min|minute)s?', text)
        if match:
            return int(match.group(1))
        return 30  # Default duration

    def extracttemp(self, text):
        text = text.lower()
        if '4°c' in text or '4 °c' in text or '4°' in text:
            return '4°C'
        if '37°c' in text or '37 °c' in text:
            return '37°C'
        if '-20°c' in text or '-80°c' in text:
            return 'FREEZER'
        if 'room temp' in text or 'rt' in text or 'ambient' in text:
            return 'RT'
        return 'RT'

    def checksafety(self, text):
        flags = []
        textlower = text.lower()
        if re.search(r'bsl-[23]|biosafety', textlower):
            flags.append('BSL-2/3')
        if re.search(r'caution|corrosive|hazard|toxic', textlower):
            flags.append('HAZARD')
        if 'sharp' in textlower or 'needle' in textlower:
            flags.append('SHARPS')
        if 'dark' in textlower or 'light-sensitive' in textlower:
            flags.append('LIGHT-SENSITIVE')
        if 'flammable' in textlower:
            flags.append('FLAMMABLE')
        return flags
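
To make the duration logic easier to test in isolation, here is a minimal standalone sketch of the same idea. The function name parse_duration_minutes and its default argument are hypothetical and not part of the class above; the sketch assumes ranges should be checked before single values so that "10-15 minutes" yields the midpoint.

```python
import re

def parse_duration_minutes(text, default=30):
    """Return an estimated duration in minutes parsed from free text."""
    text = text.lower()
    if "overnight" in text:
        return 720  # treat "overnight" as 12 hours
    # Ranges first: "10-15 minutes" -> midpoint (integer division)
    m = re.search(r"(\d+)\s*-\s*(\d+)\s*(?:minutes?|mins?)\b", text)
    if m:
        return (int(m.group(1)) + int(m.group(2))) // 2
    m = re.search(r"(\d+)\s*(?:hours?|hrs?|h)\b", text)
    if m:
        return int(m.group(1)) * 60
    m = re.search(r"(\d+)\s*(?:minutes?|mins?)\b", text)
    if m:
        return int(m.group(1))
    return default  # nothing recognizable: fall back to the default

print(parse_duration_minutes("Incubate 10-15 minutes"))
print(parse_duration_minutes("Incubate 2 hours at room temperature"))
```

Keeping the parser as a pure function makes it straightforward to add new phrasings (for example "90 sec") and cover them with unit tests.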

Managing Reagent Inventory with Fuzzy Matching and Expiry Checks

The InventoryManager class loads reagent stock data from CSV text and checks the availability and expiry of the reagents a protocol needs. Matching is deliberately loose: a case-insensitive substring search on the first two words of each reagent name, which tolerates naming variations. The class flags reagents that are missing, expired, expiring within 30 days, or below a quantity of 10, so that problems surface before the experiment starts.

class InventoryManager:
    def __init__(self, csvtext):
        from io import StringIO
        self.df = pd.read_csv(StringIO(csvtext))
        self.df['expiry'] = pd.to_datetime(self.df['expiry'])

    def checkavailability(self, reagentlist):
        issues = []
        for reagent in reagentlist:
            reagentclean = reagent.lower().replace('_', ' ').replace('-', ' ')
            matches = self.df[self.df['reagent'].str.lower().str.contains(
                '|'.join(reagentclean.split()[:2]), na=False, regex=True
            )]
            if matches.empty:
                issues.append(f"❌ {reagent}: NOT FOUND IN INVENTORY")
            else:
                row = matches.iloc[0]
                if row['expiry'] < datetime.now():
                    issues.append(f"⚠️ {reagent}: EXPIRED on {row['expiry'].date()} (lot {row['lot']})")
                elif (row['expiry'] - datetime.now()).days < 30:
                    issues.append(f"⚠️ {reagent}: EXPIRING SOON ({row['expiry'].date()}, lot {row['lot']})")
                if row['quantity'] < 10:
                    issues.append(f"⚠️ {reagent}: LOW STOCK ({row['quantity']} {row['unit']} left)")
        return issues

    def extractreagents(self, protocoltext):
        reagents = set()
        patterns = [
            r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+(?:antibody|buffer|solution)',
            r'\b([A-Z]{2,}(?:-[A-Z0-9]+)?)\b',
            r'(?:add|use|prepare|dilute)\s+([a-z\s-]+(?:antibody|buffer|substrate|solution))',
        ]
        for pattern in patterns:
            matches = re.findall(pattern, protocoltext, re.IGNORECASE)
            reagents.update(m.strip() for m in matches if len(m) > 2)
        return list(reagents)[:15]
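
The lookup above is a substring match rather than a similarity score. If closer-to-true fuzzy matching is wanted, one option is the standard library's difflib. The sketch below is an assumption about how that could look, not the approach used in the class; fuzzy_find and the 0.6 cutoff are hypothetical choices.

```python
from difflib import get_close_matches

def normalize(name):
    """Lowercase and treat hyphens/underscores as spaces."""
    return name.lower().replace("-", " ").replace("_", " ")

def fuzzy_find(reagent, inventory_names, cutoff=0.6):
    """Return the inventory name most similar to `reagent`, or None."""
    lookup = {normalize(name): name for name in inventory_names}
    hits = get_close_matches(normalize(reagent), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

names = ["capture antibody", "blocking buffer", "PBS-T", "streptavidin HRP"]
print(fuzzy_find("Streptavidin-HRP", names))  # hyphen vs. space variant
print(fuzzy_find("blocking bufer", names))    # misspelling
print(fuzzy_find("ethanol", names))           # no sufficiently close match
```

A similarity cutoff trades false positives for false negatives, so it is worth tuning against the real inventory before relying on it.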

Designing Efficient Experiment Schedules and Parallelization

The SchedulePlanner module builds a timeline for the protocol from a given start time. A step longer than 8 hours (480 minutes), such as an overnight incubation, rolls the schedule over to 09:00 on the next day. Any step longer than 60 minutes is marked as parallelizable, and when such a step is followed by one at the same temperature, the planner reports the shorter of the two durations as potential time saved. These estimates are heuristic, but they highlight where idle time can be reclaimed.

class SchedulePlanner:
    def makeschedule(self, steps, starttime="09:00"):
        schedule = []
        current = datetime.strptime(f"2025-01-01 {starttime}", "%Y-%m-%d %H:%M")
        day = 1
        for step in steps:
            end = current + timedelta(minutes=step['durationmin'])
            if step['durationmin'] > 480:  # Longer than 8 hours
                day += 1
                current = datetime.strptime(f"2025-01-{day:02d} 09:00", "%Y-%m-%d %H:%M")  # zero-pad so day >= 10 parses
                end = current
            schedule.append({
                'step': step['step'],
                'name': step['name'][:40],
                'start': current.strftime("%H:%M"),
                'end': end.strftime("%H:%M"),
                'duration': step['durationmin'],
                'temp': step['temp'],
                'day': day,
                'canparallelize': step['durationmin'] > 60,
                'safety': ', '.join(step['safety']) if step['safety'] else 'None'
            })
            if step['durationmin'] <= 480:
                current = end
        return schedule

    def optimizeparallelization(self, schedule):
        parallelgroups = []
        idletimesaved = 0
        for i, step in enumerate(schedule):
            if step['canparallelize'] and i + 1 < len(schedule):
                nextstep = schedule[i + 1]
                if step['temp'] == nextstep['temp']:
                    saved = min(step['duration'], nextstep['duration'])
                    parallelgroups.append(
                        f"✨ Steps {step['step']} & {nextstep['step']} can overlap → Save {saved} minutes"
                    )
                    idletimesaved += saved
        return parallelgroups, idletimesaved
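
As an illustration of the day-rollover idea, the sketch below computes the next day with timedelta arithmetic instead of formatting a date string. It is a simplified variant under stated assumptions, not the class above: build_timeline and workday_limit are hypothetical names, and a long step is recorded on the day it starts, with the following step resuming at the start time on the next calendar day.

```python
from datetime import datetime, timedelta

def build_timeline(durations_min, start="09:00", workday_limit=480):
    """Assign (day, start, end) to consecutive steps given durations in minutes."""
    base = datetime.strptime(f"2025-01-01 {start}", "%Y-%m-%d %H:%M")
    current, timeline = base, []
    for minutes in durations_min:
        begin = current
        if minutes > workday_limit:
            # Long step (e.g. overnight incubation): resume at the start
            # time on the following calendar day.
            next_day = (begin + timedelta(days=1)).date()
            current = datetime.combine(next_day, base.time())
        else:
            current = begin + timedelta(minutes=minutes)
        day = (begin.date() - base.date()).days + 1
        timeline.append((day, begin.strftime("%H:%M"), current.strftime("%H:%M")))
    return timeline

timeline = build_timeline([720, 60, 120])
for entry in timeline:
    print(entry)
```

Date arithmetic avoids the month-boundary and two-digit-day problems that string-built dates run into on longer protocols.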

Ensuring Laboratory Safety Through Automated Validation

The SafetyValidator class checks each step for out-of-range pH values, biosafety level requirements, and chemical hazards, and issues warnings for steps involving sharps or light-sensitive reagents. Note that although ProtocolParser tags FLAMMABLE steps, validate as written does not emit a warning for them.

class SafetyValidator:
    RULES = {
        'phrange': (5.0, 11.0),
        'templimits': {'4°C': (2, 8), '37°C': (35, 39), 'RT': (20, 25)},
        'maxconcurrentinstruments': 3,
    }

    def validate(self, steps):
        risks = []
        for step in steps:
            phmatch = re.search(r'ph\s*(\d+\.?\d*)', step['details'].lower())
            if phmatch:
                ph = float(phmatch.group(1))
                if not (self.RULES['phrange'][0] <= ph <= self.RULES['phrange'][1]):
                    risks.append(f"⚠️ Step {step['step']}: pH {ph} OUTSIDE SAFE RANGE")
            if 'BSL-2/3' in step['safety']:
                risks.append(f"🛡️ Step {step['step']}: BSL-2 cabinet REQUIRED")
            if 'HAZARD' in step['safety']:
                risks.append(f"🧪 Step {step['step']}: Full PPE and chemical hood REQUIRED")
            if 'SHARPS' in step['safety']:
                risks.append(f"💉 Step {step['step']}: Use sharps container and needle safety protocols")
            if 'LIGHT-SENSITIVE' in step['safety']:
                risks.append(f"🌙 Step {step['step']}: Perform in dark or amber tubes")
        return risks
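
The pH rule can be exercised on its own. The following sketch assumes the same 5.0 to 11.0 bounds as RULES['phrange'] and, unlike validate, collects every pH value in the text rather than only the first; find_ph_violations is a hypothetical helper name.

```python
import re

PH_RANGE = (5.0, 11.0)  # same bounds as RULES['phrange'] above

def find_ph_violations(text, ph_range=PH_RANGE):
    """Return every pH value in `text` that falls outside `ph_range`."""
    low, high = ph_range
    # Matches "pH 9.6", "ph7", "pH 12.5"; the decimal part is optional
    values = [float(v) for v in re.findall(r"\bph\s*(\d+(?:\.\d+)?)", text.lower())]
    return [v for v in values if not (low <= v <= high)]

print(find_ph_violations("Dilute in coating buffer (pH 9.6)"))
print(find_ph_violations("Adjust to pH 2, then neutralize to pH 7.4"))
```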

Integrating AI for Protocol Optimization and Refinement

The core agent loop orchestrates the entire workflow, from parsing the protocol and checking inventory to scheduling and safety validation. It calls the CodeGen model to generate optimization suggestions, such as batching similar temperature steps or pre-warming instruments, enhancing overall efficiency.

def llmcall(prompt, maxtokens=200):
    try:
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=maxtokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
        # The decoded text starts with the prompt; keep only the continuation
        return tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
    except Exception:
        return "Consider batching steps with similar temperatures and pre-warming instruments."
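
Slicing the decoded text by len(prompt) assumes the output begins with the prompt exactly as typed, which may not hold if the prompt was truncated to 512 tokens or altered in the tokenizer round trip. A small guard, sketched here with the hypothetical helper strip_prompt, makes that assumption explicit.

```python
def strip_prompt(decoded, prompt):
    """Return only the generated continuation from a decoded sequence."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # The prompt was not reproduced verbatim; return the text unchanged
    # rather than cutting it at an arbitrary offset.
    return decoded.strip()

print(strip_prompt("Suggest optimizations: batch the washes", "Suggest optimizations:"))
```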

def agentloop(protocoltext, inventorycsv, starttime="09:00"):
    print("\n🔬 Starting protocol analysis...\n")
    parser = ProtocolParser()
    steps = parser.readprotocol(protocoltext)
    print(f"📄 Extracted {len(steps)} protocol steps")
    inventory = InventoryManager(inventorycsv)
    reagents = inventory.extractreagents(protocoltext)
    print(f"🧪 Detected {len(reagents)} reagents: {', '.join(reagents[:5])}...")
    invissues = inventory.checkavailability(reagents)
    validator = SafetyValidator()
    safetyrisks = validator.validate(steps)
    planner = SchedulePlanner()
    schedule = planner.makeschedule(steps, starttime)
    parallelopts, timesaved = planner.optimizeparallelization(schedule)
    totaltime = sum(s['duration'] for s in schedule)
    optimizedtime = totaltime - timesaved
    optprompt = f"Protocol contains {len(steps)} steps totaling {totaltime} minutes. Suggest key bottleneck optimizations:"
    optimization = llmcall(optprompt, maxtokens=80)
    return {
        'steps': steps,
        'schedule': schedule,
        'inventoryissues': invissues,
        'safetyrisks': safetyrisks,
        'parallelization': parallelopts,
        'timesaved': timesaved,
        'totaltime': totaltime,
        'optimizedtime': optimizedtime,
        'aioptimization': optimization,
        'reagents': reagents
    }

Generating User-Friendly Outputs: Checklists and Gantt Charts

To facilitate practical use, the system converts results into Markdown checklists and CSV files compatible with Gantt chart tools. These outputs summarize step timings, reagent pick-lists, safety alerts, and optimization tips, providing clear guidance for laboratory personnel.

def generatechecklist(results):
    md = "# 🔬 Wet-Lab Protocol Checklist\n\n"
    md += f"Total Steps: {len(results['schedule'])}\n"
    md += f"Estimated Duration: {results['totaltime']} minutes ({results['totaltime']//60}h {results['totaltime']%60}m)\n"
    md += f"Optimized Duration: {results['optimizedtime']} minutes (time saved: {results['timesaved']} minutes)\n\n"
    md += "## ⏱️ Timeline\n"
    currentday = 1
    for item in results['schedule']:
        if item['day'] > currentday:
            md += f"\n### Day {item['day']}\n"
            currentday = item['day']
        parallelicon = " 🔄" if item['canparallelize'] else ""
        md += f"- [ ] {item['start']}-{item['end']} | Step {item['step']}: {item['name']} ({item['temp']}){parallelicon}\n"
    md += "\n## 🧪 Reagent Pick-List\n"
    for reagent in results['reagents']:
        md += f"- [ ] {reagent}\n"
    md += "\n## ⚠️ Safety & Inventory Alerts\n"
    allissues = results['safetyrisks'] + results['inventoryissues']
    if allissues:
        for issue in allissues:
            md += f"- {issue}\n"
    else:
        md += "- ✅ No critical issues detected\n"
    md += "\n## ✨ Optimization Tips\n"
    for tip in results['parallelization']:
        md += f"- {tip}\n"
    md += f"- 💡 AI Suggestion: {results['aioptimization']}\n"
    return md

def generateganttcsv(schedule):
    df = pd.DataFrame(schedule)
    return df.to_csv(index=False)
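
For environments without pandas, the same CSV export can be approximated with the standard library. This is a sketch under the assumption that every schedule row shares the keys of the first row; schedule_to_csv is a hypothetical name and the two rows are illustrative data only.

```python
import csv
from io import StringIO

def schedule_to_csv(schedule):
    """Serialize a list of schedule dicts to CSV text using only the stdlib."""
    if not schedule:
        return ""
    buffer = StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(schedule[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(schedule)
    return buffer.getvalue()

rows = [
    {"step": 1, "name": "Blocking", "start": "09:00", "end": "10:00", "day": 1},
    {"step": 2, "name": "Sample Incubation", "start": "10:00", "end": "12:00", "day": 1},
]
csv_text = schedule_to_csv(rows)
print(csv_text)
```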

Demonstration with a Sample ELISA Protocol

We validate the system using a sample ELISA protocol for cytokine detection alongside a reagent inventory dataset. The agent parses the protocol, checks reagent stocks, schedules steps, identifies parallelization opportunities, and outputs a comprehensive checklist and Gantt chart data. This example highlights the planner’s capability to act as an autonomous, intelligent assistant in laboratory settings.

SAMPLEPROTOCOL = """ELISA Protocol for Cytokine Detection

1. Coating (Day 1, 4°C overnight)
   - Dilute capture antibody to 2 µg/mL in coating buffer (pH 9.6)
   - Add 100 µL per well to 96-well plate
   - Incubate at 4°C overnight (12-16 hours)
   - BSL-2 cabinet required

2. Blocking (Day 2)
   - Wash plate 3× with PBS-T (200 µL/well)
   - Add 200 µL blocking buffer (1% BSA in PBS)
   - Incubate 1 hour at room temperature

3. Sample Incubation
   - Wash 3× with PBS-T
   - Add 100 µL diluted samples/standards
   - Incubate 2 hours at room temperature

4. Detection Antibody
   - Wash 5× with PBS-T
   - Add 100 µL biotinylated detection antibody (0.5 µg/mL)
   - Incubate 1 hour at room temperature

5. Streptavidin-HRP
   - Wash 5× with PBS-T
   - Add 100 µL streptavidin-HRP (1:1000 dilution)
   - Incubate 30 minutes at room temperature
   - Work in dark

6. Development
   - Wash 7× with PBS-T
   - Add 100 µL TMB substrate
   - Incubate 10-15 minutes (monitor color development)
   - Add 50 µL stop solution (2M H2SO4) - CAUTION: corrosive
"""

SAMPLEINVENTORY = """reagent,quantity,unit,expiry,lot
capture antibody,500,µg,2025-12-31,AB123
blocking buffer,500,mL,2025-11-30,BB456
PBS-T,1000,mL,2026-01-15,PT789
detection antibody,8,µg,2025-10-15,DA321
streptavidin HRP,10,mL,2025-12-01,SH654
TMB substrate,100,mL,2025-11-20,TM987
stop solution,250,mL,2026-03-01,SS147
BSA,100,g,2024-09-30,BS741"""

results = agentloop(SAMPLEPROTOCOL, SAMPLEINVENTORY, starttime="09:00")
print("\n" + "="*70)
print(generatechecklist(results))
print("\n" + "="*70)
print("\n📊 Gantt CSV Preview (first 400 characters):\n")
print(generateganttcsv(results['schedule'])[:400])
print("\n🎯 Time Savings Achieved:", f"{results['timesaved']} minutes through parallelization")

Conclusion: Advancing Wet-Lab Automation with AI-Driven Protocol Planning

This project illustrates how agentic AI can improve reproducibility, safety, and efficiency in wet-lab workflows. By converting unstructured experimental text into actionable, validated plans, the system automates critical tasks such as reagent management, safety checks, and schedule optimization. Because CodeGen runs on-device, protocol text never leaves the local machine, and the model supplies free-text optimization suggestions alongside the rule-based checks. The resulting planner produces Gantt-ready CSV data, Markdown checklists, and AI-generated optimization advice, laying the groundwork for more autonomous laboratory planning.
