Building a Regression Language Model: Predicting Numerical Values from Text
In this guide, we will develop a Regression Language Model (RLM) that directly forecasts continuous numerical outputs from textual inputs. Unlike traditional models that classify or generate text, our focus is on training a transformer-based architecture to uncover and learn the quantitative relationships embedded within natural language statements. We will start by synthesizing a dataset of text-to-number pairs, tokenize the data effectively, and then train a compact Transformer encoder to translate linguistic signals into real-valued predictions. By the end, you will gain a comprehensive understanding of how to implement RLMs from the ground up, visualize their training dynamics, and evaluate their performance on new, unseen inputs.
Essential Libraries and Environment Setup
First, we import key Python libraries such as PyTorch for deep learning, NumPy for numerical operations, and Matplotlib for plotting results. To ensure reproducibility of our experiments, we set fixed random seeds. This setup guarantees that every run produces consistent outcomes, which is crucial for debugging and benchmarking.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re
torch.manual_seed(42)
np.random.seed(42)
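As a quick sanity check that seeding works as intended, the standalone snippet below (not part of the pipeline) shows that re-seeding replays the exact same random draw:

```python
import torch

torch.manual_seed(42)
a = torch.rand(3)          # first draw after seeding
torch.manual_seed(42)
b = torch.rand(3)          # re-seeding replays the same draw
print(torch.equal(a, b))   # True: identical samples
```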
Creating a Synthetic Dataset for Text-to-Number Regression
To train our model without relying on external datasets, we generate synthetic examples that pair natural language sentences with corresponding numerical values. We use a variety of sentence templates representing different contexts such as temperatures, ratings, prices, percentages, speeds, and distances. Each template includes a transformation function to scale or normalize the target value appropriately. This diverse set of examples helps the model learn a broad range of text-to-number mappings.
def generate_synthetic_data(num_samples=2000):
    """Create synthetic text and numerical value pairs for regression."""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I give this a rating of {} out of ten", lambda x: x),
        ("The cost amounts to {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Traveling at {} kilometers per hour", lambda x: x / 10),
        ("{} percent completed", lambda x: x / 100),
        ("Achieved {} points in the match", lambda x: x / 10),
        ("The length is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(num_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        sentence = template.format(round(value, 1))
        target_value = transform(value)
        data.append((sentence, target_value))
    return data
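To see what a single sample looks like, here is one template applied by hand (a standalone sketch mirroring the function above; the seed value is arbitrary):

```python
import numpy as np

np.random.seed(0)
template, transform = ("{} percent completed", lambda x: x / 100)
value = np.random.uniform(0, 100)            # raw value in [0, 100)
sentence = template.format(round(value, 1))  # e.g. "54.9 percent completed"
target = transform(value)                    # normalized to [0, 1)
print(sentence, target)
```

Note that the target passed to the model is the transformed value, while the sentence shows the raw (rounded) value, so the model must learn each template's scaling.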
Tokenizing Text: Converting Words to Numerical Indices
To feed textual data into our model, we need to convert sentences into sequences of numerical tokens. We implement a straightforward tokenizer that builds a vocabulary from the training texts, assigning each unique word an index. It also handles unknown words and pads sequences to a fixed length, ensuring uniform input sizes for the model.
class SimpleTokenizer:
    def __init__(self):
        self.word2idx = {"<pad>": 0, "<unk>": 1}
        self.idx2word = {0: "<pad>", 1: "<unk>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Construct vocabulary from a list of texts."""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\s\w]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Transform text into a list of token indices with padding."""
        words = re.findall(r'\w+|[^\s\w]', text.lower())
        indices = [self.word2idx.get(w, self.word2idx["<unk>"]) for w in words]
        if len(indices) < max_len:
            indices += [self.word2idx["<pad>"]] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices
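The regex `\w+|[^\s\w]` splits text into word runs and individual punctuation marks. Note that it also breaks a decimal like 55.0 into three tokens, so the model never sees a number as a single unit. A quick illustration:

```python
import re

text = "The cost amounts to 55.0 dollars!"
tokens = re.findall(r'\w+|[^\s\w]', text.lower())
print(tokens)
# ['the', 'cost', 'amounts', 'to', '55', '.', '0', 'dollars', '!']
```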
Dataset and Model Architecture: Transformer-Based Regression
We encapsulate our data into a PyTorch Dataset class that tokenizes each sentence and returns tensors suitable for batching. Our Regression Language Model consists of token embeddings combined with positional embeddings, which are passed through a multi-layer Transformer encoder. The encoder outputs are mean-pooled over non-padded tokens, then fed into a small feedforward network that predicts a single continuous value. This design enables the model to capture numerical semantics from language and map them to real-valued targets.
class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)

class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embeds = self.token_embedding(x)
        pos_embeds = self.position_embedding(positions)
        embeddings = token_embeds + pos_embeds
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output
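The masked mean-pooling step is easy to verify on a toy tensor: padded positions (token id 0) contribute nothing to the pooled vector. A self-contained sketch of that computation, using a tensor of ones as a stand-in for the encoder output:

```python
import torch

x = torch.tensor([[5, 3, 0, 0]])             # one sequence, last two tokens are padding
encoded = torch.ones(1, 4, 2)                # stand-in for the encoder output
padding_mask = (x == 0)
mask = (~padding_mask).unsqueeze(-1).float()
pooled = (encoded * mask).sum(dim=1) / mask.sum(dim=1)
print(pooled.shape)                          # torch.Size([1, 2]): averaged over 2 real tokens
```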
Training the Regression Language Model
We train the model using the Adam optimizer and mean squared error (MSE) loss function. The training loop iterates over mini-batches, performing backpropagation to update model weights. After each epoch, we evaluate the model on a validation set to monitor generalization. Training and validation losses are recorded and displayed to track progress and detect potential overfitting.
def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001, device='cpu'):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\n📊 Training on {device}")
    print("-" * 60)
    model.to(device)
    for epoch in range(epochs):
        model.train()
        total_train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()
        avg_train_loss = total_train_loss / len(train_loader)
        train_losses.append(avg_train_loss)
        model.eval()
        total_val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                total_val_loss += loss.item()
        avg_val_loss = total_val_loss / len(val_loader)
        val_losses.append(avg_val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")
    return train_losses, val_losses
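MSE averages the squared differences between predictions and targets over the batch; a two-sample check with toy numbers (not taken from the model) confirms the arithmetic:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
pred = torch.tensor([[0.5], [1.0]])
target = torch.tensor([[0.0], [1.0]])
loss = criterion(pred, target)   # (0.5**2 + 0.0**2) / 2 = 0.125
print(loss.item())
```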
End-to-End Pipeline: Data Preparation, Training, and Evaluation
We generate 2,000 synthetic samples and split them into training (80%) and validation (20%) sets. The tokenizer is fitted on the training texts to build the vocabulary. We then create PyTorch datasets and data loaders for efficient batching. The Regression Language Model is instantiated and trained on the prepared data. After training, we plot the loss curves to visualize learning progress. Finally, we test the model on several example sentences to observe its numerical predictions.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\n📝 Generating synthetic dataset...")
data = generate_synthetic_data(2000)
split_index = int(0.8 * len(data))
train_data, val_data = data[:split_index], data[split_index:]
print(f"Training samples: {len(train_data)}, Validation samples: {len(val_data)}")
print("\n🔧 Initializing tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")
train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
print("\n🏗️ Constructing Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Total model parameters: {sum(p.numel() for p in model.parameters()):,}")
train_losses, val_losses = train_rlm(model, train_loader, val_loader, device=device)
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Training Loss', linewidth=2)
plt.plot(val_losses, label='Validation Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.title('Training and Validation Loss Over Epochs')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
print("\n🎯 Testing model predictions on new inputs:")
print("-" * 60)
test_sentences = [
    "The temperature is 22.3 degrees",
    "I give this a rating of 7.5 out of ten",
    "The cost amounts to 55.0 dollars",
    "85.0 percent completed"
]
model.eval()
with torch.no_grad():
    for sentence in test_sentences:
        tokens = torch.tensor([tokenizer.encode(sentence)]).to(device)
        prediction = model(tokens).item()
        print(f"Input: {sentence}")
        print(f"Predicted value: {prediction:.4f}\n")
print("✅ Regression Language Model pipeline complete!")
Summary: Bridging Language and Numerical Reasoning with Transformers
In summary, we have successfully crafted a Regression Language Model that predicts continuous numerical values from textual descriptions. By integrating token and positional embeddings with a Transformer encoder and a regression head, the model learns to interpret numerical semantics embedded in language. The synthetic dataset approach allows controlled experimentation, while training visualization and testing on new examples demonstrate the model’s ability to generalize. This framework lays the foundation for advanced applications where understanding and quantifying information from text is essential.
