
How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals


Hands-On Guide to Building Neural Networks with Tinygrad

This comprehensive tutorial walks you through constructing neural networks from the ground up using Tinygrad, a minimalist deep learning framework. We dive deep into tensor manipulations, automatic differentiation, attention mechanisms, and transformer architectures, progressively assembling each element, from fundamental tensor operations to multi-head attention modules, transformer blocks, and ultimately a compact GPT-style model. Throughout the process, Tinygrad's straightforward design offers clear insights into the inner workings of model training, optimization, and kernel fusion for enhanced performance.

Step 1: Setting Up Tinygrad and Exploring Tensor Operations

We begin by installing Tinygrad in a Colab environment and immediately start experimenting with tensors and autograd. By constructing a simple computation graph involving matrix multiplications and element-wise operations, we observe how gradients propagate during backpropagation. This hands-on approach demystifies the automatic differentiation process, revealing how Tinygrad tracks operations to compute derivatives efficiently.

import subprocess, sys

print("Installing dependencies...")
# clang is needed for tinygrad's CPU backend; then install tinygrad from source
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])

import numpy as np
from tinygrad import Tensor, Device

print(f"Using device: {Device.DEFAULT}")
print("=" * 60)

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)

z = (x @ y).sum() + (x ** 2).mean()
z.backward()

print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")
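As a sanity check, the same gradients can be derived by hand and verified with plain NumPy, independently of tinygrad. For z = sum(x @ y) + mean(x²), the chain rule gives ∂z/∂x = 1·yᵀ + 2x/n and ∂z/∂y = xᵀ·1, which we can confirm against a finite-difference estimate:

```python
import numpy as np

# Same values as the tinygrad example above
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[2.0, 0.0], [1.0, 2.0]])

def f(x, y):
    return (x @ y).sum() + (x ** 2).mean()

# Analytic gradients: d/dx sum(x@y) = ones @ y.T, d/dx mean(x**2) = 2x / x.size
ones = np.ones_like(x)
grad_x = ones @ y.T + 2.0 * x / x.size
grad_y = x.T @ ones

# Central finite difference on one entry of x as a cross-check
eps = 1e-5
xp = x.copy(); xp[0, 0] += eps
xm = x.copy(); xm[0, 0] -= eps
numeric = (f(xp, y) - f(xm, y)) / (2 * eps)

print(grad_x)   # [[2.5, 4.0], [3.5, 5.0]] -- matches tinygrad's x.grad
print(grad_y)   # [[4.0, 4.0], [6.0, 6.0]] -- matches tinygrad's y.grad
print(numeric)  # ≈ grad_x[0, 0]
```

Because z is quadratic in x, the central difference agrees with the analytic value essentially to machine precision.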

Step 2: Crafting Custom Multi-Head Attention and Transformer Blocks

Next, we build a multi-head attention mechanism and transformer block from scratch. This involves manually implementing query, key, and value projections, scaled dot-product attention with softmax normalization, feedforward layers, and layer normalization. By running this code, we gain a granular understanding of how each component contributes to the transformer’s functionality, laying the foundation for more complex architectures.

class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        # Split into q, k, v and move heads to axis 1: (B, num_heads, T, head_dim)
        q, k, v = [qkv[:, :, i].transpose(1, 2) for i in range(3)]
        scale = self.head_dim ** -0.5
        attn_scores = (q @ k.transpose(-2, -1)) * scale
        attn_probs = attn_scores.softmax(axis=-1)
        out = (attn_probs @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)

class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1w = Tensor.ones(dim)
        self.ln2w = Tensor.ones(dim)

    def __call__(self, x):
        # Pre-norm attention and feedforward, each with a residual connection
        x = x + self.attn(self.layernorm(x, self.ln1w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self.layernorm(x, self.ln2w)

    def layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()
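To see exactly what the attention math does, here is a minimal single-head reference in plain NumPy, a sketch for checking shapes and softmax normalization rather than tinygrad code:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for one head with (T, d) inputs."""
    scale = q.shape[-1] ** -0.5          # the 1/sqrt(head_dim) factor
    scores = (q @ k.T) * scale           # (T, T) similarity matrix
    probs = softmax(scores, axis=-1)     # each query's weights over all keys
    return probs @ v, probs              # weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = rng.standard_normal((3, T, d))
out, probs = attention(q, k, v)

print(out.shape)            # (4, 8): one output vector per position
print(probs.sum(axis=-1))   # every row sums to 1
```

The multi-head version above runs this same computation in parallel for each head, then concatenates the per-head outputs and mixes them with the output projection.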

Step 3: Constructing a Compact GPT Model

Building on the previous components, we assemble a mini-GPT model. This includes token embeddings, positional encodings, stacking multiple transformer blocks, and a final linear layer projecting to vocabulary logits. Despite its simplicity, this model encapsulates the core principles of transformer-based language models, demonstrating how a functional GPT can be implemented with minimal code.

class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        # Final layer norm before projecting to vocabulary logits
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1w, block.ln2w])
        return params

model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")
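The printed parameter count can be reproduced by hand. Assuming the layer shapes defined in the classes above (fused qkv projection, 4× feedforward expansion, weight-only layer norms), a quick tally in plain Python:

```python
# Hand-count of parameters for vocab_size=256, dim=64, max_len=16, num_layers=2
vocab_size, dim, max_len, num_layers = 256, 64, 16, 2

embeddings = vocab_size * dim + max_len * dim   # token + positional tables
per_block = (
    dim * 3 * dim      # fused qkv projection
    + dim * dim        # attention output projection
    + dim * 4 * dim    # feedforward up-projection
    + 4 * dim * dim    # feedforward down-projection
    + 2 * dim          # two layer-norm weight vectors
)
final = dim + vocab_size * dim                  # final norm weight + output head

total = embeddings + num_layers * per_block + final
print(total)  # 132416 -- should match the model's printout
```

Note that the embedding tables and the output head dominate the count at this scale; the transformer blocks themselves contribute about 49k parameters each.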

Step 4: Training the MiniGPT on Synthetic Data

We train the mini-GPT model using artificially generated sequences, where the task is to predict the preceding token in the sequence. Utilizing the Adam optimizer, we monitor the loss reduction over multiple iterations, confirming that the model learns meaningful representations even with simple data. This step highlights the training loop mechanics and optimization strategies in Tinygrad.
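Before running the training loop, it helps to see what sparse categorical cross-entropy actually computes. The plain-NumPy sketch below (a reference for the math, not tinygrad's implementation) also shows the loss value an untrained model should start near: with uniform logits over V classes, the loss is ln(V), i.e. ln 256 ≈ 5.55 for this tutorial's byte-level vocabulary.

```python
import numpy as np

def sparse_ce(logits, targets):
    """Mean negative log-likelihood of integer targets under softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

V = 256
logits = np.zeros((4, V))                # uniform predictions over 256 classes
targets = np.array([0, 17, 99, 255])
print(sparse_ce(logits, targets))        # ≈ ln(256) ≈ 5.545
```

Watching the training loss drop below ln(256) is therefore the first sign the model is learning anything at all.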

def generate_data(batch_size, seq_len):
    # Inputs are random bytes; targets are the same sequence shifted right by one,
    # so at each position the model must predict the previous token.
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')

from tinygrad.nn import optim
import time

optimizer = optim.Adam(params, lr=0.001)
losses = []

print("Training model to predict previous tokens...")
with Tensor.train():
    for step in range(20):
        start_time = time.time()
        x_batch, y_batch = generate_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start_time
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed * 1000:.1f}ms")
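The target construction (np.roll plus overwriting position 0) can be sanity-checked on a tiny sequence with plain NumPy:

```python
import numpy as np

x = np.array([[3, 7, 1, 4]])
y = np.roll(x, 1, axis=1)   # y[t] = x[t-1], but position 0 wraps around to x[-1]
y[:, 0] = x[:, 0]           # overwrite the wrapped entry so position 0 targets itself

print(x)  # [[3 7 1 4]]
print(y)  # [[3 3 7 1]]
```

Each target token is the input token one step earlier, which is exactly the "predict the preceding token" task described above.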

Step 5: Leveraging Lazy Evaluation and Kernel Fusion for Efficiency

Tinygrad employs lazy evaluation, deferring computation until results are explicitly requested. This enables kernel fusion, where multiple operations combine into a single optimized kernel, significantly boosting performance. We illustrate this by creating a complex tensor expression and measuring execution time only when the computation is realized.

N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)

print("Defining computation: (A @ B.T + A).sum()")
lazyresult = (a @ b.T + a).sum()
print("→ Computation deferred (lazy evaluation)")

print("Executing computation with .realize()...")
start = time.time()
result = lazyresult.realize()
elapsed = time.time() - start

print(f"✔ Computed in {elapsed1000:.2f}ms")
print(f"Result: {result.numpy():.4f}")
print("Note: Operations were fused into efficient kernels!")
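The deferred-execution idea can be illustrated with a toy pure-Python class. This is a hypothetical sketch of the concept only, not how tinygrad actually represents its graph: operations build up a description of the work, and nothing executes until .realize() is called.

```python
class Lazy:
    """Toy deferred value: stores a thunk, runs it only on .realize()."""
    def __init__(self, fn, label):
        self.fn, self.label, self._cache = fn, label, None

    def realize(self):
        if self._cache is None:          # compute once, then reuse the result
            print(f"executing: {self.label}")
            self._cache = self.fn()
        return self._cache

    def __add__(self, other):
        # Building the sum does no arithmetic -- it just records a bigger thunk
        return Lazy(lambda: self.realize() + other.realize(),
                    f"({self.label} + {other.label})")

a = Lazy(lambda: 2, "a")
b = Lazy(lambda: 3, "b")
expr = a + b            # nothing runs yet; only the graph is built
result = expr.realize() # now "executing: ..." messages appear
print(result)           # 5
```

A real framework gains more than deferred timing from this structure: because the whole expression is visible before execution, adjacent operations can be compiled together into a single fused kernel, which is what tinygrad does.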

Step 6: Creating and Testing Custom Activation Functions

To demonstrate Tinygrad’s flexibility, we implement a custom activation function combining element-wise multiplication and sigmoid. We verify that gradients correctly propagate through this function by performing backpropagation on a sample input, confirming the framework’s support for user-defined operations.

def custom_activation(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * x.sigmoid()

x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()

print(f"Input:    {x.numpy()}")
print(f"Custom Activation Output: {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")
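This activation is the SiLU (also called Swish), and its gradient can be verified independently of tinygrad: differentiating x·σ(x) by hand gives σ(x) + x·σ(x)·(1 − σ(x)), which we can compare against central finite differences in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)

def silu_grad(z):
    # Product rule on z * sigmoid(z), using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # same inputs as the tinygrad example
eps = 1e-5
numeric = (silu(z + eps) - silu(z - eps)) / (2 * eps)

print(silu_grad(z))   # should match tinygrad's x.grad above
print(numeric)        # and agree with the finite-difference estimate
```

A useful spot check: at z = 0 the derivative is exactly σ(0) = 0.5, which both methods reproduce.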

Summary of Key Learnings

  • Fundamentals of tensor operations and automatic differentiation
  • Designing custom neural network layers including attention and transformer blocks
  • Building a compact GPT-style language model from scratch
  • Implementing a training loop with the Adam optimizer on synthetic data
  • Understanding lazy evaluation and kernel fusion for computational efficiency
  • Creating and validating custom activation functions

By following this tutorial, you gain a transparent view into the mechanics of neural networks beyond high-level abstractions. Tinygrad's minimalistic yet powerful design empowers you to experiment with every detail, from tensor math to model architecture and optimization. This foundation prepares you for advanced explorations, such as integrating real-world datasets, extending model capabilities, or optimizing performance further.


Explore more tutorials and resources to deepen your understanding of deep learning internals and Tinygrad’s capabilities. Stay connected with the community for updates, discussions, and collaborative projects.
