Understanding the Limitations of Binary Cross-Entropy and the Advantages of Focal Loss in Imbalanced Classification
Binary cross-entropy (BCE) is widely used as the standard loss function for binary classification tasks. However, its effectiveness diminishes significantly when applied to datasets with severe class imbalance. The core issue lies in the fact that BCE treats errors from both classes with equal importance, regardless of how infrequent one class might be.
Why Binary Cross-Entropy Struggles with Imbalanced Data
Consider two prediction scenarios: one where a rare positive instance (minority class) with a true label of 1 is predicted with a probability of 0.3, and another where a common negative instance (majority class) with a true label of 0 is predicted at 0.7. Both cases yield the same BCE loss value of -log(0.3). But should these errors be penalized equally? In datasets where one class dominates, misclassifying the minority class is far more detrimental, yet BCE does not differentiate between these mistakes.
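The symmetry is easy to verify numerically. A minimal sketch of the two scenarios described above:

```python
import math

# Rare positive (true label 1) predicted at 0.3
loss_pos = -math.log(0.3)

# Common negative (true label 0) predicted at 0.7
loss_neg = -math.log(1 - 0.7)

print(loss_pos, loss_neg)  # both equal -log(0.3) ≈ 1.204
```

BCE assigns both mistakes identical loss, even though the missed positive may matter far more.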
Introducing Focal Loss: A Solution for Imbalanced Classification
Focal Loss addresses this imbalance by diminishing the influence of well-classified, easy examples and emphasizing the harder, often minority-class samples. This mechanism enables the model to concentrate on learning the subtle patterns of the underrepresented class rather than being overwhelmed by the majority class. This approach has gained traction in fields like medical imaging and fraud detection, where minority classes are critical yet scarce.
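The down-weighting comes from the modulating factor (1 - pt)^gamma in the standard focal formulation, -alpha * (1 - pt)^gamma * log(pt), where pt is the predicted probability of the true class (the same form implemented later in this article). A quick illustration of how strongly it suppresses easy examples:

```python
import math

def focal_term(pt, gamma=2.0, alpha=0.25):
    # (1 - pt)^gamma shrinks the loss when pt is close to 1 (easy example)
    return -alpha * (1 - pt) ** gamma * math.log(pt)

easy = focal_term(0.9)  # confidently correct: heavily down-weighted
hard = focal_term(0.3)  # misclassified: retains most of its loss
print(hard / easy)      # the hard example dominates by orders of magnitude
```

With gamma = 2, the hard example contributes hundreds of times more loss than the easy one, whereas under BCE the ratio would be only about 11.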
Setting Up the Experiment: Generating an Imbalanced Dataset
To illustrate the difference between BCE and Focal Loss, we generate a synthetic binary classification dataset with a pronounced 99:1 class imbalance using 6,000 samples. This setup mimics real-world scenarios such as rare disease detection, where the positive cases are extremely limited compared to negatives.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
# Create a highly imbalanced dataset
X, y = make_classification(
    n_samples=6000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.99, 0.01],
    class_sep=1.5,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
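It is worth confirming the skew the generator actually produced; because make_classification applies a small amount of label noise by default (flip_y), the split will not be exactly 99:1. A self-contained check:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=6000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99, 0.01],
    class_sep=1.5, random_state=42
)

# Count samples per class to verify the heavy imbalance
counts = np.bincount(y)
print(counts)  # majority class vastly outnumbers the minority
```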
Designing a Simple Neural Network Architecture
We implement a straightforward neural network with two hidden layers to maintain focus on the impact of the loss functions rather than model complexity. This architecture is sufficient to capture the decision boundary in our two-dimensional feature space and clearly demonstrate the contrasting behaviors of BCE and Focal Loss.
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)
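A quick shape check confirms what the final sigmoid layer guarantees: one probability per sample, bounded in (0, 1). This sketch rebuilds the same stack as a bare nn.Sequential so it runs standalone:

```python
import torch
import torch.nn as nn

# Same architecture as SimpleNN above, repeated here for a standalone check
model = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

out = model(torch.randn(5, 2))
print(out.shape)  # one column of probabilities, one row per sample
```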
Implementing Focal Loss for Enhanced Minority Class Learning
The Focal Loss function modifies the traditional BCE by applying a modulating factor that down-weights easy examples and focuses training on difficult, misclassified samples. The parameter gamma controls the rate at which easy examples are suppressed, while alpha balances the importance of the minority class. This tailored loss function is particularly effective in domains like anomaly detection, where rare events must be identified accurately.
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        # Clamp predictions away from 0 and 1 to avoid log(0)
        epsilon = 1e-7
        preds = torch.clamp(preds, epsilon, 1 - epsilon)
        # pt is the predicted probability of the true class
        pt = torch.where(targets == 1, preds, 1 - preds)
        loss = -self.alpha * (1 - pt) ** self.gamma * torch.log(pt)
        return loss.mean()
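A useful sanity check on this implementation: with alpha = 1 and gamma = 0 the modulating factor becomes 1, so Focal Loss should reduce to plain BCE. The class is repeated here so the check runs standalone:

```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    # Same definition as above
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        epsilon = 1e-7
        preds = torch.clamp(preds, epsilon, 1 - epsilon)
        pt = torch.where(targets == 1, preds, 1 - preds)
        loss = -self.alpha * (1 - pt) ** self.gamma * torch.log(pt)
        return loss.mean()

preds = torch.tensor([[0.9], [0.3], [0.6]])
targets = torch.tensor([[1.0], [1.0], [0.0]])

# With alpha=1, gamma=0 the modulating factor vanishes -> plain BCE
focal = FocalLoss(alpha=1.0, gamma=0)(preds, targets)
bce = nn.BCELoss()(preds, targets)
print(float(focal), float(bce))  # the two values agree
```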
Training and Evaluating Models with BCE and Focal Loss
We train two identical neural networks: one optimized with standard BCE loss and the other with Focal Loss. Both models are trained for 30 epochs using the Adam optimizer. The BCE model reaches a high overall accuracy (~98%), but this figure is misleading: with a 99:1 imbalance, a model that predicts the majority class for every sample would score about the same. The Focal Loss model's slightly higher accuracy (~99%) is more meaningful because the gain comes from correctly classifying more of the rare positive samples rather than from majority-class bias.
def train_model(model, loss_function, learning_rate=0.01, epochs=30):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    for _ in range(epochs):
        # Full-batch training: the entire training set in each step
        predictions = model(X_train)
        loss = loss_function(predictions, y_train)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        test_predictions = model(X_test)
        accuracy = ((test_predictions > 0.5).float() == y_test).float().mean().item()
    return accuracy, test_predictions.squeeze().numpy()
# Initialize models
model_bce = SimpleNN()
model_focal = SimpleNN()

# Train models
accuracy_bce, preds_bce = train_model(model_bce, nn.BCELoss())
accuracy_focal, preds_focal = train_model(model_focal, FocalLoss(alpha=0.25, gamma=2))

print(f"Test Accuracy with BCE: {accuracy_bce:.4f}")
print(f"Test Accuracy with Focal Loss: {accuracy_focal:.4f}")
Visualizing Decision Boundaries: BCE vs. Focal Loss
The decision boundary learned by the BCE model tends to be nearly flat, predominantly predicting the majority class and neglecting minority instances. This occurs because BCE is heavily influenced by the abundant majority samples. Conversely, the Focal Loss model delineates a more nuanced boundary, effectively capturing minority class regions and demonstrating its superior ability to learn from imbalanced data.
def plot_decision_boundary(model, title):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 300),
        np.linspace(y_min, y_max, 300)
    )
    grid_points = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    with torch.no_grad():
        Z = model(grid_points).reshape(xx.shape)

    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.4, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=10, edgecolors='k')
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

plot_decision_boundary(model_bce, "Decision Boundary with Binary Cross-Entropy")
plot_decision_boundary(model_focal, "Decision Boundary with Focal Loss")
Confusion Matrix Analysis: Highlighting Minority Class Recognition
Examining the confusion matrices reveals stark differences: the BCE-trained model correctly identifies only a single minority-class instance while misclassifying 27. This reflects its bias toward the majority class. In contrast, the Focal Loss model improves minority class recognition by correctly classifying 14 instances and reducing misclassifications to 14, showcasing its effectiveness in emphasizing challenging samples.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def display_confusion_matrix(true_labels, predicted_labels, title):
    cm = confusion_matrix(true_labels, predicted_labels)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap='Blues', values_format='d')
    plt.title(title)
    plt.show()

y_test_np = y_test.numpy().astype(int)
preds_bce_labels = (preds_bce > 0.5).astype(int)
preds_focal_labels = (preds_focal > 0.5).astype(int)

display_confusion_matrix(y_test_np, preds_bce_labels, "Confusion Matrix - BCE Loss")
display_confusion_matrix(y_test_np, preds_focal_labels, "Confusion Matrix - Focal Loss")
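The difference is even clearer when the confusion-matrix counts reported above are converted into minority-class recall, the metric that accuracy hides:

```python
# Minority-class recall from the confusion-matrix counts reported above
tp_bce, fn_bce = 1, 27        # BCE: one minority hit, 27 misses
tp_focal, fn_focal = 14, 14   # Focal: half of the minority class recovered

recall_bce = tp_bce / (tp_bce + fn_bce)
recall_focal = tp_focal / (tp_focal + fn_focal)
print(recall_bce, recall_focal)  # ≈ 0.036 vs 0.5
```

Recall on the rare class jumps from under 4% to 50%, a gain that a one-point accuracy difference completely obscures.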
By focusing training on difficult, minority-class examples, Focal Loss offers a robust alternative to binary cross-entropy for imbalanced classification problems. This approach is increasingly vital in applications such as fraud detection, rare event prediction, and medical diagnostics, where identifying the minority class accurately is crucial.
