NLP · DistilBERT · 2024.IV

On polarity, distillation,
and the cost of compute.

Abstract: Binary sentiment classification over a sampled Amazon-review corpus. TF-IDF logistic regression is pitted against DistilBERT fine-tuned on synonym- and LLM-augmented data, then further distilled into a smaller student. The student approaches the teacher's performance at roughly one-sixth the inference cost.
Method: preprocess → tokenise → TF-IDF baseline → fine-tune DistilBERT → distil-of-distil.
Data: 3.6 M reviews, balanced positive / negative labels.

Preprocessing

We lowercased the text, tokenised with nltk, and removed English stopwords and non-alphabetic tokens before vectorising for the TF-IDF baseline. For transformer input, reviews longer than 512 tokens were truncated.

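import nltk
from nltk.corpus import stopwords

# One-time downloads of the tokeniser model and the stopword list.
# (Newer NLTK releases may need 'punkt_tab' instead of 'punkt'.)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenise, and drop stopwords and non-alphabetic tokens."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(tokens)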
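
The TF-IDF baseline that consumes this cleaned text might look roughly like the sketch below. It assumes scikit-learn; the n-gram range, `max_iter`, the helper name `build_baseline`, and the toy reviews are all illustrative assumptions, not the settings or data used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_baseline():
    # Hypothetical hyperparameters, chosen only for illustration.
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )

# Made-up, already-preprocessed reviews (1 = positive, 0 = negative).
texts = [
    "great product love",
    "excellent quality works great",
    "love really great",
    "terrible waste money",
    "awful quality broke quickly",
    "terrible product total waste",
]
labels = [1, 1, 1, 0, 0, 0]

model = build_baseline()
model.fit(texts, labels)
preds = model.predict(["great love", "terrible waste"])
```

Wrapping the vectoriser and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leaking test-set terms into the features.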

Architecture

The transformer backbone is DistilBERT, a knowledge-distilled BERT that retains ~97% of the teacher's GLUE score with ~40% fewer parameters. We fine-tune it for binary classification, then train a smaller student on its soft outputs.
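
To make "training on soft outputs" concrete, here is a minimal pure-Python sketch of a common distillation objective: a temperature-softened KL term between teacher and student, blended with ordinary cross-entropy on the hard label. The write-up does not state the exact loss used, so the formulation, the function names, and the `temperature` and `alpha` values are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution, exposing the teacher's
    # relative confidence in the non-argmax class.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def combined_loss(student_logits, teacher_logits, label,
                  alpha=0.5, temperature=2.0):
    # alpha weights the soft (teacher) term against the hard-label term;
    # 0.5 is an arbitrary illustrative value.
    hard = -math.log(softmax(student_logits)[label])
    soft = distillation_loss(student_logits, teacher_logits, temperature)
    return alpha * soft + (1 - alpha) * hard
```

When the student's logits match the teacher's, the soft term is zero and only the hard-label term remains. In practice this would be computed on batched tensors in a deep-learning framework rather than on Python lists.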

DistilBERT student-teacher distillation diagram — Fig. 1 — Student-teacher distillation. The student mimics the teacher's soft output distribution.

Results

Three views of the same experiment. The baseline closes most of the gap to the transformer; augmentation closes the rest.

Accuracy / precision / recall / F1 across baselines — Fig. 2 — Random, TF-IDF LogReg, and a compute-limited DistilBERT. The baseline is stronger than you'd guess.
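
For reference, the four metrics in this comparison can be computed from confusion-matrix counts as in the sketch below. The helper name `binary_metrics` and the example labels are made up for illustration; they are not results from the experiment.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels only.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

On a balanced corpus like this one, accuracy and F1 tend to track each other closely, which is why a random classifier sits near 50% on both.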

Performance across training-set fractions — Fig. 3 — DistilBERT trained on 1% → 100% of the data. Returns flatten fast; 25% already clears 93%.
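
One way to draw the training fractions while keeping the positive / negative balance intact is a per-label random sample, sketched below. The write-up does not say how the subsets were drawn, so the stratified approach, the function name `stratified_fraction`, and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_fraction(examples, fraction, seed=0):
    """Sample `fraction` of (text, label) pairs, separately per label."""
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        # Keep at least one example per class, even at tiny fractions.
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    rng.shuffle(subset)
    return subset

# Placeholder corpus: 100 positive and 100 negative dummy reviews.
corpus = [(f"review {i}", i % 2) for i in range(200)]
quarter = stratified_fraction(corpus, 0.25)
```

Sampling per label keeps each subset balanced, so differences between fractions reflect data volume rather than a shifted class prior.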

Effect of LLM-based augmentation — Fig. 4 — With LLM augmentation on the limited set, we recover ~94% F1 — within a point of full-data training.

Lessons

Fine-tuning a distilled model on synthetic-augmented data is a surprisingly strong recipe when compute is constrained. A nested distillation (student-of-student) recovered most of the teacher's accuracy while cutting inference cost roughly six-fold, which is useful when the target is a laptop or an edge device rather than a datacentre.