DistilBERT

Introduction to the Research 🎯

This project investigates whether machine learning techniques can effectively classify political speeches in Spain according to party affiliation, and how such classifications can offer insights into ideological and rhetorical differences between parties. The work is situated at the intersection of Natural Language Processing (NLP) and political science, using speeches from Spain’s national parliament between 2015 and 2023. The ultimate goal is to understand whether the semantics of political discourse align with party lines—and if so, to what extent.

Context and Background 🏛️📚

Since the financial crisis of 2008, the Spanish political system has evolved from a stable two-party system into a complex and fragmented landscape. Alongside the traditional Spanish Socialist Workers’ Party (PSOE) and the Popular Party (PP), new entrants like Podemos, Ciudadanos, and Vox have shifted the ideological balance. This shift provides fertile ground to investigate how different parties express themselves in parliament and whether machine learning models can reliably distinguish between them based solely on language use.

Data Overview and Preprocessing 🔍🗂️

The dataset used originates from the ParlaMint project, which aggregates and annotates parliamentary debate transcripts. The raw data—composed of multiple .tsv and .txt files—required consolidation into a usable DataFrame. After filtering to retain only relevant speeches (e.g., by “diputado/a”), and removing non-informative interventions, the data was processed into four binary classification subsets:

  • PSOE vs. PP
  • Podemos vs. Vox
  • ERC vs. JxCat
  • Bildu vs. PNV
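As a rough illustration of the subset construction described above, the sketch below filters a list of speech records down to one party pair and attaches a 0/1 label. The record fields (`speaker`, `party`, `text`) and the helper name `build_binary_subset` are hypothetical, and the convention that the first party of each pair is labelled 0 is an assumption that simply mirrors the PSOE = 0, PP = 1 framing used later in the text.

```python
# Party pairs named in the text; by assumption the first party gets label 0, the second label 1.
PARTY_PAIRS = [("PSOE", "PP"), ("Podemos", "Vox"), ("ERC", "JxCat"), ("Bildu", "PNV")]


def build_binary_subset(records, pair):
    """Keep only speeches from the two parties in `pair` and attach a 0/1 label."""
    negative, positive = pair
    label_of = {negative: 0, positive: 1}
    return [
        {**rec, "label": label_of[rec["party"]]}
        for rec in records
        if rec["party"] in label_of
    ]


# Toy records standing in for rows of the consolidated DataFrame (field names are hypothetical).
records = [
    {"speaker": "A", "party": "PSOE", "text": "..."},
    {"speaker": "B", "party": "PP", "text": "..."},
    {"speaker": "C", "party": "Vox", "text": "..."},
    {"speaker": "D", "party": "Podemos", "text": "..."},
]

subsets = {pair: build_binary_subset(records, pair) for pair in PARTY_PAIRS}
```

Each subset keeps the original record fields, so the speaker identifier remains available for the later train/test split.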

To keep the evaluation fair, the training and test sets were split at the speaker level, so that no speaker appears in both. The text was tokenized and lemmatized, and common names and party references were removed so that the models could not rely on these lexical shortcuts. Additionally, speeches were segmented into chunks of approximately 100–300 tokens to standardize length and control for verbosity.

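# Example: token slicing to standardize segment length
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" tokenizer data

def segment_speech(text, min_len=100, max_len=300):
    """Split a speech into chunks of at most max_len tokens; drop chunks shorter than min_len."""
    tokens = word_tokenize(text)
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    # Count tokens on the chunk itself rather than re-tokenizing the joined string
    return [" ".join(chunk) for chunk in chunks if len(chunk) >= min_len]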
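The speaker-level split mentioned above could be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the `speaker` field, the 20% test fraction, and the fixed seed are all assumptions. In practice a grouped splitter such as scikit-learn's `GroupShuffleSplit` serves the same purpose.

```python
import random


def speaker_level_split(records, test_fraction=0.2, seed=0):
    """Split records so that every speaker lands entirely in train or entirely in test."""
    speakers = sorted({rec["speaker"] for rec in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [rec for rec in records if rec["speaker"] not in test_speakers]
    test = [rec for rec in records if rec["speaker"] in test_speakers]
    return train, test


# Toy data: 10 hypothetical speakers with 3 speech segments each.
toy_records = [
    {"speaker": f"speaker_{i}", "text": f"segment {j}"}
    for i in range(10)
    for j in range(3)
]
train_set, test_set = speaker_level_split(toy_records)
```

Splitting by speaker rather than by segment matters because segments from the same speech share vocabulary and style; a segment-level split would leak that information into the test set.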

Modelling Approach 🧠📈

Two classifiers were employed: Random Forest and XGBoost. Both were trained on the PSOE–PP dataset and then evaluated on held-out speakers from the same dataset, as well as on the other party pairs without retraining. This transfer setup tested how well the models generalize across ideological and regional divides.

Model predictions were framed probabilistically: a score close to 0 suggested similarity to PSOE-style rhetoric, while a score near 1 indicated alignment with PP rhetoric. Using both models provided a robustness check: Random Forest produced predictions concentrated toward the middle of the range, while XGBoost produced more widely spread, more confident scores.

Performance and Findings 📊✅

The best model, XGBoost, achieved out-of-sample AUC scores around 0.9 for PSOE–PP classification, indicating strong discriminatory power even when the speakers in the test set had not been seen during training. These results suggest that party-affiliated language is sufficiently distinct for ML models to exploit.

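# XGBoost training snippet
# X_train, y_train and X_test are assumed to hold the n-gram features and 0/1 labels (PSOE = 0, PP = 1)
from xgboost import XGBClassifier

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)
y_pred_proba = xgb.predict_proba(X_test)[:, 1]  # predicted probability of the positive class (PP)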
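For readers less familiar with the metric, the AUC quoted above can be read as the probability that a randomly chosen PP segment receives a higher score than a randomly chosen PSOE segment. The sketch below computes it directly from that definition on made-up toy values; the function name `roc_auc` is hypothetical. The pairwise loop is quadratic, so on real data a library routine such as scikit-learn's `roc_auc_score(y_true, y_score)` would be the usual choice.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, not results from the project.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 3 of the 4 positive/negative pairs are ordered correctly
```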

The results were visualized using prediction distribution plots that mimicked a political spectrum. For example, PSOE speeches clustered around 0, while PP speeches gathered near 1. These visualizations reinforced the model’s semantic sensitivity to ideological nuances.

[Figure: PSOE–PP prediction distribution]
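The prediction distribution plots described above come down to binning the predicted scores separately for each party. A minimal sketch of that binning step, using made-up toy scores rather than the project's outputs and a hypothetical helper name, might look like this:

```python
def score_histogram(scores, n_bins=10):
    """Count scores in equal-width bins over [0, 1]; a score of exactly 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for score in scores:
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy scores for illustration only.
psoe_scores = [0.05, 0.12, 0.18, 0.31, 0.55]
pp_scores = [0.48, 0.72, 0.86, 0.91, 1.0]

psoe_hist = score_histogram(psoe_scores, n_bins=5)
pp_hist = score_histogram(pp_scores, n_bins=5)
```

Plotting the two count lists side by side gives the spectrum-like picture: mass near 0 for one party and mass near 1 for the other.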

Feature Importances and Interpretation 🧩📝

The top influential n-grams for classification included terms like derecha, ciudadanía, crear empleo, recorte, and independentista. Many of these terms relate to socio-economic policy or political identity. Based on the polarity of these terms in known contexts, it is inferred that left-leaning parties tend to use progressive and collectivist terminology, while right-leaning parties focus on order, national unity, and criticism of the opposing ideology.

Term 1            Value     Term 2           Value     Term 3                Value
derecha           0.038142  izquierda        0.004678  término               0.002546
ciudadanía        0.024661  asunto           0.004543  comentar              0.002514
transformación    0.011111  elemento         0.004511  proyecto presupuesto  0.002503
socio             0.010764  2016             0.004418  millón español        0.002464
crear empleo      0.008942  recortar         0.004065  convalidación real    0.002435
cuestión          0.008694  ultraderecha     0.004033  mundo                 0.002373
evidentemente     0.008617  extraordinario   0.003891  español española      0.002241
diputado bien     0.007627  digno            0.003856  pagar                 0.002240
transición        0.007456  destruir         0.003501  español tener         0.002175
respecto          0.007144  indicar          0.003105  ciudadano ciudadana   0.002162
creación empleo   0.006475  colectivo        0.003102  historia              0.002135
pandemia          0.006063  inversión        0.003018  periodo               0.002118
empleo            0.005920  ciudadana        0.002989  formula               0.002117
recorte           0.005070  amnistía fiscal  0.002922  constitución          0.002065
mayoría absoluto  0.004873  2015             0.002893  tejido productivo     0.002044
independentista   0.004827  propaganda       0.002889  trabajador            0.002038
necesidad         0.004817  crecer           0.002835
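A ranking like the one in the table can be produced by pairing each n-gram with its importance score and sorting. The sketch below assumes the names and scores are already available as two parallel lists; with scikit-learn-style estimators these would typically come from the fitted model's `feature_importances_` attribute and the vectorizer's `get_feature_names_out()`, though the write-up does not say which vectorizer was used. The helper name `top_features` is hypothetical, and the four values are copied from the table.

```python
def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the highest importance, in descending order."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# A few entries taken from the table, deliberately out of order.
names = ["empleo", "derecha", "recorte", "ciudadanía"]
importances = [0.005920, 0.038142, 0.005070, 0.024661]
top3 = top_features(names, importances, k=3)
```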

This insight was further illustrated by speech excerpts. For instance, a Vox speech on national identity was classified similarly to PP, while a feminist Podemos speech aligned with PSOE. These examples highlight the classifier’s ability to map ideological content to party identity, even beyond the training set.

Transfer Learning on Other Party Pairs 🔄🇪🇸

Generalization to Podemos–Vox was particularly successful, yielding an AUC of 0.86. This suggests a semantic continuum within ideological blocks: language that separates PSOE from PP also separates Podemos from Vox. However, performance dropped for ERC–JxCat and Bildu–PNV, likely because their region-specific discourse is not represented in the PSOE–PP training data. This finding underlines the difficulty of modeling regional political identity with a model trained on national-level rhetoric.

[Figure: Podemos–Vox]

[Figure: ERC–JxCat]
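One simple way to inspect this kind of transfer is to apply the PSOE–PP model to segments from another party pair and look at where each party's average score falls on the 0–1 axis. The sketch below shows only that aggregation step; the helper name is hypothetical and the scores are made-up toy values, not the project's results.

```python
from collections import defaultdict


def mean_score_by_party(parties, scores):
    """Average the model's predicted scores separately for each party."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for party, score in zip(parties, scores):
        totals[party] += score
        counts[party] += 1
    return {party: totals[party] / counts[party] for party in totals}


# Toy scores for illustration only.
parties = ["Podemos", "Podemos", "Vox", "Vox"]
scores = [0.125, 0.375, 0.75, 1.0]
means = mean_score_by_party(parties, scores)
```

A party whose mean sits near 0 is being read by the model as PSOE-like, and one near 1 as PP-like; means that both hover around 0.5 would indicate that the PSOE–PP axis does not separate the pair.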

Key Insights and Challenges ✨🧠

  • Semantic Consistency: Ideologically similar parties share rhetorical patterns, allowing cross-party generalization.
  • Model Robustness: Removing lexical “giveaways” like names and party references improved model generality.
  • Domain-Specificity: Regional party rhetoric requires region-specific training data or domain adaptation strategies.

These insights stress the importance of thoughtful data curation and model validation when applying ML in political analysis.
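The removal of lexical giveaways mentioned above could be sketched with a regular expression, as below. The term list is a hypothetical example, not the list used in the project, and the helper names are illustrative. Matching is case-sensitive on purpose, since words such as "podemos" and "ciudadanos" are also ordinary Spanish vocabulary.

```python
import re


def build_mask_pattern(terms):
    """Compile one pattern matching any of the terms as whole words (case-sensitive)."""
    # Longest terms first so multi-word names are matched before their parts.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b")


def mask_giveaways(text, pattern, placeholder=""):
    """Replace every matched term with the placeholder and collapse leftover whitespace."""
    masked = pattern.sub(placeholder, text)
    return re.sub(r"\s+", " ", masked).strip()


# Hypothetical list of party references; a real list would also cover speaker names.
GIVEAWAYS = ["PSOE", "PP", "Partido Popular", "Partido Socialista", "Podemos", "Vox"]
pattern = build_mask_pattern(GIVEAWAYS)
cleaned = mask_giveaways("El Partido Popular y el PSOE votaron en contra.", pattern)
```

Sentence-initial uses of an ordinary word like "Podemos" would still be masked by this sketch, so a production version would need a more careful rule, for example one based on part-of-speech tags.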

Final Thoughts and Future Directions 🧭📚

This project demonstrates that political ideology is not only present but quantifiable in parliamentary speech. With adequate preprocessing and model selection, it is possible to classify party affiliation with considerable accuracy. Future work could include:

  • Incorporating temporal dynamics to analyze discourse evolution.
  • Fine-tuning transformer-based models (e.g., BERT) on political corpora.
  • Expanding to multilingual or cross-country parliamentary data for broader applicability.

In sum, the project shows that machine learning and NLP are not just capable of detecting ideological lines—they can also provide meaningful insights into how these lines are drawn through language. 🧾📢📐