WSDM Chatbot Arena – Fine‑Tuning Gemma‑2B with QLoRA & PEFT

Python · PyTorch · Hugging Face Transformers · Gemma‑2B · bitsandbytes (4‑bit quantization) · PEFT / LoRA adapters · QLoRA · Google Colab (A100 GPU)

Problem

The WSDM Chatbot Arena competition asks participants to predict which of two chatbot responses a human would prefer for a given user prompt. Each training sample comprises a user query, two responses from different large language models, and a human‑annotated label indicating the preferred response. The task is cast as a binary sequence‑classification problem: given the prompt and both responses concatenated together, the model must decide whether response A or response B is preferred.
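
As a concrete illustration, one way to flatten a training row into a single classification example is sketched below; the column names (prompt, response_a, response_b, winner) and the label convention are assumptions about the competition data rather than a verbatim excerpt from the notebook.

```python
# Minimal sketch: flatten one training row into (text, label) for binary
# sequence classification. Column names and label convention are assumed.
def build_example(row):
    text = (
        "Prompt: " + row["prompt"]
        + "\n\nResponse A: " + row["response_a"]
        + "\n\nResponse B: " + row["response_b"]
    )
    label = 0 if row["winner"] == "model_a" else 1  # 0 = A preferred, 1 = B preferred
    return {"text": text, "label": label}
```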

Model

A Gemma‑2 2B model is used as the base; the architecture is Gemma2ForSequenceClassification from the Hugging Face Transformers library.

  • To fit within resource constraints (a single A100 GPU), the model is quantized to 4‑bit weights via the bitsandbytes library and then equipped with LoRA adapters (QLoRA/PEFT). The first six transformer layers are frozen to preserve the base model’s semantics, and only the upper layers (with LoRA adapters) are trained; see the sketch after this list.

  • The input is truncated using a head‑and‑tail strategy: the tokenized prompt and responses are shortened by retaining the first and last segments to preserve important context.
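
A minimal sketch of this setup with Transformers, bitsandbytes, and PEFT follows; the checkpoint name, LoRA rank/alpha, target modules, and the exact freezing mechanism are illustrative assumptions rather than the notebook’s actual configuration.

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, Gemma2ForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "google/gemma-2-2b"  # assumed checkpoint name

# 4-bit NF4 quantization (QLoRA-style) so the model fits on a single A100.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma2ForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,                     # binary preference: response A vs. B
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id

# Attach LoRA adapters; rank, alpha, and target modules are illustrative.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

# Freeze everything in the lowest six transformer layers so only the upper
# layers' adapters (plus the classification head) receive gradient updates.
for name, param in model.named_parameters():
    if any(f"layers.{i}." in name for i in range(6)):
        param.requires_grad = False
```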

Fine‑Tuning Strategies

Quantization & PEFT – The model uses 4‑bit quantization and parameter‑efficient fine‑tuning with LoRA adapters.

Layer Freezing – The lower six layers of the transformer are frozen, while LoRA adapters are applied to the higher layers.

Truncation – Long conversations are handled by keeping the first and last parts of the prompt–response sequence (head + tail), capping inputs at ~1,900 tokens.
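
A sketch of such a head‑and‑tail truncation is shown below; the ~1,900‑token cap comes from the write‑up, while the particular head/tail split is an illustrative assumption.

```python
def head_tail_truncate(token_ids, max_len=1900, head_len=1300):
    """Keep the first `head_len` and the last `max_len - head_len` tokens.

    The ~1,900-token cap follows the write-up; the head/tail proportions
    used here are an assumption for illustration.
    """
    if len(token_ids) <= max_len:
        return token_ids
    tail_len = max_len - head_len
    return token_ids[:head_len] + token_ids[-tail_len:]
```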

Data Handling – The multilingual dataset is split 70%/10%/20% into training/validation/test sets and tokenized with Gemma’s tokenizer.
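
One way to produce these splits with the Hugging Face datasets library is sketched below; `df` and `tokenizer` refer to the earlier sketches, and the seed and exact API usage are assumptions.

```python
from datasets import Dataset

# Hypothetical: `df` is a pandas DataFrame with "text" and "label" columns
# built as in the earlier build_example sketch.
dataset = Dataset.from_pandas(df)

# 70% train, 10% validation, 20% test (proportions per the write-up).
split = dataset.train_test_split(test_size=0.30, seed=42)
val_test = split["test"].train_test_split(test_size=2 / 3, seed=42)
train_ds, val_ds, test_ds = split["train"], val_test["train"], val_test["test"]

def tokenize(batch):
    # Tokenize with Gemma's tokenizer; the length cap matches the truncation above.
    return tokenizer(batch["text"], truncation=True, max_length=1900)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
```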

Early Stopping & Dynamic Padding – Training uses early stopping (patience = 3) and dynamic padding via DataCollatorWithPadding.
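
A minimal sketch of these two pieces is shown below; both objects are passed to the Trainer in the hyperparameter sketch at the end of this section.

```python
from transformers import DataCollatorWithPadding, EarlyStoppingCallback

# Pad each batch only up to its longest sequence instead of a fixed length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Halt training once the validation metric stops improving for 3 evaluations
# in a row (patience = 3, per the write-up).
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```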

Hyperparameters

The notebook defines a train_model function whose key hyperparameters (learning rate, batch size, epochs, gradient accumulation steps, etc.) are summarized in the table below.

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| Learning rate | 2 × 10⁻⁵ | Controls how much the model’s weights are updated at each step. A small value like 2e‑5 is common in PEFT/LoRA fine‑tuning because the base model already contains useful knowledge. |
| Batch size | 1 per device (train & eval) | A batch size of 1 keeps memory usage low for a 2‑billion‑parameter model. |
| Number of epochs | 1 | Single‑epoch fine‑tuning. |
| Gradient accumulation steps | 8 | Accumulating gradients over 8 mini‑batches before each optimizer step simulates an effective batch size of 8, which reduces the variance of gradient estimates and leads to more stable updates. |
| Weight decay | 0.01 | This L2‑style regularization term penalizes large weights, discouraging overly complex solutions and improving generalization to unseen data. |
| Warm‑up ratio | 0.1 | Linearly increasing the learning rate over the first 10% of training helps stabilize the early phase, when weights are still adjusting to the new task. |
| Scheduler | Cosine decay | After the warm‑up phase, a cosine schedule gradually reduces the learning rate, allowing larger updates early on and finer adjustments as training progresses. |
| Optimizer | AdamW (fused) | AdamW combines adaptive learning rates with decoupled weight decay, offering fast convergence and robust performance for transformer models. |
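
Putting the table together, a sketch of a corresponding TrainingArguments/Trainer setup is shown below; the output directory, evaluation/saving cadence, bf16 flag, and the monitored metric are illustrative assumptions, while the remaining values mirror the table. The model, datasets, collator, and callback come from the earlier sketches.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="gemma2b-wsdm",          # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,                           # assumed; reasonable on an A100
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",   # assumed monitored metric
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    callbacks=[early_stopping],
)
trainer.train()
```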

Collectively, these hyperparameters are chosen to balance efficiency and effectiveness: they enable fine‑tuning a large quantized model on modest hardware while maintaining training stability and minimizing overfitting.