The WSDM Chatbot Arena competition asks participants to predict which of two chatbot responses a human would prefer for a given user prompt. Each training sample comprises a user query, two responses from different large language models, and a human-annotated label indicating the preferred response. The task is cast as a binary sequence-classification problem: given the prompt and the two responses concatenated together, the model must decide whether response A or response B is preferred.
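One plausible way to assemble each sample into a single classification input is sketched below; the prompt template, field markers, and label convention are illustrative assumptions, not the notebook's exact code.

```python
# Hypothetical formatting helper: flatten one training example into a single
# text sequence for binary sequence classification (template assumed).
def build_input(prompt: str, response_a: str, response_b: str) -> str:
    return (
        f"<prompt>: {prompt}\n\n"
        f"<response_a>: {response_a}\n\n"
        f"<response_b>: {response_b}"
    )

# Assumed label convention: 0 -> response A preferred, 1 -> response B preferred.
```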
A Gemma-2 2B model is used as the base; the architecture is Gemma2ForSequenceClassification from the Hugging Face Transformers library.
To fit within resource constraints (single A100 GPU), the model is quantized to 4‑bit weights via the bitsandbytes library and then equipped with LoRA adapters (QLoRA/PEFT). The first six transformer layers are frozen to preserve the base model’s semantics, and only the upper layers (with LoRA adapters) are trained.
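A minimal sketch of this setup using bitsandbytes and PEFT is shown below. The checkpoint name, LoRA rank/alpha/dropout, target modules, and the attribute path used for layer freezing are assumptions rather than values taken from the notebook.

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, Gemma2ForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "google/gemma-2-2b"  # assumed checkpoint

# 4-bit NF4 quantization via bitsandbytes (typical QLoRA settings; exact values assumed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = Gemma2ForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,                      # response A vs. response B
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters (rank, alpha, dropout, and target modules assumed).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# Freeze the first six transformer layers (including their LoRA parameters) so that
# only the upper layers are trained; the attribute path reflects PEFT's wrapper nesting.
for layer in model.base_model.model.model.layers[:6]:
    for param in layer.parameters():
        param.requires_grad = False
```

An alternative to the freezing loop is to restrict LoRA to the upper layers directly via LoraConfig's layers_to_transform argument.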
The input is truncated using a head‑and‑tail strategy: the tokenized prompt and responses are shortened by retaining the first and last segments to preserve important context.
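A minimal sketch of this truncation, assuming the ~1,900-token cap mentioned below and an even split between head and tail:

```python
# Head-and-tail truncation: keep the beginning and end of over-long sequences
# (the 50/50 head/tail split is an assumption).
MAX_LEN = 1900
HEAD_LEN = MAX_LEN // 2
TAIL_LEN = MAX_LEN - HEAD_LEN

def head_tail_truncate(token_ids: list[int]) -> list[int]:
    """Return the sequence unchanged if short enough, else its first HEAD_LEN
    and last TAIL_LEN tokens."""
    if len(token_ids) <= MAX_LEN:
        return token_ids
    return token_ids[:HEAD_LEN] + token_ids[-TAIL_LEN:]
```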
- Quantization & PEFT – The model uses 4-bit quantization and parameter-efficient fine-tuning with LoRA adapters.
- Layer Freezing – The lower six layers of the transformer are frozen while LoRA adapters are applied to the higher layers.
- Truncation – Long conversations are handled by keeping the first and last parts of the prompt–response sequence (head + tail), capping inputs at ~1,900 tokens.
- Data Handling – The multilingual dataset is split 70 %/10 %/20 % for training/validation/test and tokenized using Gemma's tokenizer.
- Early Stopping & Dynamic Padding – Training uses early stopping (patience = 3) and dynamic padding via DataCollatorWithPadding; see the sketch after this list.
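The data-handling and training-control pieces above might be wired up roughly as follows; the DataFrame df, its "text" and "labels" columns, the random seed, and the reuse of the tokenizer and head_tail_truncate helpers from the earlier sketches are assumptions.

```python
from datasets import Dataset
from transformers import DataCollatorWithPadding, EarlyStoppingCallback

# `df` is assumed to hold the formatted prompt–response text and a label column
# (0 = response A preferred, 1 = response B preferred).
dataset = Dataset.from_pandas(df)

# 70 % / 10 % / 20 % train/validation/test split (seed value assumed).
split = dataset.train_test_split(test_size=0.30, seed=42)
valid_test = split["test"].train_test_split(test_size=2 / 3, seed=42)
train_ds, valid_ds, test_ds = split["train"], valid_test["train"], valid_test["test"]

def tokenize_fn(batch):
    # Tokenize with Gemma's tokenizer, then apply head-and-tail truncation per example.
    enc = tokenizer(batch["text"], truncation=False)
    enc["input_ids"] = [head_tail_truncate(ids) for ids in enc["input_ids"]]
    enc["attention_mask"] = [[1] * len(ids) for ids in enc["input_ids"]]
    return enc

train_ds, valid_ds, test_ds = (
    ds.map(tokenize_fn, batched=True) for ds in (train_ds, valid_ds, test_ds)
)

# Dynamic padding: each batch is padded only to the longest sequence in that batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Stop training if the validation metric fails to improve for 3 consecutive evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```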
The notebook defines a train_model function that exposes the key hyperparameters (learning rate, per-device batch size, number of epochs, gradient accumulation steps, and so on); the values used are summarized in the table below.
Hyperparameter | Value | Notes |
---|---|---|
Learning rate | 2 × 10⁻⁵ | The learning rate controls how much the model’s weights are updated at each step. A small value like 2e‑5 is common in PEFT/LoRA fine‑tuning because the base model already contains useful knowledge |
Batch size | 1 per device (train & eval) | A batch size of 1 keeps memory usage low for a large 2-billion-parameter model |
Number of epochs | 1 | A single pass over the training data keeps training time manageable and limits overfitting of the LoRA adapters |
Gradient accumulation steps | 8 | Accumulating gradients over 8 mini‑batches before performing an optimizer step effectively simulates a batch size of 8, which reduces the variance of gradient estimates and leads to more stable updates |
Weight decay | 0.01 | This L2 regularization term penalizes large weights, discouraging overly complex solutions and improving generalization to unseen data |
Warm‑up ratio | 0.1 | Starting with a small learning rate and linearly increasing it over the first 10 % of training helps stabilize early training when weights are still adjusting to the new task |
Scheduler | Cosine decay | After the warmup phase, a cosine schedule gradually reduces the learning rate, allowing larger updates early on and finer adjustments as training progresses |
Optimizer | AdamW (fused) | AdamW combines adaptive learning rates with weight decay, offering fast convergence and robust performance for transformer models. |
Collectively, these hyperparameters balance efficiency and effectiveness: they enable fine-tuning a large quantized model on modest hardware while maintaining training stability and minimizing overfitting.
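As a rough sketch, these settings map onto Hugging Face TrainingArguments as shown below; the output directory, evaluation/save cadence, bf16 flag, and the reuse of the model, datasets, collator, and early-stopping callback from the earlier sketches are assumptions, not details taken from the notebook.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gemma2-wsdm-lora",     # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    gradient_accumulation_steps=8,     # effective batch size of 8
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    eval_strategy="steps",             # evaluation/save cadence assumed
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    bf16=True,                         # assumed for A100 hardware
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=data_collator,
    callbacks=[early_stopping],
)
trainer.train()
```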