![](https://crypto4nerd.com/wp-content/uploads/2023/08/12BU-b4WEH4eicnh0Hj82VQ-1024x399.png)
I used the almost unmodified finetune_guanaco_7b.sh from the QLoRA repository. The parameters I changed were max_memory_MB (trained on 2x RTX 4090 with 24 GB VRAM each on RunPod), an increased per_device_eval_batch_size (for speed), removing do_mmlu_eval (also for speed), and custom model paths. I intentionally kept per_device_train_batch_size unchanged, since changing it might affect the model's quality, even though training was painfully slow.
python qlora.py \
    --model_name_or_path "/workspace/models/llama2-7b-hf" \
    --use_auth \
    --output_dir /workspace/loras/llama2-guanaco-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 100 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 16 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 6000 \
    --eval_steps 100 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --report_to wandb \
    --max_memory_MB 23000
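Training writes LoRA adapter checkpoints to the output directory every save_steps. As a minimal sketch (not from the QLoRA repo, and assuming the checkpoint layout, the paths from the command above, and the Guanaco-style prompt format), one of them can be loaded for inference with transformers and peft like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_path = "/workspace/models/llama2-7b-hf"                        # --model_name_or_path
adapter_path = "/workspace/loras/llama2-guanaco-7b/checkpoint-500"  # one of the --save_steps checkpoints

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # --bits 4
    bnb_4bit_quant_type="nf4",              # --quant_type nf4
    bnb_4bit_use_double_quant=True,         # --double_quant
    bnb_4bit_compute_dtype=torch.bfloat16,  # --bf16
)

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(
    base_path, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA weights

prompt = "### Human: Why is chatbot evaluation hard?### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))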
Chatbot evaluation is hard. Was hard — a lot has happened since I wrote this post on automated chatbot evaluation three months ago.
Human evaluation
The gold standard of chatbot performance is how well a human likes talking to it (aka the downstream task). LMSYS came up with the Chatbot Arena, a ChatGPT-like website that shows the answers of two models side by side, anonymised; the identity of the models is not revealed. The user then selects which answer they prefer. Many models in the arena and thousands of pairwise comparisons led to a single human evaluation score for each model. See this for details.
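The arena leaderboard turns these pairwise votes into Elo-style ratings. As a rough illustration of how a single vote moves two ratings (not LMSYS's exact computation, just the textbook Elo update):

def elo_update(rating_a, rating_b, a_wins, k=32):
    # expected score of A under the Elo model
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0): the winner gains what the loser drops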
GPT-4 based evaluation: MT-Bench
The LMSYS researchers at Berkeley and others have built on their original Vicuna benchmark, in which each model was asked 80 questions and the answers were passed to GPT-4, which assigned each answer a rating from 1 to 10.
The successor of the Vicuna benchmark is MT-Bench (code, paper), which evaluates multi-turn conversations.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses.
Having both human evaluation data and a new automated benchmark, they found that MT-Bench performs very well, apparently “achieving over 80% agreement, the same level of agreement between humans”.
It was a pleasant surprise to see how easy their code was to use.
Generating model answers
python3 gen_model_answer.py --model-path models/llama-2-7b-guanaco-checkpoint-500 --model-id llama-2-7b-guanaco-checkpoint-500
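Since I wanted answers for every saved checkpoint, a small loop does the same call per checkpoint (a convenience sketch, not part of the FastChat repo; the model paths follow the pattern above):

import subprocess

checkpoints = [500, 1000, 1500, 2000, 2500, 3000, 3500]
for step in checkpoints:
    model_id = f"llama-2-7b-guanaco-checkpoint-{step}"
    subprocess.run(
        ["python3", "gen_model_answer.py",
         "--model-path", f"models/{model_id}",
         "--model-id", model_id],
        check=True,  # stop if one generation run fails
    )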
Evaluation by judge GPT-4
export OPENAI_API_KEY=not_a_real_OpenAI_key
# please ignore the weird model paths
models="alpaca-13b
gpt-4
llama-2-7b-guanaco-checkpoint-500
llama-2-7b-guanaco-checkpoint-1000
llama-2-7b-guanaco-checkpoint-1500
llama-2-7b-guanaco-checkpoint-2000
llama-2-7b-guanaco-checkpoint-2500
llama-2-7b-guanaco-checkpoint-3000
llama-2-7b-guanaco-checkpoint-3500"
python3 gen_judgment.py --mode pairwise-baseline --model-list ${models} --parallel 4
The pairwise-baseline mode compares each model's answers to a baseline model (GPT-3.5 by default). As controls, I added answers from alpaca-13b (negative) and GPT-4 (positive).
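If I remember correctly, the summary below is printed by the repo's show_result.py (something like python3 show_result.py --mode pairwise-baseline).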
Mode: pairwise-baseline
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
| model | win | loss | tie | win_rate | loss_rate | win_rate_adjusted |
|---|---:|---:|---:|---:|---:|---:|
| gpt-4 | 102 | 17 | 41 | 0.63750 | 0.10625 | 0.765625 |
| llama-2-7b-guanaco-checkpoint-2500 | 24 | 98 | 38 | 0.15000 | 0.61250 | 0.268750 |
| llama-2-7b-guanaco-checkpoint-3500 | 22 | 99 | 39 | 0.13750 | 0.61875 | 0.259375 |
| llama-2-7b-guanaco-checkpoint-3000 | 21 | 100 | 39 | 0.13125 | 0.62500 | 0.253125 |
| llama-2-7b-guanaco-checkpoint-2000 | 20 | 107 | 33 | 0.12500 | 0.66875 | 0.228125 |
| llama-2-7b-guanaco-checkpoint-1500 | 12 | 109 | 39 | 0.07500 | 0.68125 | 0.196875 |
| llama-2-7b-guanaco-checkpoint-1000 | 10 | 119 | 31 | 0.06250 | 0.74375 | 0.159375 |
| llama-2-7b-guanaco-checkpoint-500 | 6 | 129 | 25 | 0.03750 | 0.80625 | 0.115625 |
| alpaca-13b | 6 | 134 | 20 | 0.03750 | 0.83750 | 0.100000 |
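One note on the columns: win_rate_adjusted is consistent with counting each tie as half a win, which you can verify against the gpt-4 row (a quick check, not part of the script output):

win, loss, tie = 102, 17, 41                   # the gpt-4 row above
total = win + loss + tie                       # 160 pairwise judgments
win_rate = win / total                         # 0.637500
win_rate_adjusted = (win + 0.5 * tie) / total  # 0.765625
print(win_rate, win_rate_adjusted)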