![](https://crypto4nerd.com/wp-content/uploads/2023/08/12BU-b4WEH4eicnh0Hj82VQ-1024x399.png)
I used the almost unmodified finetune_guanaco_7b.sh from the QLoRA repository. The parameters I changed were max_memory_MB (trained on 2x RTX 4090 with 24 GB VRAM each on RunPod), an increased per_device_eval_batch_size (for speed), removing do_mmlu_eval (also for speed), and custom model paths. I intentionally kept per_device_train_batch_size unchanged, since changing it might affect the model's quality, even though training was painfully slow.
python qlora.py \
    --model_name_or_path "/workspace/models/llama2-7b-hf" \
    --use_auth \
    --output_dir /workspace/loras/llama2-guanaco-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 100 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 16 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 6000 \
    --eval_steps 100 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --report_to wandb \
    --max_memory_MB 23000
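Training writes LoRA adapter checkpoints to the output directory every save_steps. As a minimal sketch (not from the QLoRA repo, and assuming the checkpoint layout, the paths from the command above, and the Guanaco-style prompt format), one of them can be loaded for inference with transformers and peft like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_path = "/workspace/models/llama2-7b-hf"                        # --model_name_or_path
adapter_path = "/workspace/loras/llama2-guanaco-7b/checkpoint-500"  # one of the --save_steps checkpoints

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # --bits 4
    bnb_4bit_quant_type="nf4",              # --quant_type nf4
    bnb_4bit_use_double_quant=True,         # --double_quant
    bnb_4bit_compute_dtype=torch.bfloat16,  # --bf16
)

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(
    base_path, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA weights

prompt = "### Human: Why is chatbot evaluation hard?### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))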
Chatbot evaluation is hard. Was hard — a lot has happened since I wrote this post on automated chatbot evaluation three months ago.
Human evaluation
The gold standard of chatbot performance is how well a human likes talking to it (aka the downstream task). LMSYS came up with the Chatbot Arena, a ChatGPT-like website that shows the answers of two models side by side, anonymised; the identity of the models is not revealed. The user then selects which answer they prefer. Many models in the arena and thousands of pairwise comparisons led to a single human evaluation score for each model. See this for details.
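The arena leaderboard turns these pairwise votes into Elo-style ratings. As a rough illustration of how a single vote moves two ratings (not LMSYS's exact computation, just the textbook Elo update):

def elo_update(rating_a, rating_b, a_wins, k=32):
    # expected score of A under the Elo model
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0): the winner gains what the loser drops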
GPT-4 based evaluation: MT-Bench
The LMSYS researchers at Berkeley and others have built on their original Vicuna benchmark, in which each model was asked 80 questions and the answers were passed to GPT-4, which assigned each answer a rating from 1 to 10.
The successor of the Vicuna benchmark is MT-Bench (code, paper), which evaluates multi-turn conversations.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses.
Having both human evaluation data and a new automated benchmark, they found that MT-Bench performs very well, apparently “achieving over 80% agreement, the same level of agreement between humans”.
It was a pleasant surprise to see how easy their code was to use.
Generating model answers
python3 gen_model_answer.py --model-path models/llama-2-7b-guanaco-checkpoint-500 --model-id llama-2-7b-guanaco-checkpoint-500
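Since I wanted answers for every saved checkpoint, a small loop does the same call per checkpoint (a convenience sketch, not part of the FastChat repo; the model paths follow the pattern above):

import subprocess

checkpoints = [500, 1000, 1500, 2000, 2500, 3000, 3500]
for step in checkpoints:
    model_id = f"llama-2-7b-guanaco-checkpoint-{step}"
    subprocess.run(
        ["python3", "gen_model_answer.py",
         "--model-path", f"models/{model_id}",
         "--model-id", model_id],
        check=True,  # stop if one generation run fails
    )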
Evaluation by judge GPT-4
export OPENAI_API_KEY=not_a_real_OpenAI_key
# please ignore the weird model paths
models="alpaca-13b
gpt-4
llama-2-7b-guanaco-checkpoint-500
llama-2-7b-guanaco-checkpoint-1000
llama-2-7b-guanaco-checkpoint-1500
llama-2-7b-guanaco-checkpoint-2000
llama-2-7b-guanaco-checkpoint-2500
llama-2-7b-guanaco-checkpoint-3000
llama-2-7b-guanaco-checkpoint-3500"
python3 gen_judgment.py --mode pairwise-baseline --model-list ${models} --parallel 4
The pairwise-baseline mode compares each model's answers to a baseline model (GPT-3.5 by default). As controls, I added answers from alpaca-13b (negative) and GPT-4 (positive).
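If I remember correctly, the summary below is printed by the repo's show_result.py (something like python3 show_result.py --mode pairwise-baseline).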
Mode: pairwise-baseline
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
| model | win | loss | tie | win_rate | loss_rate | win_rate_adjusted |
|---|---:|---:|---:|---:|---:|---:|
| gpt-4 | 102 | 17 | 41 | 0.63750 | 0.10625 | 0.765625 |
| llama-2-7b-guanaco-checkpoint-2500 | 24 | 98 | 38 | 0.15000 | 0.61250 | 0.268750 |
| llama-2-7b-guanaco-checkpoint-3500 | 22 | 99 | 39 | 0.13750 | 0.61875 | 0.259375 |
| llama-2-7b-guanaco-checkpoint-3000 | 21 | 100 | 39 | 0.13125 | 0.62500 | 0.253125 |
| llama-2-7b-guanaco-checkpoint-2000 | 20 | 107 | 33 | 0.12500 | 0.66875 | 0.228125 |
| llama-2-7b-guanaco-checkpoint-1500 | 12 | 109 | 39 | 0.07500 | 0.68125 | 0.196875 |
| llama-2-7b-guanaco-checkpoint-1000 | 10 | 119 | 31 | 0.06250 | 0.74375 | 0.159375 |
| llama-2-7b-guanaco-checkpoint-500 | 6 | 129 | 25 | 0.03750 | 0.80625 | 0.115625 |
| alpaca-13b | 6 | 134 | 20 | 0.03750 | 0.83750 | 0.100000 |
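One note on the columns: win_rate_adjusted is consistent with counting each tie as half a win, which you can verify against the gpt-4 row (a quick check, not part of the script output):

win, loss, tie = 102, 17, 41                   # the gpt-4 row above
total = win + loss + tie                       # 160 pairwise judgments
win_rate = win / total                         # 0.637500
win_rate_adjusted = (win + 0.5 * tie) / total  # 0.765625
print(win_rate, win_rate_adjusted)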