![](https://crypto4nerd.com/wp-content/uploads/2023/07/17g2w-ZHQkRi2oS7xPIa6xw.png)
When I started this endeavour, OpenLLaMA 7B and 13B had just finished training on 1 trillion tokens. This base model is intended as a reproduction of the LLaMA models for unrestricted use. To shorten training time I went for the smaller 7B model (7 billion parameters). The recently released Falcon models with 7 and 40 billion parameters would have been a good alternative, since they also come with unrestricted licenses. However, inference with the Falcon models was very slow in my hands, so I went with what worked well for me at the time.
I started off by using the fine-tuning script provided in the QLoRA repository for fine-tuning LLaMA-7B on the Open Assistant data to reproduce Guanaco, except that I used OpenLLaMA 7B and two GPUs.
```
export WANDB_PROJECT=openllama-7b-OASST
python qlora.py \
--model_name_or_path models/open_llama_7b \
--output_dir ./output/guanaco-7b \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 500 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1000 \
--per_device_eval_batch_size 16 \
--max_new_tokens 32 \
--dataloader_num_workers 3 \
--group_by_length \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--do_eval \
--do_mmlu_eval False \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset data/openassistant-guanaco/openassistant_best_replies_train.jsonl \
--dataset_format oasst1 \
--source_max_len 16 \
--target_max_len 512 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--max_steps 1875 \
--eval_steps 180 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0 \
--report_to wandb
```
Training on 2×3090 GPUs I noticed the obvious: increasing per_device_train_batch_size makes things faster. According to the authors of the repository, per_device_train_batch_size * gradient_accumulation_steps should always be 16 (qlora/Readme.md: “Make sure to adjust `per_device_train_batch_size` and `gradient_accumulation_steps` so that their product is 16 and training fits on your GPUs.”). This is why I set per_device_train_batch_size to the maximum possible value of 16.
Only afterwards did I learn that the batch size also influences model quality: higher batch sizes speed up training, while lower batch sizes make the process slower but sometimes lead to better generalisation and therefore a higher-quality trained model.
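Just to make the rule of thumb explicit (a throwaway snippet, not part of qlora.py):

```python
# QLoRA readme rule of thumb: per-device batch size times gradient
# accumulation steps should stay at 16.
def effective_product(per_device_train_batch_size, gradient_accumulation_steps):
    return per_device_train_batch_size * gradient_accumulation_steps

print(effective_product(16, 1))  # this run: 16
print(effective_product(2, 8))   # the podcast run further below: 16
```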
The model was fine-tuned for 1875 steps, corresponding to ~3.8 epochs, which took 4.5 hours. Here is the Weights & Biases report of this fine-tuning run:
I did a quick sanity check to see whether the model was alive or something had gone terribly wrong.
Works!
Things to notice in the data above:
- Training loss is not a smooth curve but rather spiky. After doing a bit of reading, this seems to be caused by the group_by_length parameter passed to qlora.py. According to the documentation, setting this flag groups “sequences into batches with same length. Saves memory and speeds up training considerably.” This makes sense for efficiency, but it also means that consecutive batches contain sequences of systematically different lengths (and therefore difficulty), which apparently is the reason for the spikes in the training loss. A small sketch of this kind of length-grouped batching follows after this list.
- Evaluation loss goes up instead of down. Evaluation (and training) was done on both question and answer. Even with good generalisation, training on some questions will not help predict the other questions asked in the evaluation set. This loss behaviour is therefore expected and has been observed by other people too (one, two), but has so far not been commented on by the repository maintainers.
- dataset_format is set to oasst1. This basically trains on the entire text field as specified in the input JSON file, which essentially corresponds to training on raw text.
- target_max_len is set to 512 tokens, meaning that at most this many target tokens are used for training. If a string (or rather its tokenised form) exceeds 512 tokens, the excess tokens are truncated and ignored during training.
- source_max_len is 16. This might be confusing. However, as described in the previous bullet point, we are training only on raw text: there is no real input or source, and the model trains on both question and answer. See the part of the qlora.py source code where this is specified. A simplified sketch of this formatting is shown after this list.
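For illustration, here is a rough sketch of the kind of length-grouped batching that group_by_length enables. It loosely follows the Hugging Face LengthGroupedSampler; the megabatch factor of 50 and the shuffling details are my assumptions, not code from qlora.py.

```python
import random

def length_grouped_batches(lengths, batch_size, seed=42):
    """Shuffle, then sort within large 'megabatches' by length so that each
    training batch contains sequences of roughly similar length."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    megabatch = 50 * batch_size  # chunk over which sorting happens (assumed factor)
    batches = []
    for start in range(0, len(indices), megabatch):
        chunk = sorted(indices[start:start + megabatch],
                       key=lambda i: lengths[i], reverse=True)
        for b in range(0, len(chunk), batch_size):
            batches.append(chunk[b:b + batch_size])
    return batches

# Consecutive batches now alternate between blocks of long and short sequences,
# which fits the spiky per-step training loss seen above.
```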
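And here is a simplified sketch of what I understand the oasst1 data path to do with source_max_len and target_max_len; the function below is illustrative, not the actual qlora.py code.

```python
def build_raw_text_example(example, tokenizer, source_max_len=16, target_max_len=512):
    """With --dataset_format oasst1 the whole 'text' field becomes the target;
    the source is essentially empty, which is why source_max_len can be tiny."""
    source = tokenizer.bos_token                     # no separate instruction/input
    target = example["text"] + tokenizer.eos_token   # raw conversation text from the JSONL

    source_ids = tokenizer(source, max_length=source_max_len,
                           truncation=True, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, max_length=target_max_len,
                           truncation=True, add_special_tokens=False)["input_ids"]

    # Tokens beyond target_max_len are cut off and never seen during training.
    return {"input_ids": source_ids + target_ids}
```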
With the lessons learned above I used the following parameters to fine-tune OpenLLaMA 7B on the podcast transcripts.
```
export WANDB_PROJECT=open_llama_7b_lexfpodcast
export WANDB_WATCH=false
python qlora.py \
--model_name_or_path /home/g/models/open_llama_7b \
--output_dir ./output/open_llama_7b_lexfpodcast \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 100 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1000 \
--per_device_eval_batch_size 2 \
--max_new_tokens 32 \
--dataloader_num_workers 3 \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--do_eval \
--do_mmlu_eval False \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset data/lexf-podcasts/data_v0.3.jsonl \
--dataset_format oasst1 \
--source_max_len 16 \
--target_max_len 2048 \
--group_by_length False \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--max_steps 2000 \
--eval_steps 100 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0 \
--report_to wandb \
--max_memory_MB 23000
```
The model was fine-tuned for 2000 steps, corresponding to ~4.4 epochs. Each epoch took around 4 hours, for a total training time of ~18 hours.
Here is the Weights & Biases report:
The model was trained on a relatively simple desktop running Ubuntu with 2× RTX 3090. For people interested in multi-GPU homebrew training, this is how my GPUs handled the task:
Prompting the resulting model with simple questions is not a lot of fun: it is incredibly nosy and keeps asking questions, digging deeper and deeper, without reacting to my answers. Using it like a simple chatbot is therefore not very entertaining. I was curious, but to be fair, this is not what it was trained for.
Confronting the model with adequate prompts such as the beginning of a podcast episode yielded better results.
The following data has been generated using checkpoint 1000, i.e. after 1k training steps, corresponding to 3.3 epochs and a training loss of 1.8 (shown in the figures above). People usually train for 3–5 epochs and shoot for a training loss of ~1.5, but in general there seems to be no gold standard for how long to train.
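For completeness, this is roughly how such a checkpoint can be loaded and prompted with transformers and peft. The adapter path, the checkpoint directory layout and the generation settings below are illustrative assumptions, not the exact values I used:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = "models/open_llama_7b"                                 # base model (as above)
adapter = "output/open_llama_7b_lexfpodcast/checkpoint-1000"  # LoRA adapter (assumed path)

tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)             # attach the fine-tuned LoRA weights
model.eval()

prompt = "### Assistant: This is my conversation with Barack Obama, ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512,
                            do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```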
This is the prompt I used to start a conversation:
### Assistant: This is my conversation with Barack Obama, an American politician and attorney who served as the 44th President of the United States from 2009 to 2017. Known for his eloquent oratory and progressive policies, he made history as the first African American to hold the presidency. During his two-term tenure, Obama championed healthcare reform with the Affordable Care Act and led initiatives on climate change, notably signing the Paris Agreement. His presidency, characterized by a focus on inclusivity and diplomacy, has been influential in shaping contemporary American politics.
Mr. Obama, welcome and thank you for being here. Let me start by asking a provocative question: Why
And this is what followed. Please note that both the assistant and the human text (in this case virtual Lex and virtual Obama) have been generated by the model (the special token </s> was excluded).
Virtual podcast episode with Barack Obama, take 1
### Assistant: This is my conversation with Barack Obama, an American politician and attorney who served as the 44th President of the United States from 2009 to 2017. Known for his eloquent oratory and progressive policies, he made history as the first African American to hold the presidency. During his two-term tenure, Obama championed healthcare reform with the Affordable Care Act and led initiatives on climate change, notably signing the Paris Agreement. His presidency, characterized by a focus on inclusivity and diplomacy, has been influential in shaping contemporary American politics.
Mr. Obama, welcome and thank you for being here. Let me start by asking a provocative question: Why do you think Donald Trump won the 2016 presidential election?
### Human: I’m going to give you a very honest answer that may not be politically correct because it’s important for people to understand why we ended up where we are today. The reason that Mr. Trump was able to win despite his manifest unfitness for office was that there were millions of Americans who voted for him while knowing that he wasn’t fit for office. They did so either because they didn’t care about what he said or what he tweeted or whether he would tell lies every day, or they genuinely believed that he would somehow make America great again. And those people felt left behind economically. They had lost their jobs in part because of automation and globalization and didn’t feel like anybody was listening to them. And they saw this billionaire businessman who was gonna shake things up and do something different. Now, having said all that, if I had to guess what might have happened differently, If somebody like Jeb Bush or Marco Rubio or John Kasich had gotten elected instead of Donald Trump, even though they weren’t exactly inspiring, at least they seemed more likely than Mr. Trump to surround themselves with competent people and actually govern effectively.
### Assistant: But isn’t that still an issue of character?
### Human: No doubt. It’s an issue of character. Character matters. That’s why I wrote two books about it. So yes, character mattered. But these other candidates also didn’t go around saying whatever pops into their head and then doubling down when they get called out on stuff. You know, there’s a real difference between making mistakes and changing your mind versus just lying about stuff over and over and then denying when confronted with evidence that you’ve lied. There’s a big difference between someone who says mean things about immigrants but then goes to visit a detention center and listens to the conditions inside and ends up feeling bad about himself afterwards, versus someone whose whole worldview is based on hate.
### Assistant: What do you think is the biggest mistake that Hillary Clinton made during her campaign against Donald Trump?
### Human: Well, there were several mistakes, but one of them was that she thought she could take him seriously as a candidate and therefore needed to treat him as a serious opponent rather than the joke that he was. And so she adopted some of his positions herself. She went after his core supporters, she talked about building walls, she gave a speech in which she basically said that she was going to put coal miners out of work. So it’s not surprising that those people decided to vote for him instead of her. If you want to beat somebody like that, you shouldn’t try to outdo them. Just simply say, “I’m better than you.” Beat them soundly.
### Assistant: Okay, let’s talk briefly about Russia and Vladimir Putin. He’s arguably the most powerful man in the world. In your new book, A Promised Land, you write that you believe Putin wanted to undermine confidence in our democracy. To what extent does his actions contribute to the sense of distrust that exists among Americans?
### Human: Well, I don’t think that his actions alone explain it. But they certainly didn’t help. Because even before the Russian interference, we already had a pretty strong anti-Russian bias. We’d just finished eight years under a president who spent most of his time obsessing about the evils of Russia and trying to build a wall between us and Mexico. So when we got hit with this barrage of Russian propaganda during the campaign, it probably reinforced people’s existing views.
### Assistant: Is there anything positive that can come from the fact that so much of our government officials now believe Vladimir Putin? Does it make us stronger? Does it make us weaker?
### Human: I think it makes us weaker because it means that we’re giving way too much credence to a foreign leader.
Virtual podcast episode with Barack Obama, take 2
### Assistant: This is my conversation with Barack Obama, an American politician and attorney who served as the 44th President of the United States from 2009 to 2017. Known for his eloquent oratory and progressive policies, he made history as the first African American to hold the presidency. During his two-term tenure, Obama championed healthcare reform with the Affordable Care Act and led initiatives on climate change, notably signing the Paris Agreement. His presidency, characterized by a focus on inclusivity and diplomacy, has been influential in shaping contemporary American politics.
Mr. Obama, welcome and thank you for being here. Let me start by asking a provocative question: Why do you think that Donald Trump was elected president? I don’t mean politically speaking, but why do you think he won?
### Human: Well, there are many reasons. And some of them had nothing to do with him, right? So it’s worth understanding how this happened. I spent eight years trying to get people to vote. We got record turnouts in ’08 and ’12, which were great. Turnout fell back down in ’16. And so while we can argue about what percentage of that dropoff was due to voter suppression efforts, whether it was Russian interference, whether it was just apathy, you know, there were legitimate reasons why people didn’t show up at the polls. But then in 2020, you know, despite everything else going on, you know, including a pandemic, you know, a significant percentage of people turned out again. So if you look at all those factors together, we had a lot of people who were eligible to vote who weren’t voting because they couldn’t be bothered. They felt like their votes didn’t matter. Maybe they believed the falsehoods spread by folks like Mr. Trump suggesting that our elections were rigged. Then you add to that mix somebody who comes along and says, “I alone can fix” whatever problem you want to talk about. “Everything is terrible.” That message resonated even though most Americans thought that the economy was doing pretty well. Most Americans thought that our foreign policy wasn’t working. Most Americans said that we needed to deal with health care. And yet somehow
### Assistant: The singular man, the one person could solve everything.
### Human: Exactly. Even though everyone knew that nobody’s one person can solve anything. You need to have coalitions, you need compromises. Somebody who’s willing to say that they’re going to ignore the Constitution when it suits their purposes. He said things during the campaign that he still hasn’t admitted publicly, although others have come forward to tell us exactly what he said and meant behind closed doors. It doesn’t surprise me that a number of suburban women voted against him because they heard him saying these things. Whether it was about his treatment of women, whether it was about his comments about African Americans or Latinos, and now we hear reports about his own taxes. None of it surprises me. What does surprise me is how hard so many of our fellow citizens are working to normalize what this guy is doing. How often they give him credit for good economic news. I remember during one of the debates, someone asked me about something that I had done in office, and he interrupted before I even answered and said, “You didn’t create jobs, I did.” As if it was his job to create jobs. Then he goes on to brag about how many cars he sold or how many houses he built. Meanwhile, millions of people lost their jobs during that same period. Millions of families lost their homes during that same period. Yet somehow he gets the credit because his businesses were doing better than other people’s businesses. Or he gets the credit for
### Assistant: because the stock market went up.
### Human: Because the stock market went up, which means very little unless you happen to be invested in stocks yourself. And I suspect that many of his voters aren’t.
### Assistant: If you’re listening to this interview, please do not call into any of the shows where I’m appearing and say that I said that the stock market went up because of Barack Obama. I never once in my life talked about the stock market and its relation to anything except actual performance. I am not responsible for the rise of the Dow Jones Industrial Average.
### Human: There’s no doubt that he used every tool available to him to try to boost the economy during his time in office. There’s also no doubt that he inherited an economy that was already showing signs of improvement when he came into office. In fact, unemployment was falling before he took over. When you look at the totality of what he did, especially compared to what George W. Bush did, which was basically throw gas on the fire after 9/11, and then go ahead and take us to war in Iraq without bothering to find out whether we should or shouldn’t. That was a disaster economically and otherwise. I’m not excusing myself. I’m just saying that I tried to do positive things for the country. I focused on education. I focused on expanding access to affordable healthcare. I worked to make sure that we were investing more in infrastructure. And I kept telling everybody that it was going to cost money. And guess what? People wanted lower taxes. So if you want to spend money, you have to pay for it. Either you tax people less, which would mean they wouldn’t have enough money left over to buy stuff, so therefore, demand would fall. Or you raise revenue through higher taxes, which means that some people might decide they don’t want to buy your product anymore, but hopefully you’ve diversified your customer base and you’ll be okay. But anyway, that was the challenge. But what
### Assistant: was the attitude towards government spending during that time? Was there a Republican Congress?
### Human: Yes. Yes, there was. And throughout most of my time in public service, whenever anyone would propose making investments in infrastructure or research and development or early childhood education programs or things that will help lift people up, instead of coming back and saying, “We support that,” what they’d say is, “Show me the money!” And the reason they would say that is because historically, Republicans and Democrats alike, until very recently, understood that we live in a society in which we all benefit when somebody else succeeds. Nowadays, too many people seem to believe that if someone else is getting something for free, it must necessarily be bad for them. That’s just simply not true. And so I think part of what contributed to this environment of rising inequality and stagnant wages is that so much of what we saw happening was justified based on this idea that the pie is shrinking. Whereas in reality, there were plenty of opportunities to grow the pie. Just as we had an opportunity during my time in office, but also prior to my time in office, to invest in things that would actually generate new wealth rather than just redistribute existing wealth.
Not terribly good but a good start. More takes of this conversation using all the checkpoints and the same “Obama”-prompt can be found here.
Things move so quickly in the field of open source large language models that several improvements were released while I was working on this.
- Bigger and better models. OpenLLaMA also provides 13B versions of their models, and even versions 2 and 3 of their foundation models, which have been pre-trained on additional data. Facebook Research also recently released version 2 of their popular LLaMA models, this time licensed for both researchers and commercial entities.
- Increased context size. The OpenLLaMA model I used here is restricted to a context size of 2048 tokens, which is obviously a limitation when training a model to generate long-format conversations. Recently, several people have discovered ways of increasing the context size several-fold. See this and this, and many others which I probably missed. The subreddit LocalLLaMA is a good place to start looking for hot stuff. Also, Llama 2 provides 4k context out of the box.
- Only train on target. The approach above is a fine-tune on raw text including both question and answer. By splitting each character’s turn (i.e. question and answer) and training only on the response, I would assume that the model’s quality will improve; a sketch of this idea follows below. Guanaco was trained in the same way as described here, simply because the raw-text approach apparently already yielded a good enough chatbot, so there was no incentive to go further and put more effort into it.
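Here is a minimal sketch of what “only train on target” could look like: tokenize prompt and response separately and mask the prompt tokens in the labels so that the loss is computed only on the response. The -100 masking follows the usual Hugging Face convention; this is not code from qlora.py.

```python
def build_masked_example(tokenizer, prompt, response, max_len=2048):
    """Train only on the response: prompt tokens get label -100 and are
    therefore ignored by the cross-entropy loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```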
My main takeaways from this project:
- QLoRA provides a way to fine-tune available foundation models at home in a reasonable amount of time.
- Data preparation is key but can be laborious (depending on the use case); look at the raw data to make sure it is what you think it is (otherwise: shit in, shit out).
- Standard training hyperparameters may work but not yield optimal results; understand all the wheels and knobs, change them, and see what happens.
- Observe the training process, for example with wandb.ai, and see if the numbers and plots make sense.
- Learn by doing.
I learned a lot in the process and it is my hope that you, too, gained some insights from this tutorial.