Why do we need model optimization? What are the different methods of model optimization?
Deep learning model optimization covers methods that speed up model inference, usually at the cost of some decrease in accuracy.
1. Foundational Models
These days there are lots of pre-trained foundational models available in both the vision and text domains. Models like Segment Anything and DINOv2 are pretrained on millions of images (11 million and 142 million images, respectively). In the text domain, we have Llama 2, Vicuna, and so on. Finetuning them on new tasks will almost always give better results than training anything from scratch. However, these foundational models are generally large vision/language models. The idea here is to take advantage of the large-scale pretraining.
One problem with foundational models is that they require significant compute during training, even if we optimize the inference. Recently, papers like LoRA (Low-Rank Adaptation of Large Language Models), QLoRA (Efficient Finetuning of Quantized LLMs), and so on have reduced the computational requirements of training those large foundational models. These papers fall under the category of parameter-efficient fine-tuning (PEFT) methods, which is getting lots of attention because of these foundational models.
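The core idea behind LoRA can be sketched in a few lines: keep the pretrained weight matrix frozen and learn only a low-rank update. The sizes, names, and initialization below are illustrative, not the paper's exact setup.

```python
# Minimal sketch of the LoRA idea: instead of updating a large frozen
# weight matrix W (d x d), train only a low-rank update B @ A, with
# A (r x d) and B (d x r), where r << d. Sizes here are hypothetical.
import random

random.seed(0)
d, r = 8, 2  # illustrative hidden size and LoRA rank

# Frozen pretrained weight (stays untouched during fine-tuning).
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# Trainable low-rank factors: B starts at zero so the adapted model
# initially behaves exactly like the pretrained one.
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Effective weight used at inference: W + B @ A.
delta = matmul(B, A)
W_eff = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

# Trainable parameters drop from d*d to 2*d*r.
full_params, lora_params = d * d, 2 * d * r
print(full_params, lora_params)
```

With these toy sizes the saving is modest (64 vs. 32 parameters), but since the full matrix grows quadratically with `d` while the LoRA factors grow linearly, the gap becomes dramatic at transformer scale.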
However, because of the large model size, we will not always be able to deploy them in production and need optimization methods to make the inference faster.
On a side note, vision language models (like CLIP, BLIP-2, and so on) can also be used as pretrained encoders depending on tasks.
2. Training large then compressing is better than training a compressed model
There are two aspects to comparing a large model plus compression against a directly trained compressed model. Normally, a compressed model will complete a training iteration much faster than a large model, so will training time and compute requirements increase? And after compressing the large model, how much performance is lost?
There is a paper, Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers, from UC Berkeley, which shows that wider and deeper models converge much faster than smaller models and are also easier and more robust to compress.
Along with parameter-efficient fine-tuning and the finding that large models converge faster, training time and compute requirements will not be a problem.
Also, from the Validation Accuracy vs. Number of Parameters diagram in the paper, it’s clear that for a given parameter count, large, heavily compressed models work better than smaller, less compressed models.
3. Training large models and then optimizing makes more sense for business
More often than not, the models need to be deployed in multiple environments, such as cloud deployment and on-premise deployment. For cloud deployment, we can use larger models, but for on-premise deployment, we might need to optimize based on client requirements. Training and maintaining separate models is more hassle than training one large, well-performing model and optimizing it according to each deployment’s requirements.
Two of the most common model optimization methods are Quantization and Pruning.
Quantization stores the model weights in a lower-precision format (e.g., fp32 down to fp16/bfloat16/int8) to accelerate matrix operations on hardware with reduced-precision support and to reduce the overall memory footprint. During quantization, we sometimes quantize only a subset of the weights and activations instead of quantizing everything.
Post-training quantization is a technique to quantize (clip, scale, round, and so on) the model weights after the model has been trained.
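The clip/scale/round steps mentioned above can be sketched with a toy affine (scale plus zero-point) int8 quantizer. The weight values and range handling below are illustrative, not from any particular library.

```python
# Illustrative post-training affine quantization to int8:
# derive a scale and zero-point from the observed weight range,
# then clip, scale, and round each value.

weights = [-1.7, -0.3, 0.0, 0.42, 0.9, 2.5]  # pretend fp32 weights

qmin, qmax = -128, 127  # int8 range
w_min, w_max = min(weights), max(weights)

scale = (w_max - w_min) / (qmax - qmin)
zero_point = round(qmin - w_min / scale)

def quantize(w):
    q = round(w / scale) + zero_point
    return max(qmin, min(qmax, q))  # clip to the int8 range

def dequantize(q):
    return (q - zero_point) * scale

q_weights = [quantize(w) for w in weights]
recovered = [dequantize(q) for q in q_weights]

# The round trip loses at most about half a quantization step per weight.
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, recovered))
print(q_weights, max_err)
```

Note how the error is bounded by the quantization step `scale`: this is the accuracy/precision trade-off that post-training quantization accepts without any retraining.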
Quantization-aware training is a training method that involves simulating the quantization effects during the training phase itself. Instead of training the model with high-precision floating-point numbers, the model is trained with lower-precision representations, emulating the quantization that will be applied later during deployment. This helps the model adapt to the reduced precision and minimizes the loss of accuracy caused by quantization. Quantization-aware training is often better than post-training quantization. Also, don’t confuse this with mixed-precision training.
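The key mechanism in quantization-aware training is "fake quantization": the forward pass uses a weight rounded to a coarse grid, while the gradient updates an underlying float weight as if the rounding were the identity (the straight-through estimator). The toy task, grid step, and learning rate below are made up for illustration.

```python
# Toy sketch of quantization-aware training with a straight-through
# estimator: forward with a fake-quantized weight, update the float
# "shadow" weight with the unmodified gradient.

def fake_quantize(w, step=0.25):
    # Simulate a coarse quantization grid during training.
    return round(w / step) * step

# Toy task: fit y = 2.1 * x with a single scalar weight.
data = [(x, 2.1 * x) for x in [-2.0, -1.0, 0.5, 1.0, 2.0]]

w, lr = 0.0, 0.05
for _ in range(200):
    for x, y in data:
        w_q = fake_quantize(w)        # forward pass uses the quantized weight
        grad = 2 * (w_q * x - y) * x  # dL/dw_q, passed straight through to w
        w -= lr * grad                # update the float shadow weight

# The deployed (quantized) weight lands on a grid point near the
# true value 2.1, even though 2.1 itself is not representable.
print(fake_quantize(w))
```

Because the model has only ever "seen" grid-valued weights during training, deploying the quantized weight costs little extra accuracy, which is exactly the advantage QAT has over post-training quantization.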
Model pruning sets some of the network weights to zero, which directly reduces the number of operations and also reduces the memory footprint.
Pruning algorithms can drop individual weights or whole groups of weights; these are called unstructured and structured pruning, respectively.
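The difference between the two can be sketched with magnitude pruning on a toy weight matrix; the matrix values and thresholds below are illustrative.

```python
# Toy contrast between unstructured and structured magnitude pruning
# on a 4x4 weight matrix.

W = [
    [0.9, -0.02, 0.4, 0.01],
    [0.05, 0.03, -0.04, 0.02],
    [-0.7, 0.6, 0.03, -0.5],
    [0.01, -0.8, 0.02, 0.3],
]

def prune_unstructured(W, threshold=0.1):
    # Zero out individual weights with small magnitude.
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in W]

def prune_structured(W, threshold=0.1):
    # Zero out whole rows (e.g., entire neurons) whose average magnitude
    # is small, keeping a regular structure that dense hardware can exploit.
    out = []
    for row in W:
        mean_mag = sum(abs(w) for w in row) / len(row)
        out.append(row[:] if mean_mag >= threshold else [0.0] * len(row))
    return out

U = prune_unstructured(W)
S = prune_structured(W)

def sparsity(M):
    return sum(w == 0.0 for row in M for w in row) / 16

print(sparsity(U), sparsity(S))
```

Unstructured pruning reaches higher sparsity here, but the scattered zeros only pay off with sparse kernels; structured pruning removes whole rows, so the matrix can simply be shrunk. This trade-off is why structured pruning is often preferred for real speedups.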
Note that model inference can also be optimized in other ways, such as parallel model inference or converting the model to a static computational graph.
We will go over how the different quantization and pruning techniques are used in a model in separate blogs.
Connect with me
Feel free to drop me a message.
Have a nice day ❤️