LLM Optimization

LLM optimization techniques

distillation

= a larger teacher model trains a smaller student model.
the student model learns to statistically mimic the behavior of the teacher model
- either just in the final prediction layer
- or in the model's hidden layers as well
typically more effective for encoder-only models (e.g., BERT) than for generative decoder models
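
sketch: distillation loss on the final prediction layer
A minimal illustration of "mimicking the teacher" at the output layer, assuming the common setup where teacher and student logits are softened with a temperature and compared with KL divergence. The function names, the temperature value, and the example logits are made up for illustration and do not come from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher "soft labels"
    q = softmax(student_logits, temperature)  # student "soft predictions"
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one token over a 3-word vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

In training, this loss would be minimized with respect to the student's weights only (the teacher stays frozen), often combined with the ordinary loss on the ground-truth labels.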

quantization

= reduce the precision of the model's weights (e.g., from 32-bit floats to 8-bit integers)
example: Post-Training Quantization (PTQ)
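
sketch: quantizing weights after training
A minimal illustration of the idea behind PTQ, assuming symmetric per-tensor INT8 quantization (one of several possible schemes). Function names and weight values are hypothetical; real toolkits also calibrate activations and handle per-channel scales, which this sketch ignores.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

# Hypothetical weights from an already-trained layer.
weights = [0.82, -1.27, 0.003, 0.41, -0.66]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half a quantization step per weight.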

pruning

= remove model weights with values close to or equal to 0
pruning methods
- full model re-training
- PEFT/LoRA (LLM Finetuning#PEFT techniques)
- Post-training (Foundation Models#Post-training)
in theory vs. in practice
- in theory, it can reduce model size and improve inference performance
- in practice, only a small percentage of LLM weights are zero or close to zero, so the gains may be limited
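
sketch: magnitude pruning and measuring sparsity
A minimal illustration of removing near-zero weights, assuming simple threshold-based magnitude pruning. The function names, the threshold, and the weight values are hypothetical; the example only shows the mechanics, not how many weights a real LLM would lose.

```python
def sparsity(weights, threshold=0.0):
    """Fraction of weights whose magnitude is at or below the threshold."""
    return sum(1 for w in weights if abs(w) <= threshold) / len(weights)

def magnitude_prune(weights, threshold):
    """Set weights with |w| <= threshold to exactly 0; leave the rest untouched."""
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Hypothetical layer weights: one exact zero and two near-zero values.
weights = [0.5, -0.001, 0.0, 0.73, -0.42, 0.0004, 0.9, -0.6]

before = sparsity(weights)
pruned = magnitude_prune(weights, threshold=0.01)
after = sparsity(pruned)
print(before, after)
```

The size and speed benefit only materializes if the zeros are actually stored and computed sparsely, and it scales with the fraction of weights that can be pruned, which ties back to the "in practice" caveat above.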