LLM Optimization
LLM optimization techniques
distillation
- = a larger teacher model train a smaller student model.
- the student model learns to statistically mimic the behavior of the teacher model
- either just in the final prediction layer
- or in the model's hidden layers as well
- typically effective for encoder-only models

quantization
- = reduce weight's precision
- example: Post-Traning Quantization (PTQ)

pruning
- = remove model weights with values close or equal to 0
- pruning methods
- full model re-training
- PEFT/LoRA (LLM Finetuning#PEFT techniques)
- Post-training (Foundation Models#Post-training)
- in theory vs. in practice
- in theory, it can reduce model size and improve performance
- in practice, only small % in LLMs are 0-weights