AI Model Inference Optimization
Inference performance metrics
Latency metrics
= measure the time from when users send a query until they receive the complete response
- The overall latency can be broken down into several metrics:
- Time to first token (TTFT) = how quickly the first token is generated after users send a query
- Time per output token (TPOT) = how quickly each output token is generated after the first token
- Time between tokens (TBT), also called inter-token latency (ITL) = the time between consecutive output tokens
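As a toy illustration, these latency metrics can be computed from per-token timestamps; all timestamp values below are made up:

```python
# A minimal sketch of computing the latency metrics above from per-token
# timestamps. The timestamp values are made up for illustration.
request_sent_at = 0.00                   # when the user sends the query (seconds)
token_times = [0.35, 0.40, 0.46, 0.51]   # when each output token is produced

ttft = token_times[0] - request_sent_at                      # time to first token
inter_token_latencies = [t2 - t1 for t1, t2 in zip(token_times, token_times[1:])]
# average time per output token after the first one
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
total_latency = token_times[-1] - request_sent_at            # full response latency

print(f"TTFT: {ttft:.2f}s, TPOT: {tpot:.2f}s, total: {total_latency:.2f}s")
print(f"inter-token latencies: {inter_token_latencies}")
```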
Utilization metrics
= measure how efficiently a resource is being used
- MFU (Model FLOP/s Utilization) = the ratio of the observed throughput (tokens/s) to the theoretical maximum throughput of the system running at peak FLOP/s
- MBU (Model Bandwidth Utilization) = the percentage of the achievable memory bandwidth that is actually used
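As a rough sketch of how both ratios can be computed, assuming a dense decoder where each generated token costs about 2 × (parameter count) FLOPs and reads every weight from memory once (KV-cache traffic ignored); all model and hardware numbers below are hypothetical:

```python
# A rough sketch of MFU and MBU for a dense decoder model. All numbers are
# hypothetical; FLOPs per token is approximated as 2 * parameter count, and
# bytes moved per token as parameter count * bytes per parameter.
params = 7e9                 # 7B-parameter model (assumption)
bytes_per_param = 2          # fp16/bf16 weights
observed_tokens_per_s = 60   # measured decode throughput (assumption)

peak_flops = 312e12          # peak FLOP/s of the accelerator (assumption)
peak_bandwidth = 2.0e12      # peak memory bandwidth in bytes/s (assumption)

flops_per_token = 2 * params
mfu = observed_tokens_per_s * flops_per_token / peak_flops

bytes_per_token = params * bytes_per_param
mbu = observed_tokens_per_s * bytes_per_token / peak_bandwidth

print(f"MFU: {mfu:.2%}, MBU: {mbu:.2%}")
```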
Inference Optimization
Inference optimization can be done at different levels:
- model level
- hardware level
- service level
Inference model optimization
- optimize model size
    - model compression
- optimize autoregressive decoding
    - speculative decoding (a minimal sketch follows this list)
    - inference with reference
    - parallel decoding
- optimize the attention mechanism
    - redesigning the attention mechanism
    - writing kernels for attention computation
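Below is a minimal sketch of speculative decoding with greedy verification; `draft_next_token` and `target_argmax` are placeholder callables standing in for a small draft model and the large target model, not any specific library's API.

```python
# A minimal sketch of one speculative decoding step with greedy verification.
# The draft and target models are represented by placeholder callables.
from typing import Callable, List

def speculative_decode_step(
    prefix: List[int],
    draft_next_token: Callable[[List[int]], int],
    target_argmax: Callable[[List[int]], List[int]],
    num_draft_tokens: int = 4,
) -> List[int]:
    """Propose `num_draft_tokens` tokens with the draft model, then verify
    them with a single forward pass of the target model (greedy variant)."""
    # 1. The draft model proposes a short continuation, one token at a time.
    draft = []
    ctx = list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_next_token(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target model scores prefix + draft in one pass; target_argmax is
    #    assumed to return its greedy token at each of the last
    #    num_draft_tokens + 1 positions (its prediction after each drafted prefix).
    target_preds = target_argmax(prefix + draft)

    # 3. Accept drafted tokens as long as they match the target's own choice;
    #    on the first mismatch, keep the target's token instead and stop.
    accepted = []
    for drafted, target_tok in zip(draft, target_preds):
        if drafted == target_tok:
            accepted.append(drafted)
        else:
            accepted.append(target_tok)
            break
    else:
        # All drafted tokens accepted: also take the target's bonus token.
        accepted.append(target_preds[num_draft_tokens])
    return accepted
```

When the draft model agrees with the target model often, several tokens are accepted per target forward pass, which is how this reduces the autoregressive decoding bottleneck.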
Inference service optimization
- batching
- decoupling prefill and decode
- prompt caching (see the sketch after this list)
- parallelism
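Below is a minimal sketch of prompt (prefix) caching: prefill is run once per distinct shared prefix (for example, a long system prompt) and its KV cache is reused across requests; `prefill` here is a placeholder callable, not a real API.

```python
# A minimal sketch of prompt (prefix) caching: run prefill once per distinct
# shared prefix and reuse the resulting KV cache across requests.
from typing import Any, Callable, Dict, List, Tuple

class PromptCache:
    def __init__(self, prefill: Callable[[List[int]], Any]):
        self._prefill = prefill                      # placeholder prefill function
        self._cache: Dict[Tuple[int, ...], Any] = {}

    def kv_for_prefix(self, prefix: List[int]) -> Any:
        key = tuple(prefix)
        if key not in self._cache:
            # Cache miss: pay the prefill cost once for this prefix.
            self._cache[key] = self._prefill(prefix)
        return self._cache[key]

# Usage: requests sharing the same system prompt reuse one prefill result.
if __name__ == "__main__":
    cache = PromptCache(prefill=lambda toks: f"kv-cache-for-{len(toks)}-tokens")
    system_prompt = [101, 7592, 2026]          # made-up token ids
    kv_a = cache.kv_for_prefix(system_prompt)  # computes prefill
    kv_b = cache.kv_for_prefix(system_prompt)  # reuses the cached result
    assert kv_a is kv_b
```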