Entropy

Probabilistic Machine Learning, Murphy

The optimal coding scheme will allocate fewer bits to more frequent symbols (i.e., values of Y for which p(y) is large), and more bits to less frequent symbols.
A key result states that the average number of bits per symbol needed to losslessly compress a dataset generated by a distribution p is at least H(p); the entropy therefore provides a lower bound on how far we can compress data without losing information.
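As a rough illustration of the two points above, the sketch below computes the entropy of a small distribution and the idealized code length -log2 p(y) for each symbol. The distribution p and the helper names entropy_bits and ideal_code_lengths are assumptions made for this example, not taken from the book; the probabilities are powers of 1/2 so that the ideal lengths come out as whole numbers of bits.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits; symbols with p(y) = 0 contribute nothing."""
    return -sum(py * math.log2(py) for py in p if py > 0)

def ideal_code_lengths(p):
    """Idealized code length -log2 p(y) per symbol (assumes every p(y) > 0)."""
    return [-math.log2(py) for py in p]

# Hypothetical distribution over four symbols, chosen for illustration.
p = [0.5, 0.25, 0.125, 0.125]

lengths = ideal_code_lengths(p)  # frequent symbols get shorter codes
expected_length = sum(py * length for py, length in zip(p, lengths))

print("code lengths:", lengths)
print("expected bits per symbol:", expected_length)
print("entropy H(p):", entropy_bits(p))
```

For this particular p the expected code length equals H(p), so the lower bound is met exactly; for general distributions the ideal lengths are not integers and a practical code can only approach the bound.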
The KL divergence can be written as KL(p || q) = H(p, q) - H(p), where the H(p, q) term is known as the cross-entropy. It measures the expected number of bits needed to compress a dataset drawn from distribution p when the code is designed using distribution q.
Thus the KL divergence is the number of extra bits needed to compress the data when the code is based on the incorrect distribution q rather than the true distribution p.
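The "extra bits" reading can be checked numerically. The sketch below is a minimal example under stated assumptions: the distributions p and q and the helper names cross_entropy_bits and kl_bits are invented for illustration, and q is assumed to be nonzero wherever p is nonzero. It computes the cross-entropy and the KL divergence in bits and compares the KL divergence with the difference between the cross-entropy and the entropy.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) in bits: expected code length when data follow p but the code is built from q."""
    return -sum(py * math.log2(qy) for py, qy in zip(p, q) if py > 0)

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q(y) > 0 wherever p(y) > 0."""
    return sum(py * math.log2(py / qy) for py, qy in zip(p, q) if py > 0)

# Hypothetical distributions, chosen for illustration.
p = [0.5, 0.25, 0.125, 0.125]  # true data distribution
q = [0.25, 0.25, 0.25, 0.25]   # mismatched (uniform) coding distribution

h_p = cross_entropy_bits(p, p)   # entropy H(p) is the cross-entropy of p with itself
h_pq = cross_entropy_bits(p, q)  # bits per symbol when coding with q
extra = kl_bits(p, q)            # penalty for using q instead of p

print("H(p):", h_p)
print("H(p, q):", h_pq)
print("KL(p || q):", extra)
print("H(p, q) - H(p):", h_pq - h_p)
```

With a uniform q over four symbols every symbol costs 2 bits, so the gap between H(p, q) and H(p) is the price of ignoring that some symbols are more frequent than others.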