What?
The authors demonstrate that deep networks can compress the training data (the labels given the inputs) and measure how much current deep models actually compress it.
Why?
Compression is strongly related to generalization: sample complexity bounds such as VC-dimension and PAC-Bayes are related to the compressed length of the data under a model, and any compression scheme leads to generalization bounds.
How?
The authors compared four encoding methods, (1) two-part codes, (2) network compression, (3) variational codes, and (4) online or prequential codes, against a uniform encoding baseline L^{\text{unif}}(y_{1:n} | x_{1:n}) = n \log_2 K, where K is the number of classes (a worked number for this baseline follows the list), and found that:
- Two-part codes and network-compression techniques do much worse than uniform coding, with the exception of the sub-space selection method; but in that case test accuracy suffers a lot.
- Variational methods also compress poorly when compared to online or prequential codes, which give the best results, though they do better than two-part codes and network compression. This is unexpected given that the variational objective is exactly a description length to be minimized.
- Incremental encoding, i.e. prequential coding on top of standard training, yields much better codelengths than standard VI and also correlates better with test-set performance.
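To put the uniform baseline above in concrete terms, here is a back-of-the-envelope number (MNIST-sized figures, used purely as an illustration): with K = 10 classes and n = 60{,}000 training labels, L^{\text{unif}} = 60{,}000 \times \log_2 10 \approx 1.99 \times 10^5 \text{ bits} \approx 25 \text{ kB}, so a model compresses the labels only if it encodes them in fewer bits than that.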
Note that: the Shannon-Huffman code associated with a trained model that outputs a probability p(y_i | x_i) for any pair (x_i, y_i) has codelength L_p(y_{1:n} | x_{1:n}) = - \sum_{i=1}^n \log_2 p(y_i | x_i), which is exactly the categorical cross-entropy loss over the n examples (measured in bits). Thus minimizing this loss is the same as minimizing the codelength: the distribution that minimizes the loss also minimizes the codelength.
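A minimal numerical check of this identity (the probabilities below are made up for illustration; the usual cross-entropy in nats differs from the codelength in bits only by the factor ln 2):

```python
import numpy as np

# Toy model outputs p(y_i | x_i) for n = 4 examples and K = 3 classes
# (values are made up for illustration).
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])
y = np.array([0, 1, 2, 1])          # true labels

# Shannon-Huffman codelength in bits: -sum_i log2 p(y_i | x_i).
codelength_bits = -np.log2(p[np.arange(len(y)), y]).sum()

# Categorical cross-entropy loss as usually computed (in nats).
cross_entropy_nats = -np.log(p[np.arange(len(y)), y]).sum()

# Same quantity up to the bits/nats conversion factor ln 2.
assert np.isclose(codelength_bits, cross_entropy_nats / np.log(2))
print(codelength_bits, "bits")
```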
Also note that: the expected codelength is lower-bounded by the conditional entropy, \mathbb{E}_q[L(y | x)] \geq H(y | x), so the gain of a codelength L over the trivial codelength is bounded by:
\underbrace{H(y) - \mathbb{E}_q[L(y | x)]}_{\text{Gain}} \leq \underbrace{H(y) - H(y | x)}_{I(y;\, x),\ \text{Mutual Information}}
In words, the gain of any codelength compared to the uniform code is limited by the mutual information between input and output. Conversely, any successful compression of the labels is a direct (lower-bound) estimate of this mutual information.
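As an illustrative calculation (the numbers are assumed, not taken from the paper): with K = 10 uniformly distributed classes, H(y) = \log_2 10 \approx 3.32 bits per label; if a code achieves an expected 0.30 bits per label, then \underbrace{3.32 - 0.30}_{\text{Gain} \,\approx\, 3.02 \text{ bits/label}} \leq I(y; x), i.e. the observed compression gain certifies roughly 3 bits of mutual information per label between inputs and labels.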
Quantifying learned regularity: The difference between the prequential codelength L^{\text{preq}}(y_{1:n} | x_{1:n}) and the log-loss - \sum_{t=1}^n \log_2 p_{\hat{\theta}}(y_t | x_t) of the final trained model can be interpreted as the amount of information that the trained parameters contain about the data.
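A minimal sketch of prequential coding and of this "learned regularity" quantity, assuming a toy synthetic dataset and a softmax regression standing in for a deep network (the data, checkpoints, and hyperparameters below are made up for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n points, d features, K classes from a noisy linear teacher
# (purely illustrative).
n, d, K = 2000, 20, 10
X = rng.normal(size=(n, d))
W_teacher = rng.normal(size=(d, K))
y = np.argmax(X @ W_teacher + rng.normal(scale=0.5, size=(n, K)), axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_model(X, y, K, lr=0.5, steps=300):
    """Softmax regression trained by batch gradient descent (DNN stand-in)."""
    W = np.zeros((X.shape[1], K))
    Y = np.eye(K)[y]
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / len(X)
    return W

def codelength_bits(W, X, y):
    """Shannon codelength -sum_i log2 p(y_i | x_i) under the model W."""
    P = softmax(X @ W)
    return -np.log2(P[np.arange(len(y)), y]).sum()

# Prequential code: each block of labels is encoded with the model trained
# on all previous blocks only, then the model is retrained including it.
checkpoints = [8, 16, 32, 64, 128, 256, 512, 1024, n]
L_preq = checkpoints[0] * np.log2(K)        # first block: uniform code
for t_prev, t_next in zip(checkpoints[:-1], checkpoints[1:]):
    W = fit_model(X[:t_prev], y[:t_prev], K)
    L_preq += codelength_bits(W, X[t_prev:t_next], y[t_prev:t_next])

L_unif = n * np.log2(K)                     # uniform baseline n log2 K
W_final = fit_model(X, y, K)
L_final = codelength_bits(W_final, X, y)    # log-loss of the final model

print(f"uniform baseline    : {L_unif:9.1f} bits")
print(f"prequential code    : {L_preq:9.1f} bits")
print(f"final-model log-loss: {L_final:9.1f} bits (not a valid code by itself)")
print(f"learned regularity  : {L_preq - L_final:9.1f} bits in the parameters")
```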
And?
- Theoretically, three of the codes (two-part, Bayesian, and prequential) are asymptotically equivalent under strong assumptions, but deep learning models do not usually satisfy them.
- A weakness of prequential codes is the 'catch-up phenomenon': early parts of the sequence are encoded with models that have not yet caught up with the data, inflating the total codelength long after better models are available; this is efficiently addressed by 'switching' between models.
- The variational codelength exceeds the Bayesian one by the KL divergence between the variational distribution and the true posterior: L^{\text{var}}_{\beta}(y_{1:n} | x_{1:n}) = L^{\text{Bayes}}(y_{1:n} | x_{1:n}) + \mathrm{KL}\big(\beta \,\|\, p_{\text{Bayes}}(\theta | y_{1:n}, x_{1:n})\big), where \beta is the chosen variational distribution.
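A minimal numerical check of this identity on a discretized toy model (a Bernoulli likelihood with a gridded parameter and no inputs x; everything here is assumed for illustration, not the paper's setup):

```python
import numpy as np

# Discretized Bernoulli model: theta = p(y = 1), on a finite grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.full_like(theta, 1.0 / len(theta))     # uniform prior over the grid

# A toy observed label sequence.
y = np.array([1, 0, 1, 1, 1, 0, 1, 1])
ones, zeros = y.sum(), len(y) - y.sum()
loglik2 = ones * np.log2(theta) + zeros * np.log2(1 - theta)   # log2 p(y | theta)

# Bayesian code: L_Bayes = -log2 of the marginal likelihood.
marginal = np.sum(prior * 2.0 ** loglik2)
L_bayes = -np.log2(marginal)
posterior = prior * 2.0 ** loglik2 / marginal

def kl_bits(q, p):
    """KL divergence in bits between two distributions on the grid."""
    return np.sum(q * np.log2(q / p))

# An arbitrary (suboptimal) variational distribution beta over the grid.
beta = np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2)
beta /= beta.sum()

# Variational (bits-back) codelength: E_beta[-log2 p(y|theta)] + KL(beta || prior).
L_var = np.sum(beta * (-loglik2)) + kl_bits(beta, prior)

# The identity above: L_var = L_Bayes + KL(beta || posterior).
assert np.isclose(L_var, L_bayes + kl_bits(beta, posterior))
print(L_var, L_bayes, kl_bits(beta, posterior))
```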