Paper URL: http://proceedings.mlr.press/v119/locatello20a/locatello20a.pdf
What?
- Authors formally prove that if we have knowledge of how many factors change between a pair of observations, but don’t know which factors change, we can still learn disentangled representations from such pairs, i.e. disentanglement is identifiable.
- They provide a method which, given such pairs, learns a disentangled representation.
- Since in real-world settings we may not know exactly how many factors have changed, they provide a heuristic which estimates this directly from data.
Why?
It has been shown that unsupervised learning of disentangled representations is impossible (on i.i.d. examples) without appropriate inductive biases in the learning method and the data. Their training regime with paired (thus non-i.i.d.) observations circumvents this: the objective is modified to exploit the knowledge that the paired images share some factors, which acts as weak supervision.
How?
Heuristic to determine k, i.e. the number of changed or varying factors:
Intuitively, dimensions corresponding to the shared factors will have lower KL divergence D_{KL}(q(z_i|\textbf{x}_1) \,||\, q(z_i|\textbf{x}_2)). Using this, the authors define a threshold \tau as follows:
\begin{aligned} \tau &= \frac{1}{2} \big( \max_i \delta_i + \min_i \delta_i \big) \\ \text{where} \;\;\;\; \delta_i &= D_{KL}(q(z_i|\textbf{x}_1) \,||\, q(z_i|\textbf{x}_2)) \end{aligned}
Thus, dimensions with \delta_i \lt \tau are inferred as shared between \textbf{x}_1 and \textbf{x}_2 (and are coordinate-wise averaged across the pair; details in the following paragraph). The number of such shared dimensions gives us d - k, and since we already know d, we can easily recover k. Knowing k explicitly is, however, not required for training.
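The adaptive heuristic above can be sketched as follows, assuming the usual diagonal-Gaussian VAE posteriors parameterized by means and log-variances (the function names here are hypothetical, not from the paper’s code):

```python
import numpy as np

def kl_gaussians(mu1, logvar1, mu2, logvar2):
    """Per-dimension KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians."""
    var1, var2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def infer_shared_dims(mu1, logvar1, mu2, logvar2):
    """Dims with KL below the adaptive threshold tau are inferred as shared."""
    delta = kl_gaussians(mu1, logvar1, mu2, logvar2)
    tau = 0.5 * (delta.max() + delta.min())
    shared = delta < tau            # boolean mask over latent dimensions
    k = (~shared).sum()             # estimated number of changed factors
    return shared, k
```

Because \tau sits halfway between the smallest and largest per-dimension KL, the rule adapts per pair and needs no dataset-level tuning.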
Using knowledge of shared factors to enforce constraints:
Let S (resp. \bar{S}) be the set of indices of the shared (resp. non-shared) factors. Intuitively, we want the conditional probabilities to match across the pair for the shared factors. For example, if we know that both images are “red”, we would want, for the dimension corresponding to color, p(z_{color} = \text{red} | \textbf{x}_1) = p(z_{color} = \text{red} | \textbf{x}_2). Thus, we have the following complementary constraints:
\begin{aligned} p(z_i|\textbf{x}_1) = p(z_i|\textbf{x}_2) \;\;\; \forall i \in S, \\ p(z_i|\textbf{x}_1) \neq p(z_i|\textbf{x}_2) \;\;\; \forall i \in \bar{S} \end{aligned}
However, we don’t have access to the true posterior and instead approximate it with the variational distribution q(\textbf{z}|\textbf{x}), so we enforce these constraints on q(\textbf{z}|\textbf{x}) as well. Concretely, if a dimension is inferred to correspond to a shared factor, we average its posterior across the pair; otherwise we leave it unchanged.
q(z_i|\textbf{x}_{\{1,2\}}) = \begin{cases} Average(q(z_i|\textbf{x}_1) ,q(z_i|\textbf{x}_2) ) , & \forall i \in S \\ q(z_i|\textbf{x}_{\{1,2\}}) , & \text{else} \end{cases}
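One simple way to realize the Average() above, again assuming diagonal-Gaussian posteriors, is to average the Gaussian parameters coordinate-wise on the inferred-shared dimensions (the paper discusses variants of the averaging function; this sketch just picks one concrete choice):

```python
import numpy as np

def average_shared(mu1, logvar1, mu2, logvar2, shared):
    """Tie the two posteriors on inferred-shared dims by averaging their
    means and variances; non-shared dims are left unchanged."""
    avg_mu = 0.5 * (mu1 + mu2)
    avg_logvar = np.log(0.5 * (np.exp(logvar1) + np.exp(logvar2)))
    new1 = (np.where(shared, avg_mu, mu1), np.where(shared, avg_logvar, logvar1))
    new2 = (np.where(shared, avg_mu, mu2), np.where(shared, avg_logvar, logvar2))
    return new1, new2
```

After this step, both elements of the pair are encoded and decoded with identical posteriors on the shared dimensions, which is exactly the constraint from the previous equation.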
In total, over all such pairs, the authors optimize the following modified \beta-VAE objective, where q_\phi denotes the (averaged) approximate posterior from above:
\begin{aligned} \max_{\phi,\theta} \mathbb{E}_{(\textbf{x}_1,\textbf{x}_2)} & \bigg\{ \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_1)} \log(p_\theta(\textbf{x}_1|{\textbf{z}})) \\ &+ \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_2)} \log(p_\theta(\textbf{x}_2|{\textbf{z}})) \\ &- \beta D_{KL}( q_\phi({\textbf{z}}|\textbf{x}_1)||p({\textbf{z}})) \\ &- \beta D_{KL}( q_\phi({\textbf{z}}|\textbf{x}_2)||p({\textbf{z}})) \bigg\} \end{aligned}
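Putting the terms together, the per-pair objective can be sketched as below, assuming the reconstruction log-likelihoods have already been computed by a decoder and that the mu/logvar arguments are the averaged posteriors; names and the scalar interface are illustrative, not the paper’s code:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, var) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def pair_elbo(loglik1, loglik2, mu1, logvar1, mu2, logvar2, beta=1.0):
    """Modified beta-VAE objective for one (x1, x2) pair (to be maximized)."""
    return (loglik1 + loglik2
            - beta * kl_to_standard_normal(mu1, logvar1)
            - beta * kl_to_standard_normal(mu2, logvar2))
```

Apart from the posterior averaging, this is just the sum of two standard \beta-VAE ELBOs, one per element of the pair.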
Creating non-iid paired datasets:
To test the method, they create datasets of paired images such that each pair differs in at most k \in [1, d-1] FoVs (factors of variation). The method is then evaluated in two scenarios: (1) k is fixed throughout the dataset, and (2) k varies, i.e. pairs can share a variable number of FoV values.
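A minimal sketch of such a pairing procedure, assuming discrete factors of variation (as in dSprites-style datasets) with each factor taking at least two values; the function is hypothetical, intended only to show how a pair differing in exactly k factors could be generated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(num_values, k):
    """Sample two factor vectors differing in exactly k of the d factors.

    num_values[i] is the cardinality of factor i (each must be >= 2)."""
    d = len(num_values)
    f1 = np.array([rng.integers(n) for n in num_values])
    f2 = f1.copy()
    changed = rng.choice(d, size=k, replace=False)
    for i in changed:
        new = rng.integers(num_values[i])
        while new == f1[i]:          # resample until the value actually differs
            new = rng.integers(num_values[i])
        f2[i] = new
    return f1, f2
```

The two factor vectors would then be rendered into images by the dataset’s ground-truth generative process to form one weakly-supervised training pair.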
How does weak-supervision perform for learning disentanglement?
Their experiments over several datasets, values of k \in [1, d-1], and hyperparameter settings show the following trends:
- Representations learned using their weakly-supervised approach perform better than the unsupervised baseline (of 9000 models). Concretely, the weakly-supervised representations show better generalization under covariate shift, lower sample complexity on the downstream task of image-matrix completion / abstract reasoning, as well as higher fairness scores.
- The method succeeds across different values of k, though performance is better for lower k, i.e. when fewer factors vary. Similarly, when k is known in advance, this knowledge can be used to improve performance further.
And?
- They mention that they don’t require controlled data acquisition, yet the paired datasets were created synthetically using a generative process over which they had complete control. This isn’t really applicable in real-world settings, where we would only have unpaired / single examples and creating such pairs is extremely cumbersome, if not impossible. It could, however, work for temporal or spatial signal acquisition where observations change smoothly, which was provided as motivation.
- Creating pairs is also challenging when the domains of the factors vary smoothly, e.g. in MNIST: how would we find two images with the same ‘thickness’ or ‘rotation’ value? These are vague / approximate notions, so we probably need a notion of factors being “approximately shared” or “close enough in latent space”.