Blog: This Google Experiment Destroyed Some of the Assumptions of Representation Learning
Building knowledge from high-dimensional datasets is one of the fundamental challenges of modern deep learning applications. As humans, we are extremely proficient at reasoning in a small number of dimensions, but information represented in a large number of dimensions is mostly incomprehensible to us. One aspect of human cognition that proves helpful when understanding high-dimensional datasets is our ability to decompose the world into smaller and somewhat disconnected pieces of knowledge. In the context of deep learning, the equivalent of that skill is known as disentangled representations. Recently, artificial intelligence (AI) researchers from Google published a paper that challenges the traditional understanding of disentangled representations.
The idea of disentangled representations is that an agent would benefit from separating out (disentangling) the underlying structure of the world into disjoint parts of its representation. This idea draws inspiration from physics, where the study of symmetries suggests that a world can be modeled in different layers connected by symmetry transformations. A symmetry transformation is one that leaves certain properties of the source object invariant. One of the best-known expressions of this idea is Noether's theorem, which states that every continuous symmetry of a physical system has a corresponding conservation law. For example, the conservation of energy arises from time translation symmetry, the conservation of momentum arises from space translation symmetry, and the conservation of angular momentum arises from rotational symmetry. Disentangled representations attempt to decompose an environment into different representations that are easier to learn and virtually disconnected from each other. From a theoretical standpoint, disentangled representations are part of a broader branch of deep learning known as representation learning.
Disentangled Representations and Representation Learning
Conceptually, representation learning can be defined as a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. The goal of representation learning is to replace manual feature engineering, which is completely impractical in high-dimensional datasets. From a more mathematical standpoint, one of the key assumptions of representation learning is that real-world observations x (e.g., images or videos) are generated by a two-step generative process. First, a multivariate latent random variable z is sampled from a distribution P(z). Intuitively, z corresponds to semantically meaningful factors of variation of the observations (e.g., content and position of objects in an image). Then, in a second step, the observation x is sampled from the conditional distribution P(x|z). The key idea behind this model is that the high-dimensional data x can be explained by the substantially lower-dimensional and semantically meaningful latent variable z, which is mapped to the higher-dimensional space of observations x.
Disentangled representations extend the basic principles of representation learning with the idea that the most effective knowledge representations of an environment are based on independent (disentangled) features, such that if one feature changes, the others remain unaffected. While there is no single, widely accepted formal notion of disentanglement, the key intuition is that a disentangled representation should separate the distinct, informative factors of variation in the data.
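To make the two-step process concrete, here is a minimal sketch in Python. Everything specific in it is an assumption chosen for illustration: the standard normal prior for P(z), the random linear map followed by tanh standing in for the true generative mechanism, the Gaussian noise model for P(x|z), and the function name sample_observations.

```python
import numpy as np

def sample_observations(n, latent_dim=3, obs_dim=50, noise_std=0.1, seed=0):
    """Toy two-step generative process: z ~ P(z), then x ~ P(x|z)."""
    rng = np.random.default_rng(seed)
    # Step 1: sample low-dimensional latent factors z from P(z).
    # Assumption: P(z) is a standard normal distribution.
    z = rng.standard_normal((n, latent_dim))
    # Hypothetical generative mechanism: a fixed random linear map plus tanh.
    weights = rng.standard_normal((latent_dim, obs_dim))
    mean_x = np.tanh(z @ weights)
    # Step 2: sample x from P(x|z), here a Gaussian centered on mean_x.
    x = mean_x + noise_std * rng.standard_normal((n, obs_dim))
    return z, x

z, x = sample_observations(n=1000)
print(z.shape, x.shape)  # (1000, 3) (1000, 50)
```

The point of the sketch is only the shape of the problem: 50 observed dimensions are fully explained by 3 latent factors plus noise, and a representation learner sees x but never z.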
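The "change one factor, the others stay put" intuition can be checked with a small experiment. The sketch below is a toy under stated assumptions: the two encoders are hypothetical and act directly on the ground-truth factors, which a real model never has access to. It simply counts which representation dimensions move when a single factor is perturbed.

```python
import numpy as np

def changed_dims(encode, z, factor, delta=1.0, tol=1e-9):
    """Indices of representation dimensions that move when one factor changes."""
    z_new = z.copy()
    z_new[factor] += delta
    diff = np.abs(encode(z_new) - encode(z))
    return np.flatnonzero(diff > tol)

# Hypothetical disentangled encoder: each output depends on exactly one factor.
scales = np.array([2.0, -1.0, 0.5])
def disentangled(z):
    return scales * z

# Hypothetical entangled encoder: a 45-degree rotation mixes factors 0 and 1.
theta = np.pi / 4
mix = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
def entangled(z):
    return mix @ z

z = np.array([0.3, -1.2, 0.7])
print(changed_dims(disentangled, z, factor=0))  # [0]
print(changed_dims(entangled, z, factor=0))     # [0 1]
```

Both encoders keep all the information about the factors. The difference is only in how that information is laid out across dimensions, which is what disentanglement metrics try to quantify.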
Consider the following image dataset in which each panel represents one factor that could be encoded into a vector representation of the image. The model shown is defined by the shape of the object in the middle of the image, its size, the rotation of the camera and the color of the floor, the wall and the object.
A disentangled representation captures the knowledge of this environment in a 10-dimensional vector. In the image below, the top right and the top middle panels show that the model has successfully disentangled floor color, while the two bottom left panels indicate that object color and size are still entangled.
Given that disentangled representations focus on learning representations about an environment, it seems like a perfect unsupervised learning task. Over the years, the deep learning community has produced different unsupervised methods to learn disentangled representations. Most of those methods are based on variational autoencoders and, although they have proven effective in some narrow scenarios, there is very little empirical evidence showing whether these methods generalize.
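As a rough illustration of how a VAE-based method can push toward disentanglement, the sketch below computes a β-VAE-style loss in NumPy. It rests on several assumptions: a diagonal Gaussian encoder, a standard normal prior, squared error as the reconstruction term, and an arbitrary beta=4.0. It is not the training code of any particular model or library.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Per-example loss: squared reconstruction error plus beta times the KL term.

    beta=4.0 is an illustrative value, not a recommendation.
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + beta * gaussian_kl(mu, log_var)

# Two examples with perfect reconstruction; only the KL term contributes.
x = np.ones((2, 5))
x_recon = x.copy()
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
print(beta_vae_loss(x, x_recon, mu, log_var))  # [0. 2.]
```

With beta greater than 1, the encoder is penalized more heavily for deviating from the factorized prior, which is the regularization pressure these methods rely on.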
The Google Experiment
The Google team evaluated six different state-of-the-art models (BetaVAE, AnnealedVAE, FactorVAE, DIP-VAE I/II and Beta-TCVAE) using six disentanglement metrics (BetaVAE score, FactorVAE score, MIG, SAP, Modularity and DCI Disentanglement). In total, the experiment trained and evaluated 12,800 such models on seven data sets. The study reached some surprising conclusions that challenge the common wisdom about using unsupervised learning for disentangled representations.
First, the Google study proved theoretically that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the considered learning approaches and the data sets. This means that specific inductive biases, such as the form of regularization or the choice of neural network architecture, are required in order to learn disentangled representations.
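One way to build intuition for this impossibility result is a special case, chosen here for illustration and not taken from the paper's general construction. If the true factors follow an isotropic Gaussian, then rotating them produces a different, fully mixed set of factors with exactly the same distribution. A model that only sees the data has no way to prefer one over the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# "True" independent factors (assumption: standard normal, two dimensions).
z = rng.standard_normal((200_000, 2))

# Alternative factors: the same samples rotated by 45 degrees.
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
z_rot = z @ rotation.T

# Both sets of factors have (approximately) identity covariance,
# so their distributions are indistinguishable.
cov_z = np.cov(z, rowvar=False)
cov_rot = np.cov(z_rot, rowvar=False)
print(np.round(cov_z, 2))
print(np.round(cov_rot, 2))

# Yet each rotated coordinate mixes both original factors:
# the correlation below is close to sin(theta), about 0.71.
corr = np.corrcoef(z[:, 0], z_rot[:, 1])[0, 1]
print(round(corr, 2))
```

Any generative model built on z can be rewritten in terms of z_rot and produce the same observations, so without an extra bias there is nothing in the data that singles out the disentangled version.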
In addition to that initial result, the study did not find any empirical evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised way, since random seeds and hyperparameters seem to matter more than the model choice. The following figure illustrates that the choice of random seed across different runs has a larger impact on disentanglement scores than the model choice and the strength of regularization. A good run with a bad hyperparameter can easily beat a bad run with a good hyperparameter.
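If you have your own table of scores, a comparison of this kind can be approximated with a few lines of NumPy. The helper below is a hypothetical sketch, not the analysis used in the study, and the numbers in the toy table are made up purely to exercise the function.

```python
import numpy as np

def seed_vs_hparam_spread(scores):
    """Compare score variability across seeds and across hyperparameters.

    scores[i, j] is the metric for hyperparameter setting i and random seed j.
    """
    scores = np.asarray(scores, dtype=float)
    # Average standard deviation across seeds for a fixed setting.
    across_seeds = scores.std(axis=1).mean()
    # Standard deviation of the per-setting averages.
    across_hparams = scores.mean(axis=1).std()
    return across_seeds, across_hparams

# Made-up placeholder scores: 2 hyperparameter settings x 3 seeds.
toy_scores = [[0.2, 0.8, 0.5],
              [0.3, 0.9, 0.6]]
seeds, hparams = seed_vs_hparam_spread(toy_scores)
print(seeds > hparams)  # True
```

When the first number dominates the second, as in this toy table, picking a "better" hyperparameter tells you less about the final score than the luck of the seed.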
Finally, for the considered models and data sets, the study was unable to validate the assumption that disentanglement is useful for downstream tasks, e.g., that with disentangled representations it is possible to learn with fewer labeled observations. The following matrix shows the rank correlations between disentanglement metrics and downstream performance, measured as accuracy and efficiency. As you can see, there are no clear correlations in the results.
In addition to the research paper, Google open sourced disentanglement_lib, a library that makes it possible to reproduce the experiments with a few commands. Google's work constitutes the largest empirical study of disentangled representations conducted to date. Some of the results should definitely influence the direction of research in this important area of deep learning.
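For readers who want to run a similar check on their own models, a rank correlation can be computed as in the sketch below. The spearman helper is a simple NumPy version that assumes no tied values, and the two arrays are random placeholders, not results from the study.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (assumes no ties in either input)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Random placeholder values, NOT results from the study.
rng = np.random.default_rng(0)
disentanglement_score = rng.random(50)  # one score per trained model
downstream_accuracy = rng.random(50)    # matching downstream accuracy
print(spearman(disentanglement_score, downstream_accuracy))
```

A value near 1 would mean that models ranked as more disentangled also rank higher on the downstream task, while a value near 0 means the ranking by disentanglement says little about downstream performance.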