Blog: Are generative models good enough? A case study on class modeling
A case study on class modeling that examines whether generative models are good enough at approximating the density distributions of patterns.
Motivation: Visual Concept Learning
Visual concept learning is an essential subtask of semantic vision.
Consider the tasks of image caption generation or visual question answering. These tasks presuppose the capability of an agent to say that some objects with certain attributes and in certain states are present in the image (e.g. “the happy woman is dancing”), i.e., to associate an image representation with words forming the corresponding concepts.
One can try to train an image caption generation or visual question answering model end-to-end hoping that it will learn the necessary concepts on its own together with learning how to solve the downstream task. However, it is quite tricky, and the quality of learned concepts appears to be rather low.
When it is not known which part of the image corresponds to what concept, error signals are very weak and not too informative. That is why image representations are usually pre-trained with strong supervision (e.g., on ImageNet and/or Visual Genome).
Why can we not just use a pre-trained classifier to recognize visual concepts? Unfortunately, we do not have full information about labels that should and should not be assigned to certain images or their regions. For example, if some bounding box in the image is labeled as a T-shirt, it does not mean that it cannot be labeled as a person or a woman. If it is labeled as red, it can also be partially green, and so on.
Thus, it is natural to state the task of training by positive examples — only assuming that classes are not mutually exclusive and can intersect arbitrarily.
We want our agent to be able to tell if something is, say, a cat — independent of its knowledge of all possible alternatives. Even if it has never seen a platypus, it could say that this image does not correspond to any known concept (except for ‘an animal’ possibly). This statement (known as one-class classification or class modeling) corresponds to anomaly or novelty detection tasks and is of considerable practical interest.
When it is necessary to describe a certain class and then be able to identify its elements, distribution of elements of this class (a membership function or a probability density function) should be modeled.
Generative models are intended precisely for this purpose, since they are trained to approximate the likelihood function p(x) by maximizing it on observed patterns x. If some generative models ideally describe the distributions of some classes, then the task of recognition can also be ideally solved. In other words, if such models of two classes assign non-zero probability to a certain image, then either the classes really overlap (and the image belongs to both of them), or the image is not informative enough (low contrast, small, blurry, etc.) for telling which of the two mutually exclusive classes it belongs to.
However, it is well-known that modeling the whole density distribution of patterns in classes is much more complicated than just drawing a border between them — which is the reason for discriminative models being much more practical in the limited supervised learning settings.
Recently, we have seen considerable progress in deep generative models achieving the capability to generate diverse high fidelity images (meaning that samples, which are considered likely by the model, are also likely in terms of a distribution of real images). However, are they good enough in terms of approximating density distributions of patterns?
The simplest probabilistic models to describe pattern distributions within each class are Gaussians. With their example, one can understand both the meaning of generative models for concept learning, as well as some possible difficulties.
It is evident that in some cases Gaussians can approximate real distributions awfully, so it would not be possible to distinguish separable classes (an example is shown in the following picture).
Nevertheless, such models can be reasonably good in some cases and can serve as a simple baseline.
Let us consider MNIST as an example and try to describe distributions of images of each digit with Gaussians. Here, we immediately face the first technical issue, i.e., the covariance matrix is degenerate, and its inversion required by the equation of Gaussian distribution cannot be directly computed. The reason for this is that some pixels in the images of the training set are always black (or always white, or brightness of some pairs of pixels is 100% correlated), i.e., the training images lie on some lower-dimensional hyperplane in the space of all possible images.
We can approximate the distribution of images by a Gaussian on this hyperplane, but it will have zero variance in the directions perpendicular to the hyperplane, and thus the probability assigned to images not lying on it will be strictly zero. At the same time, the test set will contain images (of the same classes) that do not lie on the corresponding hyperplanes. Even a model as simple as an (unconstrained) normal distribution overfits the data.
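The degeneracy described above can be reproduced on a toy problem. The following is a minimal sketch on synthetic data rather than MNIST; the dimensions, the relative tolerance `1e-8`, and the `ridge` value are arbitrary illustrative choices, and adding a small multiple of the identity to the covariance is just one simple regularization, not necessarily the one used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim, intrinsic_dim = 200, 20, 5

# Synthetic stand-in for flattened images: every sample lies in a
# 5-dimensional subspace of a 20-dimensional "pixel" space.
basis = rng.normal(size=(intrinsic_dim, dim))
train = rng.normal(size=(n_samples, intrinsic_dim)) @ basis

mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False)

# Count the directions with non-negligible variance.
eigvals, eigvecs = np.linalg.eigh(cov)
significant = eigvals > 1e-8 * eigvals.max()
rank = int(significant.sum())
null_dirs = eigvecs[:, ~significant]

# A generic test point has a non-zero component along the zero-variance
# directions, so the degenerate Gaussian assigns it zero density.
test_point = mean + rng.normal(size=dim)
off_plane = np.linalg.norm((test_point - mean) @ null_dirs)


def gaussian_logpdf(x, mean, cov):
    """Log-density of a Gaussian with a non-singular covariance matrix."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mahalanobis + logdet + len(x) * np.log(2 * np.pi))


# Adding a small variance in every direction makes the density finite.
ridge = 1e-2
reg_logpdf = gaussian_logpdf(test_point, mean, cov + ridge * np.eye(dim))

print("covariance rank:", rank, "of", dim)
print("distance from the hyperplane:", off_plane)
print("regularized log-density:", reg_logpdf)
```

The covariance has rank 5 out of 20 here, so it cannot be inverted directly, while the regularized version assigns a finite (if very low) log-density to the off-plane point.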
The Bayesian approach tells us that some priors should be imposed on the parameters of the model (e.g., the inverse-Wishart distribution can be used), but considering this problem from a simplified coding perspective could be more intuitive. Fitting a Gaussian distribution corresponds to Principal Component Analysis (PCA), since the principal components are the eigenvectors of the covariance matrix.
Patterns can be represented by their projections onto these components (the projections serve as the latent code of the pattern). However, if the pattern lies outside the subspace spanned by these components, its reconstruction will be imprecise. Thus, a lossless encoding of an input pattern consists of its latent code (whose length, in bits, can be derived from the estimated Gaussian distribution) and the reconstruction residuals (whose description length is determined by the reconstruction error). Both of these components can characterize the degree to which a pattern belongs to the modeled class.
Let us first consider the probability assigned by the Gaussian distribution. The source code for all experiments can be found on GitHub.
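The two quantities mentioned above (the negative log-likelihood of the latent code and the reconstruction error) can be computed with a few lines of NumPy. This is a minimal sketch on synthetic data, not the code used for the experiments; the function names `fit_pca` and `pca_scores`, the helper `sample_class`, and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_pca(train, n_components):
    """Return the mean, principal directions and per-component variances."""
    mean = train.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:n_components]
    variances = singular_values[:n_components] ** 2 / (len(train) - 1)
    return mean, components, variances


def pca_scores(x, mean, components, variances):
    """Latent-code negative log-likelihood and squared reconstruction error."""
    z = (x - mean) @ components.T
    latent_nll = 0.5 * np.sum(
        z ** 2 / variances + np.log(2 * np.pi * variances), axis=1
    )
    reconstruction = z @ components + mean
    recon_error = np.sum((x - reconstruction) ** 2, axis=1)
    return latent_nll, recon_error


rng = np.random.default_rng(0)
dim, intrinsic_dim, noise = 30, 5, 0.05


def sample_class(basis, n):
    """Noisy samples near the subspace spanned by the rows of `basis`."""
    latent = rng.normal(size=(n, intrinsic_dim))
    return latent @ basis + noise * rng.normal(size=(n, dim))


real_basis = rng.normal(size=(intrinsic_dim, dim))
fake_basis = rng.normal(size=(intrinsic_dim, dim))
real_train = sample_class(real_basis, 500)
real_test = sample_class(real_basis, 200)
fake_test = sample_class(fake_basis, 200)

# The model sees only "real" training patterns.
model = fit_pca(real_train, n_components=intrinsic_dim)
real_nll, real_err = pca_scores(real_test, *model)
fake_nll, fake_err = pca_scores(fake_test, *model)

print("mean reconstruction error, real:", real_err.mean())
print("mean reconstruction error, fake:", fake_err.mean())
```

In this toy setting the "fake" class lies in a different subspace, so its reconstruction error is much larger; on real images the two classes overlap far more.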
Log-likelihood in the PCA subspace
If we train our model on images of one digit and test it on images of another digit, we can observe the following histograms of (shifted negative logarithms of) probabilities assigned to the patterns (their projections onto PCA subspace) by the model with different sizes of the latent code.
With 10 latent variables, probabilities for “real” and “fake” patterns are badly separated, while this separation for 100 latent variables is reasonable, but far from perfect (which is expected, considering the simplicity of the model).
If one takes more similar digits like “3” and “5”, the separation will be quite bad even for 100 latent variables.
It would be useful to quantitatively estimate the degree of separation of “real” patterns (those whose distribution the model is constructed to describe) from “fake” patterns by imposing some threshold on the log-likelihood. For example, the optimal threshold gives 15.1% “recognition” errors when trying to identify “real” images of “3” in the presence of “5” as “fake” images.
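Choosing the optimal threshold can be sketched as a simple scan over candidate values. The snippet assumes that the error is the fraction of all test patterns ("real" and "fake" together) that fall on the wrong side of the threshold, and that lower scores mean "more real"; the exact error definition used in the experiments may differ, and the function name `best_threshold_error` is hypothetical.

```python
import numpy as np


def best_threshold_error(real_scores, fake_scores):
    """Scan thresholds on a score (lower = more "real") and return the
    threshold with the smallest fraction of misclassified patterns."""
    real_scores = np.asarray(real_scores, dtype=float)
    fake_scores = np.asarray(fake_scores, dtype=float)
    scores = np.concatenate([real_scores, fake_scores])
    is_real = np.concatenate([
        np.ones(real_scores.size, dtype=bool),
        np.zeros(fake_scores.size, dtype=bool),
    ])
    # -inf covers the case where everything is rejected as "fake".
    candidates = np.concatenate([[-np.inf], np.unique(scores)])
    best_threshold, best_error = None, np.inf
    for threshold in candidates:
        predicted_real = scores <= threshold
        error = np.mean(predicted_real != is_real)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Toy scores: one "fake" pattern (3.5) falls inside the "real" range.
threshold, error = best_threshold_error(
    [1.0, 2.0, 3.0, 4.0], [3.5, 5.0, 6.0, 7.0]
)
print(threshold, error)  # 3.0 0.125
```

Note that the "fake" scores are used only to pick the threshold and measure the error; the model itself is fitted without them.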
This result can be slightly improved if we find the most appropriate size of the latent code. For example, 14.25% of “recognition” errors can be achieved for 84 latent variables while separating “3” and “5”:
It should be noted that the distribution of “fake” patterns is not used here (except for choosing a threshold), so images of any other class could be considered. We just took a similar class as a difficult case. As was noted earlier, the task is to construct a model for a class of patterns given only its positive examples — to make it possible to successfully answer the question if some pattern belongs to this class even if it belongs to some class which has never been encountered before.
This task is more complicated than the traditional classification task. Since shallow models give far from perfect results even for classification, the achieved recognition rate should not be immediately discouraging, and it is natural to ask if deep neural networks can achieve a satisfactory precision in the modeling of distributions of at least MNIST images.
Let us consider adversarial autoencoders (AAE). They, in essence, perform a sort of non-linear PCA — its encoder E(x) projects input pattern x into the space of latent variables z, while its decoder or generator G(z) performs the reconstructions. To ensure that the posterior distribution of latent variables has the desirable form (e.g., standard normal distribution), an additional adversarial loss is imposed with the use of a discriminator that tries to distinguish latent codes produced by the encoder from random vectors sampled from a prior distribution.
After training, one can sample z from the specified prior distribution, e.g., z~N(0, I), and generate image x=G(z). If the model is good, generated images will be distributed like real images. But how to estimate the probability density for some real pattern x, which is implicitly assigned by the model? Is it correct to substitute the latent code constructed by the encoder z=E(x) into the prior distribution N(0, I)?
Not precisely. This can be easily seen if we take a uniform distribution as the prior. All values of z (in a certain range) will be equiprobable, while the non-linear generator can produce arbitrary distributions of x from these z:
If we have some density of patterns p(z) around z and want to estimate their density after the transformation G(z), we need to calculate how a unit volume is changed by G at this point. In brief, p(z) should be multiplied by |det(inv(J))|, where J is the Jacobian matrix of G at z. More precisely, since this matrix is usually not square, p(z) should be divided by the product of its singular values. However, we will skip the technical details here.
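The volume correction can be illustrated numerically. The sketch below assumes a differentiable generator whose Jacobian has full column rank, approximates the Jacobian by central differences, and subtracts the sum of the log singular values from the log-prior. The names `numerical_jacobian`, `log_density_on_manifold` and `standard_normal_logpdf` are hypothetical, and the linear "generator" is only a toy stand-in for a trained decoder.

```python
import numpy as np


def numerical_jacobian(g, z, eps=1e-6):
    """Central-difference Jacobian of g at z, with shape (dim_x, dim_z)."""
    z = np.asarray(z, dtype=float)
    out_dim = np.asarray(g(z)).size
    jac = np.empty((out_dim, z.size))
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        jac[:, i] = (np.asarray(g(z + step)) - np.asarray(g(z - step))) / (2 * eps)
    return jac


def log_density_on_manifold(g, z, log_prior):
    """log p(z) minus the log of the local volume change of g at z."""
    singular_values = np.linalg.svd(numerical_jacobian(g, z), compute_uv=False)
    return log_prior(z) - np.sum(np.log(singular_values))


def standard_normal_logpdf(z):
    z = np.asarray(z, dtype=float)
    return -0.5 * (z @ z) - 0.5 * z.size * np.log(2 * np.pi)


# Toy linear generator from a 2-d latent space into a 3-d pattern space.
A = np.array([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
z0 = np.array([0.3, -0.2])
log_px = log_density_on_manifold(lambda z: A @ z, z0, standard_normal_logpdf)

# For this generator a unit area is stretched by a factor of 2 * 3 = 6.
print(log_px, standard_normal_logpdf(z0) - np.log(6.0))
```

For a real decoder the Jacobian would normally come from automatic differentiation rather than finite differences, which become expensive for large latent codes.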
Let us consider an adversarial autoencoder. We used the encoder with three convolutional and two dense layers, and the decoder with one dense and four convolutional layers. The discriminator used in the adversarial loss computed over the latent code contained three dense layers. Batch normalization and spectral regularization were used.
Let us consider the negative log-likelihood (or simply the norms) of the codes (projections) z=E(x), where E is the encoder.
By choosing an optimal threshold, 26.6% and 14.2% of recognition errors can be obtained.
Using the probability density in the space of input patterns x improves these scores by some percentages. For example, the same trained model with 10 latent variables can show improvement from 27.19% to 23.44% errors.
However, it is apparent that overlapping distributions in the space of z will remain overlapping in the space of x, and improved criteria will not help to distinguish them.
So, let us consider the distribution of each component of z (for ten components) for “real” and “fake” classes.
It can be seen that latent variables of “real” patterns are distributed normally with zero mean (and with very low correlations in fact). In turn, some latent variables of “fake” patterns are shifted, but still not too far, and the distributions of “real” and “fake” patterns overlap considerably.
Accounting for Reconstruction Error
It appears that projections of “fake” patterns can get inside the area of the “real” class, for which the generative model is constructed, and projections of both classes can appear to be distributed almost identically. In this case, the generator will reconstruct “fake” patterns as “real” patterns (e.g., images of “5” will be reconstructed as “3”). At the same time, the model can appear to reconstruct “fake” patterns as well as “real” patterns. In both cases, the model can describe the distribution of “real” patterns equally well as shown in the following illustration.
AAE trained to generate images of “3” can reconstruct images of “5” both as “3” and “5” depending on the size of the latent code:
This means that projections of “fake” patterns can appear both inside and outside the region of projections of “real” patterns, and considering only this projection cannot be enough to judge if the pattern is “real” or “fake”.
As in the case of PCA, the model describes the distribution of patterns only in their projection to some manifold, while even “real” patterns can lie outside it — and the question of evaluating the probability density outside this manifold remains.
The simplest way is to use the Euclidean distance from the pattern to its projection, i.e., to calculate the reconstruction error. This corresponds to the following extension of the generative model: x=G(z)+e, where z~N(0, I) as before and e~N(0, Is), where s is some scalar.
Let us separately consider the possibility of performing “recognition” solely on the basis of the reconstruction error.
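The link between this extended model and the reconstruction error can be written out explicitly. The following is a sketch under stated assumptions: D denotes the number of pixels and d the size of the latent code (both symbols are introduced here only for illustration), s is treated as the noise variance, and the encoder output z=E(x) is used as a point estimate instead of integrating over z.

```latex
\begin{aligned}
-\log p(x \mid z) &= \frac{\lVert x - G(z) \rVert^2}{2s} + \frac{D}{2}\log(2\pi s), \\
-\log p(z) &= \frac{\lVert z \rVert^2}{2} + \frac{d}{2}\log(2\pi), \\
-\log p(x, z) &= -\log p(x \mid z) - \log p(z).
\end{aligned}
```

For a fixed s, the first line is a monotone function of the squared reconstruction error, so thresholding the reconstruction error is equivalent to thresholding this term of the negative log-likelihood.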
The minimum error of 13.49% is achieved for 16 latent variables, which is similar to the error achieved with the use of the estimated density, but for smaller latent code (since for larger latent code, it is easier to achieve better reconstruction for “fake” patterns also).
The situation with AAE is similar, but the results are considerably better.
In particular, the error appears to be 6.8% for ten latent variables.
The whole generative model with pixel noise supposes that both components should be used to calculate the probability density.
Can “recognition” be improved by the use of their combination?
Unfortunately, it is not precisely the case. What we can obtain with this combination is nearly constant recognition rate independent of the size of the latent code. However, this rate corresponds to recognition rates with optimal sizes of the latent code, e.g., for PCA one can obtain:
Thus, it is enough to use the reconstruction error with an appropriate latent code size. It is interesting to note that some papers propose more elaborate criteria incorporating density estimates on manifolds. In fact, however, the latent code size used in these models is optimal for the reconstruction error criterion, and in our experiments this criterion gave mostly the same recognition results for these models as the more elaborate ones. Of course, if someone wants to use models with larger latent codes, for which the reconstruction error criterion gives poor results, taking into account the estimated density in the projection onto the manifold would be essential.
Generative Adversarial Networks
Although the result of AAE is not bad, it is far from perfect. One can assume that AAEs are not too good as generative models. Models from the Generative Adversarial Networks (GANs) family are directly trained to generate images instead of reconstructing them, and they can produce sharper images.
Can GANs be better in describing density distributions of patterns (i.e., better literally as probabilistic generative models)? However, GANs should have an encoder to be applicable for our task.
One of the models with an encoder is BiGAN. Unfortunately, this model appears to be worse than the considered AAE. For example, with 100 latent variables, one can obtain the following result with 31.6% and 18.5% error rates respectively:
Similarly, the InfoGAN model can be tested. Although this model separates meaningful (structured) variables from noise variables in its latent code, and its Q-network, which can be used in place of the encoder, calculates only values of meaningful variables, some reconstruction can be performed (e.g., with finding best values of noise variables via sampling).
In particular, InfoGAN with a small number of meaningful variables (and a larger number of noise variables) behaves similarly to AAE with a small size of its latent code. For example, the InfoGAN model trained on the images of “3” will give the following reconstruction:
The reconstruction error criterion should work better in this case, but the recognition error appears to be high (31.1%):
The reason is that the reconstruction of “real” patterns has considerably larger errors than that of AAE, i.e., it is far from precise on the pixel level. This result can be somewhat improved with a careful choice of the latent code structure, but not much (up to 20% errors).
Although InfoGANs can perform a reasonable unsupervised clustering given all mutually exclusive classes during training and given an appropriate structure of the latent code, it seems they do not model pattern distributions quite well (however, this is not a solid conclusion).
Although some papers argue that AAE is better suited for the class modeling task than Variational Autoencoders (VAE), VAE appeared to be comparable to or even better than AAE.
We took β-VAE, a recent extension of VAE used for disentangled representation and concept learning. However, the best results were obtained with β=1, which corresponds to the ordinary VAE (although one would expect better disentanglement to be connected with better class modeling). We used four convolutional layers and one dense layer in the encoder, and one dense and five (transpose) convolutional layers in the decoder. Among three sizes of the latent code (10, 20, 100), the best results were obtained with size 20, solely on the basis of the reconstruction error criterion, with a 5.4% recognition error:
Interestingly, the reconstruction of “real” patterns appears to be quite good for very different images of “3”, while “fake” patterns are still reconstructed as “3”.
Unfortunately, this result is still not excellent. Indeed, if we use the nearest neighbor one-class classifier, which merely compares the distance to the nearest neighbor in the class being modeled with some threshold (instead of comparing the distances to the nearest neighbors in all classes), we can achieve an error rate of about 10%. If we use some additional tricks (earth mover's distance and deskewing), the error rate can be reduced to 6%, which is nearly the same as with the use of deep generative models.
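The nearest neighbor one-class baseline is simple enough to sketch in a few lines. This version assumes plain Euclidean distance on flattened patterns (without the earth mover's distance or deskewing tricks mentioned above); the function names and the toy data are hypothetical.

```python
import numpy as np


def nearest_neighbor_distance(train, queries):
    """Euclidean distance from each query to its nearest training pattern."""
    squared = (
        np.sum(queries ** 2, axis=1)[:, None]
        - 2.0 * queries @ train.T
        + np.sum(train ** 2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(squared.min(axis=1), 0.0))


def one_class_nn_predict(train, queries, threshold):
    """Accept a query as "real" if its nearest neighbor is close enough."""
    return nearest_neighbor_distance(train, queries) <= threshold


# Toy example: two "real" training patterns and two queries.
train = np.array([[0.0, 0.0], [1.0, 0.0]])
queries = np.array([[0.0, 0.1], [5.0, 5.0]])
distances = nearest_neighbor_distance(train, queries)
accepted = one_class_nn_predict(train, queries, threshold=1.0)
print(distances, accepted)
```

Only patterns of the modeled class are stored, so, as with the generative models, no examples of other classes are needed to build the classifier.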
This indicates that the density distribution of patterns is modeled by DNNs far from perfectly (in terms of capturing underlying regularities). Indeed, if we try to train the model on the images of “1”, the error rate will be about 1% while recognizing “1” against all other classes. At the same time, recognizing “8” against all other classes gives more than 11% error rate, implying that the deep models describe a more complex density distribution of images of “8” — much worse than “1” (in particular, images of “8” and “1” will be more frequently confused if we model the distribution of images of “8” than if we model “1”).
Also, if we take even more complex images, the result will still be unsatisfactory. One can argue that it would be too naïve to expect good results from such direct modeling (with not too deep networks and without showing images of any other class to the model) of the distribution of images corresponding to such concept as, for example, “person”. That is true, but still, the considered models achieve state-of-the-art results on MNIST, and these results cannot be called satisfactory.
If we cannot model with enough precision the distribution of centered images of “3” or “8”, which have low variability, what is wrong?
One can argue that computations performed by formal neurons are not well-suited to efficiently represent the necessary distributions, and some extensions like generative CapsNets or HyperNets are necessary.
However, this might not be the only and main reason. The problem may stem from the vagueness of the task statement. We do not want our generative model simply to describe the distribution of patterns in the training set. We want it to generalize, so we can calculate the probability density for new patterns and get non-zero values for patterns belonging to the modeled class. But, what class?
Should a model trained on the MNIST images of “5” assign a non-zero probability to slightly rotated images of “5” or images with an additional small stroke? If the model does not recognize such images as “5”, we could blame it for bad generalization.
However, if it assigns a high probability for images of other digits, is it that wrong since all such images belong to the same class of handwritten digits?
For example, if one trains a GAN model on images of one digit, and tries to run its discriminator on images of other digits, the discriminator will recognize them as “real” images. If the generator produced images of “5” instead of “3”, the discriminator could easily be able to learn to distinguish real images from fake, so this seems like an unexpected and incorrect behavior of the discriminator. However, one can also claim that the discriminator performed a nice generalization inducing the notion of a handwritten symbol.
We cannot hope that our generative model will guess our intentions and correctly generalize that “5” with an additional stroke or almost without one stroke is still “5” while “3” is not a badly written “5”.
However, we may want our model to tell what is the specific difference between a new image and the previously seen images of the modeled class. We hope to describe in one of our future posts, how this can be done with the use of the representational minimum description length principle and module networks.