Measuring what makes the scene
When an experienced photographer prepares to take a studio portrait, they take careful control of the room, knowing how light works in harmony with their subject's body language, facial structure and expressions.
The photographer then takes great pains to adjust settings such as focal length and exposure, and to choose the right type and size of lens, to produce the intended visual effect. Most of us will appreciate the beauty of the resulting photograph, while a few trained eyes will also appreciate the technical mastery behind it.
How can we give a machine the capability to figure out the underlying factors involved in the production of a picture? An emerging area of machine learning that attempts to answer this question is known as learning disentangled representations. A representation is how the machine "sees" the world. The machine's view may not make sense to humans: the important concepts (called factors) are all mixed together, or entangled, with one another. A disentangled representation is composed of multiple factors that are sufficiently independent of each other that a change in one factor does not change another. For example, a photographer can change the amount of light while the model's pose stays the same.

If a machine could have a disentangled view of the world, it could be more effective at making decisions. For example, if the task were to identify a person, only the facial structure should matter. Other factors such as lighting and pose can safely be ignored because they are irrelevant to the task at hand.
The problem is that we do not know a mathematically precise way of defining disentanglement that would enable a machine to proceed with its analysis. Machines, after all, are still not good at handling "fuzzy" concepts. And even if we did have a way to disentangle the concepts, we lack robust metrics for evaluating the quality of the disentanglement. This makes it hard to know whether these algorithms are doing a good job at disentangling!
Our latest paper, titled "Theory and Evaluation Metrics for Learning Disentangled Representations" and led by our recent PhD graduate, Kien Do, attempts to solve this puzzle from the perspective of information theory, a branch of mathematics that quantifies information content and the relations between pieces of information. The paper was just presented at the International Conference on Learning Representations (ICLR) in April 2020, an important conference in machine learning.
Our paper argues that a representation of the world can be called disentangled if it has three properties: (1) informativeness, (2) separability and (3) interpretability. Representations learned from observing the world need to retain as much information as possible (informativeness), while each component of the representation needs to carry a sufficiently distinct meaning (separability). Furthermore, the representations should also be understandable by humans (interpretability). In the case of portrait photography, examples of disentangled concepts might include lighting, posture, and skin profile.
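To give a flavour of how such criteria can be phrased mathematically, here is one possible sketch in terms of mutual information I(·;·) — an illustration under our own assumptions, not necessarily the paper's exact definitions — where x is the observation, the z_i are components of the learned representation, and the y_k are ground-truth factors:

```latex
\begin{align*}
\text{Informativeness:}\quad & I(x;\, z_i) \text{ is large for each component } z_i \\
\text{Separability:}\quad & I(z_i;\, z_j) \approx 0 \quad \text{for } i \neq j \\
\text{Interpretability:}\quad & I(z_i;\, y_k) \text{ is large for exactly one factor } y_k \text{ per } z_i
\end{align*}
```

Intuitively: each component should tell us something about the picture, no two components should tell us the same thing, and each component should line up with a single human-meaningful factor such as lighting or pose.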
Robust information-theoretic metrics
Based on information theory, we expressed the three criteria as equations that let us measure disentanglement quantitatively. These metrics can be used to compare disentanglement across a variety of methods, and to benchmark those methods on real data sets on a level playing field. The metrics are also easy to compute and intuitive to visualise.
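As a toy illustration of the kind of quantity such metrics build on — a minimal sketch using a simple histogram-based mutual-information estimator, not the paper's actual metrics — we can check that a latent code which tracks a ground-truth factor (here, "lighting") shares much more information with it than two unrelated codes share with each other:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate I(X; Y) in nats from paired samples via a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                     # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)           # marginal of X
    py = pxy.sum(axis=0, keepdims=True)           # marginal of Y
    nz = pxy > 0                                  # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
lighting = rng.uniform(size=5000)                 # a ground-truth factor
z_good = lighting + 0.01 * rng.normal(size=5000)  # a code that captures lighting
z_noise = rng.uniform(size=5000)                  # an unrelated code

# Informativeness: a useful code shares information with its factor.
print(mutual_information(lighting, z_good))       # large (close to log(bins))
# Separability: two disentangled codes share little information.
print(mutual_information(z_good, z_noise))        # near zero
```

Histogram estimators are crude (they are biased for small samples), but they make the intuition concrete: disentanglement metrics reward codes that are informative about distinct factors and penalise codes that overlap.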
Title: Theory and Evaluation Metrics for Learning Disentangled Representations
Authors: Kien Do and Truyen Tran
Written by Truyen Tran & Dung Nguyen
Edited by Larissa Ham and Thomas Quinn.