Blog: Overview of Deep Learning on Graph Embeddings
Graph Learning and Geometric Deep Learning — Part 1
There are alot of ways machine learning can be applied to graphs. One of the easiest is to turn graphs into a more digestible format for ML.
Graph embedding is an approach that is used to transform nodes, edges, and their features into vector space (a lower dimension) whilst maximally preserving properties like graph structure and information. Graphs are tricky because they can vary in terms of their scale, specificity, and subject.
A molecule can be represented as a small, sparse, and static graph, whereas a social network could be represented by a large, dense, and dynamic graph. Ultimately this makes it difficult to find a silver bullet embedding method. The approachess that will be covered each vary in performance on different datasets, but they are the most widely used in Deep Learning.
Graphs are the main focus of this series, but if the 3D imaging applications are more your thing, then I recommend this fantastic article by the Gradient.
Embedding Graph Networks
If we view embedding as a transformation to a lower dimension, embedding methods themselves are not a type of neural network model. Instead,they are a type of algorithm used in graph pre-processing with the goal to turn a graph into a computationally digestible format. This is because graph type data, by nature, are discrete.
Machine learning algorithms are tuned for continuous data, hence why embedding is always to a continuous vector space.
As recent work has shown, there is a variety of ways to go about embedding graphs, each with a different level of granularity. Embeddings can be performed on the node level, the sub-graph level, or through strategies like graph walks. These are some of the most popular methods.
Deepwalk isn’t the first of it’s kind, but it is one of the first approaches that have been widely used as a benchmark in comparison with other graph learning approaches. Deepwalk belongs to the family of graph embedding techniques that uses walks, which are a concept in graph theory that enables the traversal of a graph by moving from one node to another, as long as they are connected to a common edge.
If you represent each node in a graph with an arbitrary representation vector, you can traverse the graph. The steps of that traversal could be aggregated by arranging the node representation vectors next to each other in a matrix. You could then feed that matrix representing the graph to a recurrent neural net. Basically, you can use the truncated steps of graph traversals as input for an RNN. This is analogous to the way word vectors in a sentence are put together.
The approach taken by DeepWalk is to complete a series of random walks using the equation:
The goal is to estimate the likelihood of observing node vi given all the previous nodes visited so far in the random walk, where Pr() is probability, Φ is a mapping function that represents the latent representation associated with each node v in the graph.
The latent representations is what becomes the input for a neural network. The neural network, based on what nodes and how often the nodes were encountered during the walk, can make a prediction about a node feature or classification.
The method used to make predictions is skip-gram, just like in Word2vec architecture for text. Instead of running along the text corpus, DeepWalk runs along the graph to learn an embedding. The model can take a target node to predict it’s “context”, which in the case of a graph, means it’s connectivity, structural role, and node features.
Although DeepWalk is relatively efficient with a score of O(|V|), this approach is transductive, meaning whenever a new node is added, the model must be retrained to embed and learn from the new node.
You’ve heard of Word2vec now prepare for… Node2vec (Aditya Grover et al)
One of the more popular graph learning methods, Node2vec is one of the first Deep Learning attempts to learn from graph structured data. The intuition is similar to that of DeepWalk:
If you turn each node in a graph into an embedding as you would words in sentence, a neural network can learn representations for each node.
The difference between Node2vec and DeepWalk is subtle but significant. Node2vec features a walk bias variable α, which is parameterized by p and q. The parameter p prioritizes a breadth-first-search (BFS) procedure, while the parameter q prioritizes a depth-first-search (DFS) procedure. The decision of where to walk next is therefore influenced by probabilities 1/p or 1/q.
As the visualization implies, BFS is ideal for learning local neighbors, while DFS is better for learning global variables. Node2vec can switch to and from the two priorities depending on the task. This means that given a single graph, Node2vec can return different results depending on the values of the parameters. As per DeepWalk, Node2vec also takes the latent embedding of the walks and uses them as input to a neural network to classify nodes.
Experiments demonstrated that BFS is better at classifying according to structural roles (hubs, bridges, outliers, etc.) while DFS returns a more community driven classification scheme.
Node2vec is one of the many graph learning project that have come out of Stanford’s SNAP research group dedicated to graph analytics. Many of their works have been the origin of many great strides in geometric Deep Learning.
A modification to the node2vec variant, graph2vec essentially learns to embed a graph’s sub-graphs. This is demonstrated by an equation that is used in doc2vec, a closely related variant, and a point of inspiration for this paper.
In plain english, this equation can be written as: the probability of the word (wj) appearing in context given document (d) equals the exponential of the document embedding matrix (d~) multiplied by the word embedding matrix (w~j is sampled from the document), divided by the sum of all the exponentials of the document embedding matrix multiplied by the word embedding matrix for each word in the vocab list (V) across all documents.
Using an analogy with word2vec, if a document is made of sentences (which is then made of words), then a graph is made of sub-graphs (which is then made of nodes).
These predetermined sub-graphs have a set number of edges, as specified by the user. Once again, it is the latent sub-graph embeddings that are passed into a neural network for classification.
Unlike the previous embedding techniques, SDNE does not use random walks. Instead, it tries to learn from two distinct metrics:
- First-order proximity: two nodes are considered similar if they share an edge (pairwise similarity)
- Second-order proximity: two nodes are considered similar if they share many neighboring/adjacent nodes
The ultimate goal is to capture highly non-linear structures. This is acheived by using deep autoencoders (semi-supervised) to preserve the first order (supervised) and second order (unsupervised) network proximities.
To preserve first order proximity, the model also a variation of Laplacian Eigenmaps, a graph embedding/dimensionality reduction technique. The Laplacian Eigenmap embedding algorithm applies a penalty when similar nodes are mapped far from each other in the embedded space, thus allowing for optimization by minimizing the space between similar nodes.
The second order proximity is preserved by passing the adjacency matrix of te graph through an unsupervised autoencoder which has a built in reconstruction loss function it must minimize.
Together, the first order proximity loss function and the second order reconstruction loss function are jointly minimized to return a graph embedding. The embedding is than learned from by a neural network.
LINE (Jian Tang et al) explicitly defines two functions; one for first order proximity and another for second order proximity. In the experiments conducted by the original research, second order proximity performed significantly better than first, and it was implied that including higher orders may level off the improvements in accuracy.
The goal of LINE is to minimize the difference between the input and embedding distributions. This is achieved using KL divergence:
The visualizations are simple, the math less so. Aurélien Géron has a great video on the subject. On another note, Géron is also one of the few researchers in graph learning, and has combined knowledge graphs and Deep Learning to improve video recommendations while working at YouTube.
LINE defines two joint probability distributions for each pair of nodes then minimizes the KL divergence of the distributions. The two distributions are the adjacency matrix and the dot product of node embedding. KL Divergence is an important similarity metric in information theory and entropy. The algorithm is used in probabilistic generative models like Variational Autoencoders, which embed inputs of an autoencoder into a latent space, which becomes the distribution.
Since the algorithm has to define new functions for each increasing order of proximity, LINE doesn’t perform very well if the application needs an understanding of node community structure.
Nevertheless, the simplicity and effectiveness of LINE are just a couple reasons why it was the most cited paper on WWW of 2015. This work helped inspire interest in Graph Learning as a niche in Machine Learning and eventually Deep Learning in specific.
HARP is an improvement to the previously mentioned embedding/walking based models. Previous models risked getting stuck in local optima since their objective functions are non-convex. Basically this means, the ball can’t roll to the absolute bottom of the hill.
Therefore, the intended motive:
Improve the solution and avoid local optima by better weight inizialization.
and the proposed method:
Use graph coarsening to aggregate related nodes into “supernodes”
HARP is essentially a graph-preprocessing step that simplifies the graph to make for faster training.
After coarsening the graph, it then generates an embedding of the coarsest “supernode”, followed by an embedding of the entire graph (which itself is made of supernodes).
This strategy is followed for each “supernode” in the entire graph.
Since HARP can be used in conjunction with previous embedding algorithms like LINE, Node2vec and DeepWalk. The original paper reported marked improvements of up to 14% in classification tasks when combining HARP with various graph embedding methods: a significant leap forward.
I’ve definitely missed a bunch of algorithms and models, especially since the recent explosion of interest in Geometric Deep Learning and Graph Learning has led to new contributions popping up in publications almost daily.
In any case, Graph embedding methods are a simple but very effective method of transforming graphs into the optimal format for a machine learning task. Due to their simplicity, they are often quite scalable (at least compared to their convolutional counterparts), and are easy to implement. They can be applied to most networks and graphs without sacrificing performance or efficiency. But can we do better?
Next up is a dive into the complex and elegant world of Graph Convolutions!
- Graph embedding techniques take graphs and embed them in a lower dimensional continuous latent space before passing that representation through a machine learning model.
- Walk embedding methods perform graph traversals with the goal of preserving structure and features and aggregates these traversals which can then be passed through a recurrent neural network.
- Proximity embedding methods use Deep Learning methods and/or proximity loss functions to optimize proximity, such that nodes that are close together in the original graph are likewise in the embedding.
- Other approaches use methods like graph coarsening to simplify the graph before applying an embedding technique on the graph, reducing complexity while preserving structure and information.
Need to see more content like this?
I’m always looking to meet new people, collaborate, or learn something new so feel free to reach out to email@example.com
Upwards and onwards, always and only 🚀