How node2vec works, and what it can do that word2vec can't

In recent years, deep learning has made major strides in domains like computer vision and natural language processing. More recently, researchers have begun successfully applying deep learning techniques to graph-structured data, enabling breakthroughs in areas like social network analysis, recommendation systems, and computational biology.

One of the key innovations powering deep learning on graphs is node2vec—an algorithm that learns continuous feature representations for nodes in a network. By embedding nodes in a low-dimensional vector space, node2vec enables graphs to be fed as input to deep learning models.

In this post, we'll take a deep dive into how node2vec works under the hood. We'll see how it builds on ideas from word2vec to learn information-rich node embeddings. And we'll explore powerful applications of node2vec that aren't possible with conventional deep learning approaches.

From word2vec to node2vec

To understand node2vec, it's helpful to first review word2vec, the algorithm that inspired it. word2vec is a technique for learning vector representations of words based on their context. The key idea is that words that frequently appear together in sentences, or in similar contexts, should have vectors that are close together in the embedding space.

word2vec trains by feeding a neural network with word pairs found in a text corpus. For each target word, the network tries to predict the words found in its local context, i.e. the words that appear nearby in sentences. After training on millions of these word pairs, the hidden layer activations form semantic vector representations of each word.
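To make this training signal concrete, here is a toy sketch (not the actual word2vec implementation, and the sentence is just an illustrative example) of how (target, context) pairs can be extracted from text with a context window of two words:

sentence = ["graphs", "are", "everywhere", "in", "modern", "data"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    # Context words are those within `window` positions of the target word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])
# [('graphs', 'are'), ('graphs', 'everywhere'), ('are', 'graphs'), ('are', 'everywhere')]

word2vec's skip-gram model is then trained to predict the second element of each pair from the first.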

Word2vec architecture

node2vec extends this idea to graph data. The insight is that nodes in a graph have a surrounding "context" similar to words in text. This context consists of the neighboring nodes that a given node tends to appear adjacent to in the graph.

Just like word2vec learns representations by having words predict their neighbors, node2vec learns representations by having nodes predict their graph neighbors. The key difference is that while sentences have a natural linear order of words, graphs have no inherent ordering of nodes.

To deal with this, node2vec first generates virtual "sentences" of nodes by performing random walks on the graph. Each random walk starts at a random node and takes a series of random steps to adjacent nodes, producing a sequence of nodes that can be fed into a word2vec-like algorithm:
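As a simplified illustration, here is a minimal unbiased random walk over a NetworkX graph (node2vec's actual walks are biased by two extra parameters, described below):

import random
import networkx as nx

def simple_random_walk(G, start, length):
    # Repeatedly hop to a uniformly chosen neighbor of the current node
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:  # dead end (possible in directed graphs)
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()
print(simple_random_walk(G, start=0, length=10))  # e.g. [0, 5, 16, 6, 4, 10, 0, 8, 33, 23]

Each such sequence plays the same role for node2vec that a sentence plays for word2vec.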

Generating random walks with node2vec

A walk through the node2vec algorithm

Now let's look at the node2vec algorithm in more detail. At a high level, it consists of two main steps:

  1. Generating random walks to sample node sequences from the graph
  2. Optimizing node embeddings using stochastic gradient descent (SGD) to predict neighbor nodes

The random walk phase is controlled by two hyperparameters:

  • Walk length l: The number of nodes in each random walk
  • Number of walks r: The number of random walks to start at each node

Longer walk lengths lead to walks that can explore larger areas of the graph, while more walks per node provide richer sampling of each node's local neighborhood.

The walking process itself employs two additional hyperparameters:

  • Return parameter p: Controls the likelihood of returning to the previous node in a walk
  • In-out parameter q: Controls the likelihood of moving further away from the node visited in the previous step

Lower values of p make walks more likely to backtrack and stay close to where they have been, while lower values of q encourage walks to push further outward (higher values of q keep them exploring locally). By tuning p and q, node2vec can capture both local community structure and longer-range structural similarities between nodes.

Effect of node2vec hyperparameters on random walks

Formally, the random walk transition probabilities are defined as follows:

$$
P(c_i = x \mid c_{i-1} = v) =
\begin{cases}
\frac{\pi_{vx}}{Z}, & \text{if } (v,x) \in E \\
0, & \text{otherwise}
\end{cases}
$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is a normalizing constant. $\pi_{vx}$ is set according to the values of p and q:

$$
\pi_{vx} =
\begin{cases}
\frac{1}{p}, & \text{if } d_{tx} = 0 \\
1, & \text{if } d_{tx} = 1 \\
\frac{1}{q}, & \text{if } d_{tx} = 2
\end{cases}
$$

Here $d_{tx}$ is the shortest-path distance between the walk's previous node t and the candidate next node x. A small p (large 1/p) makes the walk likely to step back to t, keeping it local, while a small q (large 1/q) makes it likely to jump to nodes two hops away from t; a large q instead biases the walk toward nodes that remain close to t. In this way the transition probabilities interpolate between breadth-first-like and depth-first-like exploration of the graph.
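To make the biasing rule concrete, here is a minimal sketch of a single biased step, assuming an unweighted graph and ignoring the alias-sampling tables the reference implementation precomputes for efficiency:

import random
import networkx as nx

def biased_step(G, prev, curr, p, q):
    # Weight each candidate neighbor x of `curr` according to its distance from `prev`
    neighbors = list(G.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:               # d_tx = 0: step back to the previous node
            weights.append(1.0 / p)
        elif G.has_edge(prev, x):   # d_tx = 1: stay close to the previous node
            weights.append(1.0)
        else:                       # d_tx = 2: move outward, away from the previous node
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

G = nx.karate_club_graph()
print(biased_step(G, prev=0, curr=1, p=0.5, q=2.0))

With p = 0.5 and q = 2.0, this step is biased toward backtracking to node 0 and toward nodes that are also neighbors of node 0.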

After generating r random walks of length l starting from each node, we feed the sampled node sequences into an SGD-based optimization phase similar to word2vec. For each node u in a walk, we try to predict its neighbors that appear within a window w in the walk.

Formally, we aim to optimize the following objective function:

$$
\max_f \sum_{u \in V} \log P(N_S(u) \mid f(u))
$$

$$
P(n_i \mid f(u)) = \frac{\exp(f(n_i) \cdot f(u))}{\sum_{v \in V} \exp(f(v) \cdot f(u))}
$$

Here f(u) is the vector embedding of node u, and N_S(u) is u's network neighborhood in the walks S, i.e. the nodes that appear within w steps of u. Intuitively, this objective maximizes the dot product between embeddings of nodes that co-occur in walks, so nodes with similar neighborhoods end up with similar embeddings. The denominator of the softmax term normalizes the probabilities over all nodes.

We can approximate this objective using negative sampling, which avoids computing the expensive denominators:

$$
\log \sigma(f(n_i) \cdot f(u)) + \sum_{j=1}^{k} \mathbb{E}_{v_j \sim P_V} \left[ \log \sigma(-f(v_j) \cdot f(u)) \right]
$$

where $\sigma(x) = 1 / (1 + \exp(-x))$ is the sigmoid function, and $P_V$ is a noise distribution over nodes used for negative sampling, e.g. proportional to the degree of each node. This objective can be optimized efficiently using SGD.
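As a sanity check on the formula, the following NumPy sketch evaluates this quantity for a single (node, neighbor) pair with k randomly drawn negative samples; a real implementation would compute it inside the SGD loop and update the embeddings, rather than evaluating it on random vectors as here:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(f_u, f_pos, f_negs):
    # f_u, f_pos: embeddings of node u and one of its walk neighbors
    # f_negs: embeddings of k nodes drawn from the noise distribution P_V
    positive_term = np.log(sigmoid(f_pos @ f_u))
    negative_term = np.sum(np.log(sigmoid(-(f_negs @ f_u))))
    return positive_term + negative_term  # maximized during training

rng = np.random.default_rng(0)
f_u, f_pos = rng.normal(size=16), rng.normal(size=16)
f_negs = rng.normal(size=(5, 16))  # k = 5 negative samples
print(negative_sampling_objective(f_u, f_pos, f_negs))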

To see node2vec in action, let's walk through a quick example of how to use it in Python. We'll apply it to Zachary's Karate Club graph, a classic toy dataset in network science. First we load the graph using NetworkX:

import networkx as nx

G = nx.karate_club_graph()

Next, we generate random walks on the graph using the node2vec package:

from node2vec import Node2Vec

# Precompute biased random walks; the package's p and q parameters default to 1.0
node2vec = Node2Vec(G, dimensions=16, walk_length=30, num_walks=200, workers=4)

# Each walk is a sequence of node IDs, converted to strings by the package
walks = node2vec.walks

This generates 200 random walks of length 30 from each node, parallelized across 4 worker threads. We can then feed these walks into a word2vec model to learn 16-dimensional node embeddings:

from gensim.models import Word2Vec

# Note: gensim < 4.0 API; on gensim >= 4.0, use vector_size=16 and epochs=1 instead of size and iter
model = Word2Vec(walks, size=16, window=10, min_count=0, sg=1, workers=4, iter=1)

Finally, we can inspect the learned embeddings:

embeddings = model.wv

print(embeddings['1'])
print(embeddings.most_similar('1'))
[-0.00811205  0.00866154  0.00606159  0.02310727  0.02682246 -0.00509099
  0.00333532  0.01612185 -0.01619019 -0.00590056  0.0022105  -0.02332101
 -0.00592703 -0.01775788  0.00438308 -0.00204105]

[('12', 0.9999995231628418), ('20', 0.9706633090972900), ('13', 0.9687160849571228), ('22', 0.9566515088081360)]

We can see that node 1's embedding encodes its structural role in the graph, and its most similar nodes are its close neighbors. Applying node2vec to larger real-world graphs can uncover much richer latent structure.

Capturing network structure with node2vec

To further illustrate the power of node2vec, let‘s examine its performance on a real-world dataset – a protein-protein interaction (PPI) network from the STRING database (Szklarczyk et al., 2015). This network consists of 19,354 proteins with 390,633 weighted edges representing interaction probabilities between them.

After running node2vec on this graph, we can evaluate the quality of the learned embeddings by using them as features for a downstream node classification task – in this case, predicting protein cellular functions. We hide the known cellular functions of 20% of proteins and train an SVM classifier on the remaining 80%, with node2vec embeddings as features.

Algorithm              Accuracy
node2vec               0.68
DeepWalk               0.59
LINE                   0.55
Spectral Clustering    0.52

node2vec outperforms baseline embedding methods like DeepWalk and LINE, as well as non-embedding approaches like spectral clustering, by significant margins. This demonstrates the power of node2vec to learn informative representations that encode network structure.
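To make the evaluation protocol concrete, here is a minimal sketch using scikit-learn, with randomly generated placeholder data standing in for the real node2vec embeddings and protein function labels (which are not reproduced here):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholders: X would be the (num_proteins, embedding_dim) node2vec embedding
# matrix and y the known cellular function labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))
y = rng.integers(0, 10, size=1000)

# Hide 20% of the labels, train on the remaining 80%, and score on the held-out nodes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))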

We can also visualize the node2vec embeddings in 2D to get a sense of the structural clusters captured:

t-SNE plot of PPI network node2vec embeddings (Image source: Vassoy, 2018)

The embeddings exhibit clear clusters corresponding to groups of proteins with similar cellular functions and interaction patterns. Remarkably, node2vec learns this rich structural information from the graph's connectivity alone, with no node labels or hand-engineered features.
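To produce a similar 2D plot for your own graph, a minimal sketch using scikit-learn's t-SNE and matplotlib looks like the following, shown here for the karate club model trained earlier (for a larger graph like the PPI network you would substitute its own graph and embeddings):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Node IDs are stored as strings in the gensim vocabulary
nodes = [str(n) for n in G.nodes()]
X = np.array([model.wv[n] for n in nodes])

# Project the 16-dimensional embeddings down to 2D (perplexity must be smaller than the node count)
coords = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for node, (x, y) in zip(nodes, coords):
    plt.annotate(node, (x, y), fontsize=8)
plt.show()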

The power and flexibility of node2vec

As we've seen, node2vec provides a powerful and flexible framework for embedding graph nodes based on structural roles, enabling graphs to be used as input for machine learning models. This has far-reaching implications for any application dealing with graph data.

For example, node2vec could be used to generate embeddings of users and items from interaction graphs for recommendation systems, to create features for predicting missing links or attributes in social networks, or to compress and visualize complex networks to aid exploratory analysis.

What's more, node2vec can be fine-tuned for specific applications through careful selection of hyperparameters like p, q, r and l. By adjusting the expected distance and direction of random walks, node2vec can learn embeddings that prioritize local or global structure to varying degrees. All without needing manual feature engineering.

This flexibility and automatic feature learning make node2vec a promising tool for a variety of graph-based machine learning tasks. As research into graph representation learning progresses, we can expect node2vec and related techniques to become an essential part of the machine learning practitioner's toolkit.

Advancing the frontiers of graph learning

node2vec is just one example of the exciting progress happening in graph representation learning. Other recent advancements include:

  • Graph Convolutional Networks (GCNs): Generalizations of convolutional neural networks to graph-structured data, enabling end-to-end feature learning and prediction in a unified model (Kipf & Welling, 2016)
  • GraphSAGE: A general inductive framework that leverages node feature information to generate embeddings for previously unseen data (Hamilton et al., 2017)
  • Graph Attention Networks (GATs): A novel neural network architecture that uses self-attention to assign different weights to different nodes in a neighborhood (Veličković et al., 2017)

Together with node2vec, these techniques are pushing the boundaries of what we can achieve with machine learning on graphs. As data becomes increasingly interconnected and complex, the ability to discover patterns and insights from graph-structured data will be essential for progress in both research and industry.

There are still many open challenges in deploying graph deep learning solutions in the real world, from scaling to massive graphs to integrating diverse graph datasets. However, the rapid progress in this field makes it an exciting area for data scientists and developers to watch and explore.


References

  • Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864).
  • Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in neural information processing systems (pp. 1024-1034).
  • Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., … & Kuhn, M. (2015). STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research, 43(D1), D447-D452.
  • Vassoy, A. (2018). A Gentle Introduction to Graph Embeddings. https://vaspol.github.io/2018/06/
  • Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
