DeepWalk Algorithm

LucaSantagata

DeepWalk

DeepWalk is a graph embedding method that uses a language model to learn latent representations of the vertices in a network. It has two main components: first, it uses truncated random walks to capture the local structure of the graph; second, it uses the SkipGram algorithm to train the embedding parameters.

Node embeddings

The general idea of node embedding is to map the nodes of a graph into an embedding space whose dimension is much lower than the number of nodes. The embedding space preserves some of the graph's properties: the similarity of a pair of nodes in the graph corresponds to the similarity of their vectors in the embedding space. For example, if two nodes are connected in the graph, their embeddings will be close together.
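
As a toy illustration (the vectors below are hand-picked for the example, not learned), a node embedding is just a lookup table from node IDs to low-dimensional vectors, and similarity in the embedding space can be measured with, for example, cosine similarity:

import numpy as np

# Hypothetical 2-D embeddings for three nodes (made-up values)
embedding = {
    0: np.array([0.9, 0.1]),   # nodes 0 and 1 are connected in the graph,
    1: np.array([0.8, 0.2]),   # so their vectors should be close
    2: np.array([-0.7, 0.6]),  # node 2 sits in a different part of the graph
}

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embedding[0], embedding[1]))  # high: similar nodes
print(cosine_similarity(embedding[0], embedding[2]))  # low: dissimilar nodes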

DeepWalk Algorithm

Let G = (V, E) be a graph, where V is the set of vertices, E is the set of edges, and E ⊆ (V × V).

The algorithm consists of two main components: a random walk generator and an update procedure based on SkipGram.

A random walk on a graph is a process that begins at some vertex and, at each time step, moves randomly to a neighboring vertex. In the DeepWalk algorithm, we generate γ random walks starting from each vertex, each of length t. For example, for vertex 0 we can generate a random walk of length t = 5: the walk might be 0 → 1 → 2 → 7 → 3 → 1 or 0 → 4 → 6 → 4 → 10 → 5, and so on.
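
A minimal sketch of this walk generator, assuming the graph is stored as an adjacency-list dict (vertex → list of neighbors); the NetworkX-based version used in the implementation below follows the same pattern:

import random

def generate_walks(graph, gamma, t):
    """Generate gamma truncated random walks of length t from every vertex."""
    walks = []
    for _ in range(gamma):            # gamma passes over the vertex set
        vertices = list(graph)
        random.shuffle(vertices)      # visit the vertices in random order each pass
        for v in vertices:
            walk = [v]
            for _ in range(t):
                # assumes every vertex has at least one neighbor
                walk.append(random.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks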

Thus, we have sampled random walks from the original graph. After that, the authors use SkipGram for node embedding. SkipGram is a language model that maximizes the co-occurrence probability among the words that appear within a window w in a sentence; here, each walk plays the role of a sentence and each vertex plays the role of a word.
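
For intuition, the following sketch shows the (center, context) pairs that SkipGram would train on for one of the walks above with window size w = 2 (illustrative only; gensim constructs these pairs internally):

# One sampled walk, with nodes stored as string tokens
walk = ['0', '4', '6', '4', '10', '5']
w = 2
for j, center in enumerate(walk):
    # context = up to w nodes on each side of the center node
    context = walk[max(0, j - w): j] + walk[j + 1: j + 1 + w]
    print(center, '->', context)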

For each vertex v_j in a walk, the algorithm maximizes the probability of its neighbors within the window. The model could learn such a posterior distribution using several choices of classifier; however, computing Pr(u_k | Φ(v_j)) directly is not feasible when the number of vertices is large, since it requires normalizing over all of V. The authors therefore assign the vertices to the leaves of a binary tree (hierarchical softmax), which turns the prediction problem into maximizing the probability of a specific root-to-leaf path in the tree.
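
For reference, the objective DeepWalk optimizes and the hierarchical-softmax factorization can be written as follows (following the original paper, up to notation):

\min_{\Phi} \; -\log \Pr\big(\{v_{j-w}, \dots, v_{j+w}\} \setminus \{v_j\} \mid \Phi(v_j)\big)

\Pr\big(u_k \mid \Phi(v_j)\big) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr\big(b_l \mid \Phi(v_j)\big)

where b_1, ..., b_{\lceil \log |V| \rceil} are the nodes on the path from the root of the binary tree to the leaf u_k, so evaluating each prediction costs O(log |V|) instead of O(|V|).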

DeepWalk implementation

Below, a simple version of the DeepWalk algorithm is implemented with NetworkX and Gensim's Word2Vec and applied to Zachary's Karate Club dataset.

from gensim.models.word2vec import Word2Vec
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random.seed(238842)
np.random.seed(238842)  # the walks below use np.random.choice, so seed NumPy too

def random_walk(start, length):
    """Generate a single random walk of `length` steps from a start node."""
    walk = [str(start)]  # starting node

    for _ in range(length):
        # Move to a neighbor of the current node, chosen uniformly at random
        neighbors = list(G.neighbors(start))
        next_node = np.random.choice(neighbors, 1)[0]
        walk.append(str(next_node))
        start = next_node

    return walk
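
Note that each walk is built as a list of strings because gensim's Word2Vec expects its training sentences to be sequences of string tokens.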
# Load dataset
G = nx.karate_club_graph()

# Process labels (Mr. Hi = 0, Officer = 1)
labels = []
for node in G.nodes:
    label = G.nodes[node]['club']
    labels.append(1 if label == 'Officer' else 0)

# Plot graph
plt.figure(figsize=(12,12))
plt.axis('off')
nx.draw_networkx(G,
                 pos=nx.spring_layout(G, seed=0),
                 node_color=labels,
                 node_size=800,
                 cmap='coolwarm',
                 font_size=14,
                 font_color='white'
                 )
plt.title("Zachary's Karate Club")
plt.show()
# Create a list of random walks: 80 walks of length 10 for every node in the graph
walks = []
for node in G.nodes:
    for _ in range(80):
        walks.append(random_walk(node, 10))
# Create Word2Vec
model = Word2Vec(vector_size=100,
                 window=10,
                 hs=1,        # hierarchical softmax, as in the DeepWalk paper
                 negative=0,  # disable negative sampling so only hs is used
                 sg=1,        # skip-gram
                 workers=1,
                 seed=1)

# Build vocabulary
model.build_vocab(walks)

# Train model
model.train(walks, total_examples=model.corpus_count, epochs=30, report_delay=1)

print(f'Shape of embedding matrix: {model.wv.vectors.shape}')
Shape of embedding matrix: (34, 100)
# Most similar nodes
print('Nodes that are the most similar to node 0:')
for similarity in model.wv.most_similar(positive=['0']):
    print(f'   {similarity}')

# Similarity between two nodes
print(f"\nSimilarity between node 0 and 4: {model.wv.similarity('0', '4')}")
Nodes that are the most similar to node 0:
   ('10', 0.705403745174408)
   ('11', 0.6672751307487488)
   ('4', 0.6428819298744202)
   ('6', 0.6198117733001709)
   ('21', 0.5753284096717834)
   ('5', 0.5725014209747314)
   ('12', 0.5587387681007385)
   ('17', 0.5484284162521362)
   ('1', 0.5292837023735046)
   ('16', 0.5212982296943665)

Similarity between node 0 and 4: 0.6428819298744202
# Preprocess word vectors and label
nodes_wv = np.array([model.wv.get_vector(str(i)) for i in range(len(model.wv))])
labels = np.array(labels)

# Train TSNE
tsne = TSNE(n_components=2,
            learning_rate='auto',
            init='pca',
            random_state=0).fit_transform(nodes_wv)

# Plot TSNE
plt.figure(figsize=(6, 6))
plt.scatter(tsne[:, 0], tsne[:, 1], s=100, c=labels, cmap="coolwarm", edgecolor="black")
plt.title("t-SNE plot of the nodes")
plt.show()

The plot is quite encouraging, since we can see a fairly clear line separating the two classes. It should be possible for a simple ML algorithm (such as a random forest classifier) to classify these nodes.

# Create masks to train and test the model
train_mask = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
test_mask = [0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33]

# Train classifier
clf = RandomForestClassifier(random_state=0)
clf.fit(nodes_wv[train_mask], labels[train_mask])

# Evaluate accuracy
y_pred = clf.predict(nodes_wv[test_mask])
acc = accuracy_score(labels[test_mask], y_pred)
print(f'Accuracy = {acc*100:.2f}%')
Accuracy = 95.45%

Summarizing, DeepWalk is a method for graph embedding. It uses truncated random walks to sample local neighborhoods from the graph and SkipGram to learn node embeddings, which can then be used in downstream tasks such as node classification.