A Comprehensive Survey on Graph Neural Networks
- Classification of GNNs (by update scheme)
- Relationship to graph embedding
Revisiting Semi-Supervised Learning with Graph Embeddings
Related Work
Graph-Based Semi-Supervised Learning
- Graph-based semi-supervised learning is based on the assumption that nearby nodes tend to have the same labels.
Loss:
- the first term is the standard supervised loss function
- The second term is the graph Laplacian regularization, which incurs a large penalty when similar nodes with a large $w_{ij}$ are predicted to have different labels
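A plausible form of this loss, consistent with the description above ($\ell$ the supervised loss, $f$ the predictor, $L$ the number of labeled instances, and $\lambda$ a weighting hyperparameter are assumed notation):

$$\sum_{i=1}^{L}\ell\big(f(x_i),y_i\big)+\lambda\sum_{i,j}w_{ij}\big\|f(x_i)-f(x_j)\big\|^2$$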
Learning Embeddings
A number of embedding learning methods are based on the Skipgram model, which is a variant of the softmax model
Loss:
- $C$ is the set of all possible contexts
- $w$ are parameters of the Skipgram model
- $e_i$ is the embedding of instance $i$
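A hedged sketch of the Skipgram loss in this notation (summing over the observed instance-context pairs $(i, c)$ is an assumption):

$$-\sum_{(i,c)}\log\frac{\exp\big(w_c^\top e_i\big)}{\sum_{c'\in C}\exp\big(w_{c'}^\top e_i\big)}$$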
Application:
- word2vec(NLP)
- Deepwalk(Graph)
- LINE(multiple context spaces)
Semi-Supervised Learning with Graph Embeddings
Loss:
- $\mathcal{L}_s$ is a supervised loss of predicting the labels
- $\mathcal{L}_u$ is an unsupervised loss of predicting the graph context
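Presumably the overall objective combines the two terms with a weighting hyperparameter (written here as $\lambda$, an assumed symbol):

$$\mathcal{L}=\mathcal{L}_s+\lambda\,\mathcal{L}_u$$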
Sampling Context
We formulate the unsupervised loss $\mathcal{L}_u$ as a variant of the Skipgram loss
negative sampling:
we sample $(i, c, \gamma)$ from a distribution, where $i$ and $c$ denote an instance and a context respectively; $\gamma = +1$ means $(i, c)$ is a positive pair and $\gamma = -1$ means it is a negative pair. Given $(i, c, \gamma)$, we minimize the cross-entropy loss of classifying the pair $(i, c)$ to the binary label $\gamma$:
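One likely form of that cross-entropy term, with $\sigma$ the sigmoid function:

$$\mathcal{L}_u=-\,\mathbb{E}_{(i,c,\gamma)}\log\sigma\big(\gamma\,w_c^\top e_i\big)$$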
sampling algorithm:
example:
Transductive Formulation
The two hidden layers are concatenated and fed to a softmax layer to predict the class label of the instance
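A minimal PyTorch sketch of this transductive architecture (all layer sizes, names, and the single-hidden-layer-per-branch choice are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class TransductiveClassifier(nn.Module):
    """Sketch: one hidden layer over the input features and one over the learned
    per-node embedding; the two are concatenated and fed to a softmax classifier."""
    def __init__(self, num_nodes, feat_dim, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_nodes, embed_dim)   # transductive: one embedding per node
        self.feat_hidden = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.embed_hidden = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(2 * hidden_dim, num_classes)      # softmax applied via cross-entropy loss

    def forward(self, x, node_idx):
        h_feat = self.feat_hidden(x)
        h_embed = self.embed_hidden(self.embedding(node_idx))
        return self.out(torch.cat([h_feat, h_embed], dim=-1))
```

In the inductive variant, the per-node embedding table would be replaced by a function of the input features, so embeddings can be computed for instances unseen during training.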
Inductive Formulation
Conclusion
contributions:
- in contrast to previous semi-supervised learning approaches, which largely depend on graph Laplacian regularization, we propose a novel approach that jointly trains classification and graph context prediction
- since it is difficult to generalize graph embeddings to novel instances, we design a novel inductive approach that conditions embeddings on input features
ANRL: Attributed Network Representation Learning via Deep Neural Networks
issues:
- network topology and node attributes are two heterogeneous information sources, thus it is challenging to preserve their properties in a common vector space
- the observed network data is often incomplete and even noisy, which brings more difficulties for obtaining informative representations
contributions:
- integrates network structural proximity and node attribute affinity into a low-dimensional representation space
- we design a neighbor enhancement autoencoder, which better preserves the similarity between data samples in the representation space
- we also propose an attribute-aware skip-gram model to capture structural correlations
- these two components share the connections of the encoder
Proposed Model
problem formulation
we aim to represent:
- each node $v_i \in V$ as a low-dimensional vector $y_i$
- via a mapping function $f$ that preserves not only the network structure but also node attribute proximity
neighbor enhancement autoencoder
Loss:
- $\hat{x}_i$ is the reconstruction output of decoder
- $T(v_i)$ returns the target neighbors of $v_i$
- $T(\cdot)$ incorporates prior knowledge into the model and can be instantiated with either of the following two formulations:
- Weighted Average Neighbor
- Elementwise Median Neighbor
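A likely form of the autoencoder loss using the symbols above ($n$ the number of nodes is assumed notation); the decoder reconstructs the target-neighbor aggregate of $v_i$ rather than $v_i$'s own attributes, which is what makes the autoencoder "neighbor enhanced":

$$\mathcal{L}_{ae}=\sum_{i=1}^{n}\big\|\hat{x}_i-T(v_i)\big\|_2^2$$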
attribute-aware skip-gram model
Loss:
where:
- $v_{i+j}$ is the node context in the generated random sequence
- $b$ is the window size
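A hedged reconstruction of this loss; the context probability is conditioned on the node attributes $x_i$ through the shared encoder, which is what makes the skip-gram model attribute-aware:

$$\mathcal{L}_{sg}=-\sum_{i=1}^{n}\sum_{\substack{-b\le j\le b \\ j\ne 0}}\log p\big(v_{i+j}\mid x_i\big)$$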
negative sampling:
for a specific node-context pair $(v_i,v_{i+j})$, we have the following objective:
where:
- $P_n(v) \propto d_v^{3/4}$, as suggested in [Mikolov et al., 2013b]
- where $d_v$ is the degree of node $v$
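A sketch of that per-pair objective in the usual negative-sampling form ($f(x_i)$ the encoder output, $u'$ the context vectors, and $S$ the number of negative samples are assumed notation):

$$\log\sigma\big({u'_{i+j}}^{\top}f(x_i)\big)+\sum_{s=1}^{S}\mathbb{E}_{v_s\sim P_n(v)}\Big[\log\sigma\big(-{u'_{s}}^{\top}f(x_i)\big)\Big]$$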
joint optimization
Loss:
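Presumably a weighted combination of the two components plus a regularization term (the trade-off weights $\alpha$, $\beta$ and the regularizer $\mathcal{L}_{reg}$ are assumed notation):

$$\mathcal{L}=\mathcal{L}_{sg}+\alpha\,\mathcal{L}_{ae}+\beta\,\mathcal{L}_{reg}$$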
Conclusions
- incorporates both node attributes and network structure information into the embedding
- To preserve both structural proximity and attribute affinity, we develop a neighbor enhancement autoencoder and an attribute-aware skip-gram model that exploit the complex interrelations between structural information and attributes
DeepWalk: Online Learning of Social Representations
the sparsity of a network representation is both a strength and a weakness:
- Sparsity enables the design of efficient discrete algorithms
- but can make it harder to generalize in statistical learning
We develop an algorithm (DeepWalk) that learns social representations of a graph’s vertices, by modeling a stream of short random walks
DeepWalk takes a graph as input and produces a latent representation as an output
We propose an unsupervised method which learns features that capture the graph structure independent of the labels’ distribution.
Learning Social Representations
We seek to learn social representations with the following characteristics:
- Adaptability - Real social networks are constantly evolving; new social relations should not require repeating the learning process all over again.
- Community aware - The distance between latent dimensions should represent a metric for evaluating social similarity between the corresponding members of the network. This allows generalization in networks with homophily.
- Low dimensional - When labeled data is scarce, low-dimensional models generalize better, and speed up convergence and inference.
- Continuous - We require latent representations to model partial community membership in continuous space. In addition to providing a nuanced view of community membership, a continuous representation has smooth decision boundaries between communities which allows more robust classification.
random walks
We denote a random walk rooted at vertex $v_i$ as $W_{v_i}$. It is a stochastic process with random variables $W_{v_i}^1, W_{v_i}^2, \ldots, W_{v_i}^k$
such that $W_{v_i}^{k+1}$ is a vertex chosen at random from the neighbors of $W_{v_i}^{k}$, the last vertex visited
connection: power laws
we observe that the frequency with which vertices appear in the short random walks also follows a power-law distribution
word frequency in natural language follows a similar distribution
Language Modeling
relaxations:
from: the original left-to-right language modeling objective
to: the order-independent Skip-gram objective (a hedged reconstruction of both follows this list)
- First, the order independence assumption better captures a sense of nearness that is provided by random walks.
- Moreover, this relaxation is quite useful for speeding up training, since it builds small models as one vertex is given at a time
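A plausible reconstruction of the two objectives referred to above ($\Phi$ the mapping from vertices to representations, $w$ the window size): the original left-to-right formulation estimates

$$\Pr\big(v_i \mid \Phi(v_1),\ldots,\Phi(v_{i-1})\big)$$

while the relaxed, order-independent formulation solves

$$\min_{\Phi}\;-\log\Pr\big(\{v_{i-w},\ldots,v_{i+w}\}\setminus v_i \mid \Phi(v_i)\big)$$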
Method
DeepWalk
the algorithm consists of two main components:
- a random walk generator
- an update procedure
A walk samples uniformly from the neighbors of the last vertex visited until the maximum walk length $t$ is reached
At the start of each pass we generate a random ordering to traverse the vertices. This is not strictly required, but is well-known to speed up the convergence of stochastic gradient descent.
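A minimal Python sketch of the walk generator described above (function and variable names are illustrative, not from the paper):

```python
import random

def deepwalk_walks(adj, num_walks, walk_length, seed=0):
    """Generate DeepWalk-style random walks.
    `adj` maps each vertex to a list of its neighbors (illustrative data structure)."""
    rng = random.Random(seed)
    vertices = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(vertices)  # random ordering per pass; speeds up SGD convergence
        for v in vertices:
            walk = [v]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))  # uniform sample from the last vertex's neighbors
            walks.append(walk)
    return walks

# The walks are then treated as sentences and fed to a SkipGram model
# (with hierarchical softmax, below) to learn the vertex representations.
```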
skipgram
To speed up training, Hierarchical Softmax can be used to approximate the probability distribution
hierarchical softmax
we assign the vertices to the leaves of a binary tree, so the prediction problem turns into maximizing the probability of a specific path in the tree
Huffman coding is used to reduce the access time of frequent elements in the tree
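A hedged sketch of the resulting factorization: if $b_0, b_1, \ldots, b_{\lceil\log|V|\rceil}$ is the path from the root to the leaf assigned to vertex $u_k$, the probability becomes a product of binary decisions along that path, each modeled by a logistic classifier:

$$\Pr\big(u_k\mid \Phi(v_j)\big)=\prod_{l=1}^{\lceil\log|V|\rceil}\Pr\big(b_l\mid \Phi(v_j)\big)$$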
Variants
- Streaming
- Non-random walks
LINE: Large-scale Information Network Embedding
- We propose a novel network embedding model called the “LINE,” which suits arbitrary types of information networks and easily scales to millions of nodes. It has a carefully designed objective function that preserves both the first-order and second-order proximities.
- We propose an edge-sampling algorithm for optimizing the objective. The algorithm tackles the limitation of classical stochastic gradient descent and improves both the effectiveness and the efficiency of inference.
Problem definition
First-order Proximity:
the local pairwise proximity between two vertices
Second-order Proximity:
the proximity between a pair of vertices $(u, v)$ in a network is the similarity between their neighborhood network structures
Large-Scale Information Network Embedding
LINE with First-order Proximity
define the joint probability between vertex $v_i$ and $v_j$ as follows:
empirical probability:
objective function:
Replacing d(·, ·) with KL-divergence and omitting some constants, we have:
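Likely forms of the quantities listed above ($\vec{u}_i$ the embedding of vertex $v_i$ and $W=\sum_{(i,j)\in E}w_{ij}$ are assumed notation):

- joint probability: $p_1(v_i,v_j)=\dfrac{1}{1+\exp\big(-\vec{u}_i^{\top}\vec{u}_j\big)}$
- empirical probability: $\hat{p}_1(i,j)=\dfrac{w_{ij}}{W}$
- objective: $O_1=d\big(\hat{p}_1(\cdot,\cdot),\,p_1(\cdot,\cdot)\big)$
- with KL-divergence: $O_1=-\sum_{(i,j)\in E}w_{ij}\log p_1(v_i,v_j)$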
LINE with Second-order Proximity
define the conditional probability of context $v_j$ given vertex $v_i$ as follows:
empirical probability:
objective function:
Replacing d(·, ·) with KL-divergence and omitting some constants, we have:
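Likely forms of the second-order quantities ($\vec{u}_i$ the vertex embedding, $\vec{u}'_j$ the context embedding, $d_i=\sum_{k}w_{ik}$ the out-degree, and $\lambda_i$ a per-vertex importance weight set to $d_i$ are assumed notation):

- conditional probability: $p_2(v_j\mid v_i)=\dfrac{\exp\big(\vec{u}_j'^{\top}\vec{u}_i\big)}{\sum_{k=1}^{|V|}\exp\big(\vec{u}_k'^{\top}\vec{u}_i\big)}$
- empirical probability: $\hat{p}_2(v_j\mid v_i)=\dfrac{w_{ij}}{d_i}$
- objective: $O_2=\sum_{i\in V}\lambda_i\, d\big(\hat{p}_2(\cdot\mid v_i),\,p_2(\cdot\mid v_i)\big)$
- with KL-divergence and $\lambda_i=d_i$: $O_2=-\sum_{(i,j)\in E}w_{ij}\log p_2(v_j\mid v_i)$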
Model Optimization
- negative sampling
- Optimization via Edge Sampling
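A hedged note on the two points above: negative sampling replaces the expensive softmax in $p_2$ with, for each edge $(i, j)$, an objective of the form (with $K$ negative samples, an assumed symbol)

$$\log\sigma\big(\vec{u}_j'^{\top}\vec{u}_i\big)+\sum_{n=1}^{K}\mathbb{E}_{v_n\sim P_n(v)}\Big[\log\sigma\big(-\vec{u}_n'^{\top}\vec{u}_i\big)\Big],\qquad P_n(v)\propto d_v^{3/4},$$

while edge sampling draws edges with probability proportional to their weights (e.g. via an alias table) and treats each sampled edge as binary, so that large edge weights do not blow up the gradients of classical SGD.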