## A Comprehensive Survey on Graph Neural Networks

- GNN taxonomy (categorized by update mechanism)

- relationship to graph embedding

## Revisiting Semi-Supervised Learning with Graph Embeddings

### Related Work

#### Graph-Based Semi-Supervised Learning

- Graph-based semi-supervised learning is based on the assumption that nearby nodes tend to have the same labels.

**Loss:**

- the first term is the standard supervised loss function
- The second term is the graph Laplacian regularization, which incurs a large penalty when similar nodes with a large $w_{ij}$ are predicted to have different labels
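
The loss equation itself is missing from these notes; reconstructed from the two terms described above (with $l$ labeled instances, a supervised loss $\ell$, predictor $f$, and trade-off weight $\lambda$ as assumed notation), it takes the form:

$$\mathcal{L} = \sum_{i=1}^{l} \ell\big(f(x_i), y_i\big) + \lambda \sum_{i,j} w_{ij} \left\| f(x_i) - f(x_j) \right\|^2$$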

### Learning Embeddings

A number of embedding learning methods are based on the **Skipgram** model, which is a variant of the softmax model.

**Loss:**

- $C$ is the set of all possible contexts
- $w$ are parameters of the Skipgram model
- $e_i$ is the embedding of instance $i$
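
The equation is omitted above; a reconstruction consistent with these term definitions (summing over observed instance-context pairs $(i, c)$) is:

$$\mathcal{L} = -\sum_{(i,c)} \log p(c \mid i), \qquad p(c \mid i) = \frac{\exp(w_c^{\top} e_i)}{\sum_{c' \in C} \exp(w_{c'}^{\top} e_i)}$$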

**Application:**

- word2vec (NLP)
- DeepWalk (graphs)
- LINE (multiple context spaces)

### Semi-Supervised Learning with Graph Embeddings

**Loss:**

- $\mathcal{L}_s$ is a supervised loss of predicting the labels
- $\mathcal{L}_u$ is an unsupervised loss of predicting the graph context
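
Written out (with $\lambda$ the balancing hyperparameter, per the paper), the joint objective is:

$$\mathcal{L} = \mathcal{L}_s + \lambda \, \mathcal{L}_u$$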

#### Sampling Context

We formulate the unsupervised loss $\mathcal{L}_u$ as a variant of the Skipgram loss.

**negative sampling:**

we are sampling $(i, c, \gamma)$ from a distribution, where $i$ and $c$ denote instance and context respectively; $\gamma = +1$ means $(i, c)$ is a positive pair and $\gamma = -1$ means it is a negative pair. Given $(i, c, \gamma)$, we minimize the cross-entropy loss of classifying the pair $(i, c)$ to the binary label $\gamma$:
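
Reconstructed in the paper's notation, with $\sigma(x) = 1/(1 + e^{-x})$, this loss reads:

$$\mathcal{L}_u = -\,\mathbb{E}_{(i,c,\gamma)} \log \sigma\big(\gamma \, w_c^{\top} e_i\big)$$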

**sampling algorithm:**

**example:**
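
A minimal Python sketch of such a sampling procedure, with a usage example (simplified: function and parameter names are mine, and the paper's label-based context sampling is omitted; only random-walk context is shown):

```python
import random

def sample_triple(graph, nodes, walk_len=10, window=3, neg_prob=0.5):
    """Draw one (i, c, gamma) training triple.

    graph: dict mapping node -> list of neighbor nodes.
    Positive pairs (gamma = +1) co-occur within `window` steps of a
    random walk; negatives (gamma = -1) pair i with a uniformly
    sampled node.
    """
    # random walk rooted at a randomly chosen node
    walk = [random.choice(nodes)]
    for _ in range(walk_len - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))

    # pick an instance position and a nearby context position
    pos = random.randrange(len(walk))
    lo, hi = max(0, pos - window), min(len(walk) - 1, pos + window)
    i, c = walk[pos], walk[random.randint(lo, hi)]

    if random.random() < neg_prob:
        # corrupt the context uniformly to form a negative pair
        return i, random.choice(nodes), -1
    return i, c, +1

# e.g. sample_triple({0: [1, 2], 1: [0], 2: [0]}, [0, 1, 2])
# might return (0, 2, +1) or (2, 1, -1)
```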

#### Transductive Formulation

The two hidden layers (one computed from the input features, one from the embedding) are concatenated and fed to a softmax layer to predict the class label of the instance
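
A minimal numpy sketch of this transductive architecture (layer sizes, names, and the ReLU choice are assumptions, not the paper's specification):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(x, e, W_x, W_e, W_out):
    """x: input feature vector; e: learned embedding of the same instance.
    W_x, W_e project features/embedding to hidden layers;
    W_out maps the concatenation to class logits."""
    h_x = np.maximum(0, W_x @ x)    # hidden layer from features
    h_e = np.maximum(0, W_e @ e)    # hidden layer from embedding
    h = np.concatenate([h_x, h_e])  # concatenated representation
    return softmax(W_out @ h)       # class probabilities
```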

#### Inductive Formulation

To handle instances unseen during training, the embedding is parameterized as a function of the input features (computed by the hidden layers), so it can be evaluated for novel instances at test time.

### Conclusion

**contributions:**

- in contrast to previous semi-supervised learning approaches that largely depend on graph Laplacian regularization, we propose a novel approach that jointly trains classification and graph context prediction
- since it is difficult to generalize graph embeddings to novel instances, we design a novel inductive approach that conditions embeddings on input features

## ANRL: Attributed Network Representation Learning via Deep Neural Networks

**issues:**

- network topology and node attributes are two heterogeneous information sources, thus it is challenging to preserve their properties in a common vector space
- the observed network data is often incomplete and even noisy, which brings more difficulties for obtaining informative representations

**contributions:**

- integrates network structural proximity and node attributes affinity into low-dimensional representation spaces
- we design a neighbor enhancement autoencoder, which better preserves similarity between data samples in the representation space
- we also propose an attribute-aware skip-gram model to capture structural correlations
- These two components share connections to the encoder

### Proposed Model

#### problem formulation

we aim to represent:

- each node $i ∈ V$ as a low-dimensional vector $y_i$
- a mapping function $f$ that preserves not only network structure but also node attribute proximity

#### neighbor enhancement autoencoder

**Loss:**

- $\hat{x}_i$ is the reconstruction output of the decoder
- $T(v_i)$ returns the target neighbors of $v_i$
- $T(·)$ incorporates prior knowledge into the model and can be instantiated in the following two ways:
    - Weighted Average Neighbor
    - Elementwise Median Neighbor
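
The loss equation is missing here; reconstructed from the term definitions above (with $n$ nodes and $\mathcal{N}(i)$ the neighbors of $v_i$), the autoencoder reconstructs the neighbor target rather than the input itself:

$$\mathcal{L}_{ae} = \sum_{i=1}^{n} \left\| \hat{x}_i - T(v_i) \right\|_2^2, \qquad T(v_i) = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} w_{ij} \, x_j \ \ \text{(weighted average neighbor)}$$

The elementwise median variant instead takes the per-dimension median of the neighbors' attribute vectors.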

#### attribute-aware skip-gram model

**Loss:**

where:

- $v_{i+j}$ is a context node in the generated random-walk sequence
- $b$ is the window size
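
Reconstructed from these definitions, the loss conditions context prediction on the node attributes $x_i$ (rather than on a free per-node embedding):

$$\mathcal{L}_{sg} = -\sum_{i=1}^{n} \sum_{\substack{-b \le j \le b \\ j \ne 0}} \log p\big(v_{i+j} \mid x_i\big)$$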

**negative sampling:**

for a specific node-context pair $(v_i,v_{i+j})$, we have the following objective:

where:

- $P_n(v) ∝ d_v^{3/4}$, as suggested in [Mikolov et al., 2013b]
- $d_v$ is the degree of node $v$
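
Reconstructed from the paper (with $f(x_i)$ the encoder output, $u$ the context vectors, and $S$ the number of negative samples; the symbol $S$ is mine), the per-pair objective to maximize is:

$$\log \sigma\big(u_{i+j}^{\top} f(x_i)\big) + \sum_{s=1}^{S} \mathbb{E}_{v_s \sim P_n(v)} \Big[ \log \sigma\big(-u_s^{\top} f(x_i)\big) \Big]$$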

#### joint optimization

**Loss:**
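
The joint loss is omitted in these notes; in the paper it combines the two components with trade-off hyperparameters (written here as $\alpha$, $\beta$) and an L2 weight regularizer $\mathcal{L}_{reg}$:

$$\mathcal{L} = \mathcal{L}_{sg} + \alpha \, \mathcal{L}_{ae} + \beta \, \mathcal{L}_{reg}$$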

### Conclusions

- incorporates both node attributes and network structure information into the embedding
- to further preserve structural proximity and attribute affinity, we develop a neighbor enhancement autoencoder and an attribute-aware skip-gram model to exploit the complex interrelations between structural information and node attributes

## DeepWalk: Online Learning of Social Representations

the sparsity of a network representation is both a strength and a weakness:

- Sparsity enables the design of efficient discrete algorithms
- but can make it harder to generalize in statistical learning

We develop an algorithm (DeepWalk) that learns social representations of a graph’s vertices, by modeling a stream of short random walks

DeepWalk takes a graph as input and produces a latent representation as an output

We propose an unsupervised method which learns features that capture the graph structure independent of the labels’ distribution.

### Learning Social Representations

We seek to learn social representations with the following characteristics:

- **Adaptability** - Real social networks are constantly evolving; new social relations should not require repeating the learning process all over again.
- **Community aware** - The distance between latent dimensions should represent a metric for evaluating social similarity between the corresponding members of the network. This allows generalization in networks with homophily.
- **Low dimensional** - When labeled data is scarce, low-dimensional models generalize better and speed up convergence and inference.
- **Continuous** - We require latent representations to model partial community membership in continuous space. In addition to providing a nuanced view of community membership, a continuous representation has smooth decision boundaries between communities, which allows more robust classification.

#### random walks

We denote a random walk rooted at vertex $v_i$ as $W_{v_i}$. It is a stochastic process with random variables $W_{v_i}^1, W_{v_i}^2, \ldots, W_{v_i}^k$

such that $W_{v_i}^{k+1}$ is a vertex chosen uniformly at random from the neighbors of $W_{v_i}^k$, the last vertex visited.
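
A minimal Python sketch of such a walk generator (names are mine, not the paper's):

```python
import random

def random_walk(graph, root, length):
    """Uniform random walk of at most `length` vertices rooted at `root`.
    graph: dict mapping vertex -> list of neighbor vertices."""
    walk = [root]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(random.choice(neighbors))
    return walk
```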

#### connection: power laws

we observe that the frequency with which vertices appear in short random walks also follows a power-law distribution

word frequency in natural language follows a similar distribution

#### Language Modeling

**relaxations:**

The objective moves from predicting a vertex given all the previous vertices in the walk to predicting the surrounding context given a single vertex (the Skipgram formulation); see the reconstructed equations below.

- First, the order-independence assumption better captures a sense of *nearness* that is provided by random walks.
- Moreover, this relaxation is quite useful for speeding up training, since small models are built as one vertex is given at a time.
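
Reconstructed from the DeepWalk paper, the move is from the full language-model likelihood to the Skipgram objective (with $\Phi(v)$ the representation of vertex $v$ and $w$ the window size):

$$\Pr\big(v_i \mid \Phi(v_1), \Phi(v_2), \ldots, \Phi(v_{i-1})\big) \quad\longrightarrow\quad \min_{\Phi} \; -\log \Pr\big(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\big)$$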

### Method

#### DeepWalk

the algorithm consists of two main components:

- a random walk generator
- an update procedure

A walk samples uniformly from the neighbors of the last vertex visited until the maximum length $t$ is reached

At the start of each pass we generate a random ordering to traverse the vertices. This is not strictly required, but is well-known to speed up the convergence of stochastic gradient descent.
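
A compact Python sketch of this two-component loop (a sketch under stated assumptions: gensim's `Word2Vec` stands in for the Skipgram update, and the hyperparameter names and values are mine):

```python
import random
from gensim.models import Word2Vec  # assumed Skipgram implementation

def deepwalk(graph, num_walks=10, walk_len=40, window=5, dim=64):
    """graph: dict vertex -> list of neighbors. Returns {vertex: embedding}."""
    corpus = []
    vertices = list(graph)
    for _ in range(num_walks):        # passes over the vertices
        random.shuffle(vertices)      # random order speeds SGD convergence
        for v in vertices:
            walk = [v]                # walk generator: uniform neighbor steps
            while len(walk) < walk_len and graph[walk[-1]]:
                walk.append(random.choice(graph[walk[-1]]))
            corpus.append([str(u) for u in walk])
    # update procedure: Skipgram with hierarchical softmax (sg=1, hs=1)
    model = Word2Vec(corpus, vector_size=dim, window=window,
                     min_count=1, sg=1, hs=1)
    return {v: model.wv[str(v)] for v in graph}
```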

##### skipgram

To speed up training, Hierarchical Softmax can be used to approximate the probability distribution

##### hierarchical softmax

we assign the vertices to the leaves of a binary tree; the prediction problem then turns into maximizing the probability of a specific root-to-leaf path in the tree

Huffman coding is used to reduce the access time of frequent elements in the tree
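
In the paper's notation, if $(b_0, b_1, \ldots, b_{\lceil \log |V| \rceil})$ is the tree path from the root to leaf $u_k$, the path probability factorizes into binary decisions:

$$\Pr\big(u_k \mid \Phi(v_j)\big) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr\big(b_l \mid \Phi(v_j)\big), \qquad \Pr\big(b_l \mid \Phi(v_j)\big) = \frac{1}{1 + e^{-\Phi(v_j) \cdot \Psi(b_l)}}$$

This reduces the per-prediction cost from $O(|V|)$ to $O(\log |V|)$.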

### Variants

- Streaming
- Non-random walks

## LINE: Large-scale Information Network Embedding

- We propose a novel network embedding model called the “LINE,” which suits arbitrary types of information networks and easily scales to millions of nodes. It has a carefully designed objective function that preserves both the first-order and second-order proximities.
- We propose an edge-sampling algorithm for optimizing the objective. The algorithm tackles the limitation of classical stochastic gradient descent and improves the effectiveness and efficiency of the inference.

### Problem definition

**First-order Proximity:**

the local pairwise proximity between two vertices

**Second-order Proximity:**

the second-order proximity between a pair of vertices $(u, v)$ in a network is the similarity between their neighborhood network structures

### Large-Scale Information Network Embedding

#### LINE with First-order Proximity

define the joint probability between vertices $v_i$ and $v_j$ as follows:

empirical probability:

objective function:

Replacing d(·, ·) with KL-divergence and omitting some constants, we have:
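
The four equations are omitted in these notes; reconstructed from the LINE paper, with $\vec{u}_i$ the embedding of $v_i$ and $W = \sum_{(i,j) \in E} w_{ij}$:

$$p_1(v_i, v_j) = \frac{1}{1 + \exp\big(-\vec{u}_i^{\top} \vec{u}_j\big)}, \qquad \hat{p}_1(i, j) = \frac{w_{ij}}{W}$$

$$O_1 = d\big(\hat{p}_1(\cdot, \cdot), \, p_1(\cdot, \cdot)\big) \;\;\Longrightarrow\;\; O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$$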

#### LINE with Second-order Proximity

define the probability of context $v_j$ being generated by vertex $v_i$ as follows:

empirical probability:

objective function:

Replacing d(·, ·) with KL-divergence and omitting some constants, we have:
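
Likewise reconstructed from the paper, with $\vec{u}_j'$ the "context" vector of $v_j$, $d_i$ the out-degree of $v_i$, and $\lambda_i$ a vertex-importance weight (set to $d_i$ in the paper):

$$p_2(v_j \mid v_i) = \frac{\exp\big(\vec{u}_j'^{\top} \vec{u}_i\big)}{\sum_{k=1}^{|V|} \exp\big(\vec{u}_k'^{\top} \vec{u}_i\big)}, \qquad \hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}$$

$$O_2 = \sum_{i \in V} \lambda_i \, d\big(\hat{p}_2(\cdot \mid v_i), \, p_2(\cdot \mid v_i)\big) \;\;\Longrightarrow\;\; O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$$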

### Model Optimization

- negative sampling
- Optimization via Edge Sampling
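
For each edge $(i, j)$, negative sampling replaces the full softmax with the objective below (reconstructed from the paper; $K$ negative samples, noise distribution $P_n(v) ∝ d_v^{3/4}$):

$$\log \sigma\big(\vec{u}_j'^{\top} \vec{u}_i\big) + \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)} \Big[ \log \sigma\big(-\vec{u}_n'^{\top} \vec{u}_i\big) \Big]$$

Edge sampling then draws edges with probability proportional to their weights (via the alias method, $O(1)$ per draw) and treats each sampled edge as binary, so the gradient scale no longer depends on the spread of the edge weights $w_{ij}$.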