# CS224W diagram neural network learning notes (11) Link Analysis: Pagerank

2023-01-05   ES

web (Internet) can essentially see as a Direct Graph, nodes are various web pages, and are hyperlinks between web pages.

For such a direction, we consider two questions:

• given node v v , which nodes can be reached v v
• From node v v Which nodes can be reached?

For the first point, we use a collection I n ( v ) = { w ∣ w  can reach  v } In(v)=\{w| w \text{ can reach }v\} Come, for example, the following picture, I n ( A ) = { A , B , C , E , G } In(A)=\{A,B,C,E,G\} . For the second point, we use a collection O u t ( v ) = { w ∣ v  can reach  w } Out(v)=\{w|v \text{ can reach } w\} to indicate, that is, O u t ( A ) = { A , B , C , D , F } Out(A)=\{A,B,C,D,F\}

Any direction (web) can be represented by the two types below:

type style Description
Strongly Connected strong connection chart Any nodes can be reached through a path, that is, I n ( A ) = O u t ( A ) = { A , B , C , D , E } In(A)=Out(A)=\{A,B,C,D,E\}
Direct ACYCLIC Graph (DAG) has a ringless diagram There is no ring circuit, node u u can reach the node v v , but nodes v v Can’t reach the node u u

So, we can pay attention to the complex networkStrongly Connected Component (SCC). SCC is actually a collection of points S S , in the collection

There are paths between any two points in any intention of S S , and ensure that other SCC collection does not include this collection S S

For example, the scc of the above figure includes: {a, b, c, g}, {d}, {e}, {f}.

Any orientation of any orientation is the direction without a loop.The following is a proof–

So how do we find nodes

What about the strong connection of v v ? We know by defining, I n ( v ) In(v) is all can reach the node v v node collection, O u t ( v ) Out(v) is all nodes v v The node collection that can be reached, then the node

The component of v v is O u t ( v ) ⋂ I n ( v ) = O u t ( v , G ) ⋂ O u t ( v , G ′ ) Out(v) \bigcap In(v)=Out(v,G) \bigcap Out(v,G’) , of which G ′ G’ Yes

The reversal of the G G edge. We can use the depth of the graph to priority search (DFS)/breadth priority search (BFS) to get O u t ( v , G ) Out(v,G) and O u t ( v , G ′ ) Out(v,G’)

Use BFS algorithm to calculate I n ( v ) In(v) and O u t ( v ) Out(v) , the number of nodes visited is relatively different. (The BFS Either Visits Many Nodes or Very Few)

Through the experiment, it was found that on the web network, for a random node v v Let’s talk about it, O u t ( v ) ≈ 100  million (50% nodes) Out(v) \approx 100 \text{ million (50\% nodes)} , and I n ( v ) ≈ 100  million (50% nodes) In(v) \approx 100 \text{ million (50\% nodes)} ; Its great strong connection components include 56 Million nodes, which is about 28%of all nodes.

In this way, the calculation diagram of the web can be represented in the following form:

The starting point of

Pagerank is that the nodes in the network are not equally important). We will introduce the following link analysis methods to calculate the importance of nodes in the graph:

• PageRank
• Personalized PageRank
• Random Walk with Restarts

Obviously, the importance of nodes can be reflected through its link -Page is more important if it has more links. Of course, these links can be reflected in two aspects: In-COMING Links and Out-Going Links. At the same time, the importance of node external links can also reflect its own importance. Links from Important Pages Count More -this is a recursive problem.

In general:

The importance ofA “vote” from an important page is worth more.

Each node will be passed to its external link node on average. Set nodes

The important degree of i i is r i r_i , there is d i d_i Out-links, then the important degree it passed to these nodes is r i d i \frac {r_i}{d_i} . For example, the node in the figure j j important level r j = r i 3 + r k 4 r_j=\frac{r_i}{3}+\frac{r_k}{4}

, that is,A page is important if it is pointed to by other important pages. At the same time, we can define nodes

Ranking of j j :

r j = ∑ i → j r i d i r_j=\sum_{i \to j} \frac{r_i}{d_i}

among them d i d_i is node

Out degree. i i

Let’s give an example:

We can get Flow Equation:

r y = r y 2 + r a 2 r a = r y 2 + r m r m = r a 2 \begin{aligned} r_y &=\frac{r_y}{2}+\frac{r_a}{2} \\ r_a &=\frac{r_y}{2}+r_m \\ r_m &=\frac{r_a}{2} \\ \end{aligned}

Here is three unknown, three equations, can solve related parameters.

We can write the above Flow Equation as a matrix form:

r = M ⋅ r ,   r j = ∑ i → j r i d i r=M \cdot r, \text{ } r_j=\sum_{i \to j} \frac{r_i}{d_i}

among them M M is a column random matrix. Set nodes j j ‘s output is d j d_j , if j → i j \to i , then M i j = 1 d j M_{ij}=\frac{1}{d_j} , that is, the matrix

The column and 1 of M M are 1.

r i r_i is the webpage

The important degree of i i , ∑ i r i = 1 \sum_i r_i=1

Suppose someone is browsing the web, at time

At t t , he was browsing the web i i ; at time

At t + 1 t+1 , this person will choose the webpage randomly i i external link j j Access; then when surfing online, the above process will be repeated.

So, set the orientation p ( t ) p(t) indicates the state probability, p ( t ) i p(t)_i indicates at time t t Located on the webpage

The probability of i i ; that is, p ( t ) p(t) is based on the probability distribution of webpages.

So, in time t + 1 t+1 , p ( t + 1 ) = M ⋅ p ( t ) p(t+1)=M \cdot p(t) . if p ( t + 1 ) = M ⋅ p ( t ) = p ( t ) p(t+1)=M \cdot p(t)=p(t) , so we call it p ( t ) p(t) is a stamentary distribution of a random Walk.

Because Flow Equation is written as a matrix form r = M ⋅ r r=M \cdot r , so r r is a matrix M M Feature vector. We can solve the method of ** power iteration (Power Iteration) ** r r

Power Iteration

Suppose there is one with N N The web map of the node, where the node is the page and the hyperlink. Power iteration is a simple solution r r iteration scheme, you can allocate an initial r r value, repeat the following steps until convergence ( ∑ i ∣ r ( t + 1 ) − r t ∣ 1 < ε \sum_i |r^{(t+1)-r^{t}}|_1<\varepsilon ）。

• Initialization- r ( 0 ) = [ 1 N , ⋯   , 1 N ] T r^{(0)}=[\frac{1}{N},\cdots,\frac{1}{N}]^T
• iteration- r ( t + 1 ) = M ⋅ r ( t ) r^{(t+1)}=M \cdot r^{(t)} r j ( t + 1 ) = ∑ i → j r i ( t ) d i r_j^{(t+1)}=\sum_{i \to j}\frac{r_i^{(t)}}{d_i}
• iteration termination conditions- ∣ r ( t + 1 ) − r t ∣ 1 < ε |r^{(t+1)-r^{t}}|_1<\varepsilon

But can this algorithm be converged? Does its convergence results meet our expectations? Is convergence reasonable?

Actually, there are two problems in this algorithm:

(1) SOME PAGES Are DEAD ENDS -Some pages have no external links.

Solution:

（2）Spider traps——(all out-links are within the group)

Spider Traps solution:

Therefore, we can get the Pagerank equation–

r j = ∑ i → j β r i d i + ( 1 − β ) 1 N r_j=\sum_{i \to j} \beta \frac{r_i}{d_i}+(1-\beta) \frac{1}{N}

Write into matrix form:

r = A ⋅ r r=A \cdot r

among them A A is Google Matrix, A = β M + ( 1 − β ) [ 1 N ] N × N A=\beta M+(1-\beta)[\frac{1}{N}]_{N \times N} . here β \beta usually take 0.8,0.9.

The following is an example:

The calculation process of

Pagesk is mainly matrix multiplication r n e w = A ⋅ r o l d r^{new}=A \cdot r^{old} . But there is a problem that when the number of nodes is large, it takes a lot of memory.

We can re -organize the form of the PageRANK equation:

r = β M ⋅ r + [ 1 − β N ] N r=\beta M \cdot r+[\frac{1-\beta}{N}]_N

Here, matrix M M is a sparse matrix, and [ 1 − β N ] N [\frac{1-\beta}{N}]_N is a dense matrix. Therefore, in order to facilitate calculations, we calculate in each round of iteration r n e w = β M ⋅ r o l d r^{new}=\beta M \cdot r^{old} , and then r n e w r^{new} Based on a constant value ( 1 − β ) / N (1-\beta)/N

source