CS224W diagram neural network learning notes (11) Link Analysis: Pagerank

2023-01-05   ES  

Course link:CS224W: Machine Learning with Graphs
Course video:[Course] Stanford CS224W: Map Machine Learning (2019 Autumn | English)

web (Internet) can essentially see as a Direct Graph, nodes are various web pages, and are hyperlinks between web pages.

在这里插入图片描述
For such a direction, we consider two questions:

  • given node v v v, which nodes can be reached v v v
  • From node v v vWhich nodes can be reached?

For the first point, we use a collection I n ( v ) = { w ∣ w  can reach  v } In(v)=\{w| w \text{ can reach }v\} In(v)={
ww can reach v}
Come, for example, the following picture, I n ( A ) = { A , B , C , E , G } In(A)=\{A,B,C,E,G\} In(A)={
A,B,C,E,G}
. For the second point, we use a collection O u t ( v ) = { w ∣ v  can reach  w } Out(v)=\{w|v \text{ can reach } w\} Out(v)={
wv can reach w}
to indicate, that is, O u t ( A ) = { A , B , C , D , F } Out(A)=\{A,B,C,D,F\} Out(A)={
A,B,C,D,F}

在这里插入图片描述
Any direction (web) can be represented by the two types below:

type style Description
Strongly Connected strong connection chart 在这里插入图片描述 Any nodes can be reached through a path, that is, I n ( A ) = O u t ( A ) = { A , B , C , D , E } In(A)=Out(A)=\{A,B,C,D,E\} In(A)=Out(A)={
A,B,C,D,E}
Direct ACYCLIC Graph (DAG) has a ringless diagram 在这里插入图片描述 There is no ring circuit, node u u ucan reach the node v v v, but nodes v v vCan’t reach the node u u u

So, we can pay attention to the complex networkStrongly Connected Component (SCC). SCC is actually a collection of points S S S, in the collection

There are paths between any two points in any intention of S S S, and ensure that other SCC collection does not include this collection S S S

在这里插入图片描述
For example, the scc of the above figure includes: {a, b, c, g}, {d}, {e}, {f}.

Any orientation of any orientation is the direction without a loop.The following is a proof–

So how do we find nodes

What about the strong connection of v v v? We know by defining, I n ( v ) In(v) In(v)is all can reach the node v v vnode collection, O u t ( v ) Out(v) Out(v)is all nodes v v vThe node collection that can be reached, then the node

The component of v v vis O u t ( v ) ⋂ I n ( v ) = O u t ( v , G ) ⋂ O u t ( v , G ′ ) Out(v) \bigcap In(v)=Out(v,G) \bigcap Out(v,G’) Out(v)In(v)=Out(v,G)Out(v,G), of which G ′ G’ GYes

The reversal of the G G Gedge. We can use the depth of the graph to priority search (DFS)/breadth priority search (BFS) to get O u t ( v , G ) Out(v,G) Out(v,G)and O u t ( v , G ′ ) Out(v,G’) Out(v,G)

在这里插入图片描述
Use BFS algorithm to calculate I n ( v ) In(v) In(v)and O u t ( v ) Out(v) Out(v), the number of nodes visited is relatively different. (The BFS Either Visits Many Nodes or Very Few)

在这里插入图片描述
Through the experiment, it was found that on the web network, for a random node v v vLet’s talk about it, O u t ( v ) ≈ 100  million (50% nodes) Out(v) \approx 100 \text{ million (50\% nodes)} Out(v)100 million (50% nodes), and I n ( v ) ≈ 100  million (50% nodes) In(v) \approx 100 \text{ million (50\% nodes)} In(v)100 million (50% nodes); Its great strong connection components include 56 Million nodes, which is about 28%of all nodes.

In this way, the calculation diagram of the web can be represented in the following form:

The starting point of

Pagerank is that the nodes in the network are not equally important). We will introduce the following link analysis methods to calculate the importance of nodes in the graph:

  • PageRank
  • Personalized PageRank
  • Random Walk with Restarts

Obviously, the importance of nodes can be reflected through its link -Page is more important if it has more links. Of course, these links can be reflected in two aspects: In-COMING Links and Out-Going Links. At the same time, the importance of node external links can also reflect its own importance. Links from Important Pages Count More -this is a recursive problem.

In general:

The importance ofA “vote” from an important page is worth more.
在这里插入图片描述
Each node will be passed to its external link node on average. Set nodes

The important degree of i i iis r i r_i ri, there is d i d_i diOut-links, then the important degree it passed to these nodes is r i d i \frac {r_i}{d_i} diri. For example, the node in the figure j j jimportant level r j = r i 3 + r k 4 r_j=\frac{r_i}{3}+\frac{r_k}{4} rj=3ri+4rk

, that is,A page is important if it is pointed to by other important pages. At the same time, we can define nodes

Ranking of j j j:

r j = ∑ i → j r i d i r_j=\sum_{i \to j} \frac{r_i}{d_i} rj=ijdiri

among them d i d_i diis node

Out degree. i i i

Let’s give an example:

在这里插入图片描述
We can get Flow Equation:

r y = r y 2 + r a 2 r a = r y 2 + r m r m = r a 2 \begin{aligned} r_y &=\frac{r_y}{2}+\frac{r_a}{2} \\ r_a &=\frac{r_y}{2}+r_m \\ r_m &=\frac{r_a}{2} \\ \end{aligned} ryrarm=2ry+2ra=2ry+rm=2ra

Here is three unknown, three equations, can solve related parameters.

We can write the above Flow Equation as a matrix form:

r = M ⋅ r ,   r j = ∑ i → j r i d i r=M \cdot r, \text{ } r_j=\sum_{i \to j} \frac{r_i}{d_i} r=Mr, rj=ijdiri

among them M M Mis a column random matrix. Set nodes j j j‘s output is d j d_j dj, if j → i j \to i ji, then M i j = 1 d j M_{ij}=\frac{1}{d_j} Mij=dj1, that is, the matrix

The column and 1 of M M Mare 1.

r i r_i riis the webpage

The important degree of i i i, ∑ i r i = 1 \sum_i r_i=1 iri=1

在这里插入图片描述
Suppose someone is browsing the web, at time

At t t t, he was browsing the web i i i; at time

At t + 1 t+1 t+1, this person will choose the webpage randomly i i iexternal link j j jAccess; then when surfing online, the above process will be repeated.

So, set the orientation p ( t ) p(t) p(t)indicates the state probability, p ( t ) i p(t)_i p(t)iindicates at time t t tLocated on the webpage

The probability of i i i; that is, p ( t ) p(t) p(t)is based on the probability distribution of webpages.

在这里插入图片描述
So, in time t + 1 t+1 t+1, p ( t + 1 ) = M ⋅ p ( t ) p(t+1)=M \cdot p(t) p(t+1)=Mp(t). if p ( t + 1 ) = M ⋅ p ( t ) = p ( t ) p(t+1)=M \cdot p(t)=p(t) p(t+1)=Mp(t)=p(t), so we call it p ( t ) p(t) p(t)is a stamentary distribution of a random Walk.

Because Flow Equation is written as a matrix form r = M ⋅ r r=M \cdot r r=Mr, so r r ris a matrix M M MFeature vector. We can solve the method of ** power iteration (Power Iteration) ** r r r

Power Iteration

Suppose there is one with N N NThe web map of the node, where the node is the page and the hyperlink. Power iteration is a simple solution r r riteration scheme, you can allocate an initial r r rvalue, repeat the following steps until convergence ( ∑ i ∣ r ( t + 1 ) − r t ∣ 1 < ε \sum_i |r^{(t+1)-r^{t}}|_1<\varepsilon ir(t+1)rt1<ε)。

  • Initialization- r ( 0 ) = [ 1 N , ⋯   , 1 N ] T r^{(0)}=[\frac{1}{N},\cdots,\frac{1}{N}]^T r(0)=[N1,,N1]T
  • iteration- r ( t + 1 ) = M ⋅ r ( t ) r^{(t+1)}=M \cdot r^{(t)} r(t+1)=Mr(t) r j ( t + 1 ) = ∑ i → j r i ( t ) d i r_j^{(t+1)}=\sum_{i \to j}\frac{r_i^{(t)}}{d_i} rj(t+1)=ijdiri(t)
  • iteration termination conditions- ∣ r ( t + 1 ) − r t ∣ 1 < ε |r^{(t+1)-r^{t}}|_1<\varepsilon r(t+1)rt1<ε

在这里插入图片描述
But can this algorithm be converged? Does its convergence results meet our expectations? Is convergence reasonable?

Actually, there are two problems in this algorithm:

(1) SOME PAGES Are DEAD ENDS -Some pages have no external links.
在这里插入图片描述

Solution:

(2)Spider traps——(all out-links are within the group)
在这里插入图片描述

Spider Traps solution:

Therefore, we can get the Pagerank equation–

r j = ∑ i → j β r i d i + ( 1 − β ) 1 N r_j=\sum_{i \to j} \beta \frac{r_i}{d_i}+(1-\beta) \frac{1}{N} rj=ijβdiri+(1β)N1

Write into matrix form:

r = A ⋅ r r=A \cdot r r=Ar

among them A A Ais Google Matrix, A = β M + ( 1 − β ) [ 1 N ] N × N A=\beta M+(1-\beta)[\frac{1}{N}]_{N \times N} A=βM+(1β)[N1]N×N. here β \beta βusually take 0.8,0.9.

The following is an example:

The calculation process of

Pagesk is mainly matrix multiplication r n e w = A ⋅ r o l d r^{new}=A \cdot r^{old} rnew=Arold. But there is a problem that when the number of nodes is large, it takes a lot of memory.

We can re -organize the form of the PageRANK equation:

r = β M ⋅ r + [ 1 − β N ] N r=\beta M \cdot r+[\frac{1-\beta}{N}]_N r=βMr+[N1β]N

Here, matrix M M Mis a sparse matrix, and [ 1 − β N ] N [\frac{1-\beta}{N}]_N [N1β]Nis a dense matrix. Therefore, in order to facilitate calculations, we calculate in each round of iteration r n e w = β M ⋅ r o l d r^{new}=\beta M \cdot r^{old} rnew=βMrold, and then r n e w r^{new} rnewBased on a constant value ( 1 − β ) / N (1-\beta)/N (1β)/N

source

Random Posts

SpringBoot Three processing methods that are abnormally thrown by AuthorizationException when using shiro

[ubuntu environment-Docker builds Jenkins, installs Python automated test environment]

Quartz Deploy Table ‘HeartBeat.qrtzLocks’ DOESN’T Exist

FreeMarker page static

Use COMMORK and Flexmark to convert the Markdown format text into HTML format textpo in Java