Course link:CS224W: Machine Learning with Graphs
Course video:[Course] Stanford CS224W: Map Machine Learning (2019 Autumn | English)
directory
web (Internet) can essentially see as a Direct Graph, nodes are various web pages, and are hyperlinks between web pages.
For such a direction, we consider two questions:
- given node v v v, which nodes can be reached v v v?
- From node v v vWhich nodes can be reached?
For the first point, we use a collection I n ( v ) = { w ∣ w can reach v } In(v)=\{w| w \text{ can reach }v\} In(v)={
w∣w can reach v}Come, for example, the following picture, I n ( A ) = { A , B , C , E , G } In(A)=\{A,B,C,E,G\} In(A)={
A,B,C,E,G}. For the second point, we use a collection O u t ( v ) = { w ∣ v can reach w } Out(v)=\{w|v \text{ can reach } w\} Out(v)={
w∣v can reach w}to indicate, that is, O u t ( A ) = { A , B , C , D , F } Out(A)=\{A,B,C,D,F\} Out(A)={
A,B,C,D,F}。
Any direction (web) can be represented by the two types below:
type | style | Description |
---|---|---|
Strongly Connected strong connection chart | ![]() |
Any nodes can be reached through a path, that is, I n ( A ) = O u t ( A ) = { A , B , C , D , E } In(A)=Out(A)=\{A,B,C,D,E\} In(A)=Out(A)={ A,B,C,D,E}。 |
Direct ACYCLIC Graph (DAG) has a ringless diagram | ![]() |
There is no ring circuit, node u u ucan reach the node v v v, but nodes v v vCan’t reach the node u u u。 |
So, we can pay attention to the complex networkStrongly Connected Component (SCC). SCC is actually a collection of points S S S, in the collection
There are paths between any two points in any intention of S S S, and ensure that other SCC collection does not include this collection S S S。
For example, the scc of the above figure includes: {a, b, c, g}, {d}, {e}, {f}.
Any orientation of any orientation is the direction without a loop.The following is a proof–
So how do we find nodes
What about the strong connection of v v v? We know by defining, I n ( v ) In(v) In(v)is all can reach the node v v vnode collection, O u t ( v ) Out(v) Out(v)is all nodes v v vThe node collection that can be reached, then the node
The component of v v vis O u t ( v ) ⋂ I n ( v ) = O u t ( v , G ) ⋂ O u t ( v , G ′ ) Out(v) \bigcap In(v)=Out(v,G) \bigcap Out(v,G’) Out(v)⋂In(v)=Out(v,G)⋂Out(v,G′), of which G ′ G’ G′Yes
The reversal of the G G Gedge. We can use the depth of the graph to priority search (DFS)/breadth priority search (BFS) to get O u t ( v , G ) Out(v,G) Out(v,G)and O u t ( v , G ′ ) Out(v,G’) Out(v,G′)。
Use BFS algorithm to calculate I n ( v ) In(v) In(v)and O u t ( v ) Out(v) Out(v), the number of nodes visited is relatively different. (The BFS Either Visits Many Nodes or Very Few)
Through the experiment, it was found that on the web network, for a random node v v vLet’s talk about it, O u t ( v ) ≈ 100 million (50% nodes) Out(v) \approx 100 \text{ million (50\% nodes)} Out(v)≈100 million (50% nodes), and I n ( v ) ≈ 100 million (50% nodes) In(v) \approx 100 \text{ million (50\% nodes)} In(v)≈100 million (50% nodes); Its great strong connection components include 56 Million nodes, which is about 28%of all nodes.
In this way, the calculation diagram of the web can be represented in the following form:
The starting point of
Pagerank is that the nodes in the network are not equally important). We will introduce the following link analysis methods to calculate the importance of nodes in the graph:
- PageRank
- Personalized PageRank
- Random Walk with Restarts
Obviously, the importance of nodes can be reflected through its link -Page is more important if it has more links. Of course, these links can be reflected in two aspects: In-COMING Links and Out-Going Links. At the same time, the importance of node external links can also reflect its own importance. Links from Important Pages Count More -this is a recursive problem.
In general:
The importance ofA “vote” from an important page is worth more.
Each node will be passed to its external link node on average. Set nodes
The important degree of i i iis r i r_i ri, there is d i d_i diOut-links, then the important degree it passed to these nodes is r i d i \frac {r_i}{d_i} diri. For example, the node in the figure j j jimportant level r j = r i 3 + r k 4 r_j=\frac{r_i}{3}+\frac{r_k}{4} rj=3ri+4rk。
, that is,A page is important if it is pointed to by other important pages. At the same time, we can define nodes
Ranking of j j j:
r j = ∑ i → j r i d i r_j=\sum_{i \to j} \frac{r_i}{d_i} rj=i→j∑diri
among them d i d_i diis node
Out degree. i i i
Let’s give an example:
We can get Flow Equation:
r y = r y 2 + r a 2 r a = r y 2 + r m r m = r a 2 \begin{aligned} r_y &=\frac{r_y}{2}+\frac{r_a}{2} \\ r_a &=\frac{r_y}{2}+r_m \\ r_m &=\frac{r_a}{2} \\ \end{aligned} ryrarm=2ry+2ra=2ry+rm=2ra
Here is three unknown, three equations, can solve related parameters.
We can write the above Flow Equation as a matrix form:
r = M ⋅ r , r j = ∑ i → j r i d i r=M \cdot r, \text{ } r_j=\sum_{i \to j} \frac{r_i}{d_i} r=M⋅r, rj=i→j∑diri
among them M M Mis a column random matrix. Set nodes j j j‘s output is d j d_j dj, if j → i j \to i j→i, then M i j = 1 d j M_{ij}=\frac{1}{d_j} Mij=dj1, that is, the matrix
The column and 1 of M M Mare 1.
r i r_i riis the webpage
The important degree of i i i, ∑ i r i = 1 \sum_i r_i=1 ∑iri=1。
Suppose someone is browsing the web, at time
At t t t, he was browsing the web i i i; at time
At t + 1 t+1 t+1, this person will choose the webpage randomly i i iexternal link j j jAccess; then when surfing online, the above process will be repeated.
So, set the orientation p ( t ) p(t) p(t)indicates the state probability, p ( t ) i p(t)_i p(t)iindicates at time t t tLocated on the webpage
The probability of i i i; that is, p ( t ) p(t) p(t)is based on the probability distribution of webpages.
So, in time t + 1 t+1 t+1, p ( t + 1 ) = M ⋅ p ( t ) p(t+1)=M \cdot p(t) p(t+1)=M⋅p(t). if p ( t + 1 ) = M ⋅ p ( t ) = p ( t ) p(t+1)=M \cdot p(t)=p(t) p(t+1)=M⋅p(t)=p(t), so we call it p ( t ) p(t) p(t)is a stamentary distribution of a random Walk.
Because Flow Equation is written as a matrix form r = M ⋅ r r=M \cdot r r=M⋅r, so r r ris a matrix M M MFeature vector. We can solve the method of ** power iteration (Power Iteration) ** r r r。
Power Iteration
Suppose there is one with N N NThe web map of the node, where the node is the page and the hyperlink. Power iteration is a simple solution r r riteration scheme, you can allocate an initial r r rvalue, repeat the following steps until convergence ( ∑ i ∣ r ( t + 1 ) − r t ∣ 1 < ε \sum_i |r^{(t+1)-r^{t}}|_1<\varepsilon ∑i∣r(t+1)−rt∣1<ε)。
- Initialization- r ( 0 ) = [ 1 N , ⋯ , 1 N ] T r^{(0)}=[\frac{1}{N},\cdots,\frac{1}{N}]^T r(0)=[N1,⋯,N1]T。
- iteration- r ( t + 1 ) = M ⋅ r ( t ) r^{(t+1)}=M \cdot r^{(t)} r(t+1)=M⋅r(t), r j ( t + 1 ) = ∑ i → j r i ( t ) d i r_j^{(t+1)}=\sum_{i \to j}\frac{r_i^{(t)}}{d_i} rj(t+1)=∑i→jdiri(t)。
- iteration termination conditions- ∣ r ( t + 1 ) − r t ∣ 1 < ε |r^{(t+1)-r^{t}}|_1<\varepsilon ∣r(t+1)−rt∣1<ε。
But can this algorithm be converged? Does its convergence results meet our expectations? Is convergence reasonable?
Actually, there are two problems in this algorithm:
(1) SOME PAGES Are DEAD ENDS -Some pages have no external links.
Solution:
(2)Spider traps——(all out-links are within the group)。
Spider Traps solution:
Therefore, we can get the Pagerank equation–
r j = ∑ i → j β r i d i + ( 1 − β ) 1 N r_j=\sum_{i \to j} \beta \frac{r_i}{d_i}+(1-\beta) \frac{1}{N} rj=i→j∑βdiri+(1−β)N1
Write into matrix form:
r = A ⋅ r r=A \cdot r r=A⋅r
among them A A Ais Google Matrix, A = β M + ( 1 − β ) [ 1 N ] N × N A=\beta M+(1-\beta)[\frac{1}{N}]_{N \times N} A=βM+(1−β)[N1]N×N. here β \beta βusually take 0.8,0.9.
The following is an example:
The calculation process of
Pagesk is mainly matrix multiplication r n e w = A ⋅ r o l d r^{new}=A \cdot r^{old} rnew=A⋅rold. But there is a problem that when the number of nodes is large, it takes a lot of memory.
We can re -organize the form of the PageRANK equation:
r = β M ⋅ r + [ 1 − β N ] N r=\beta M \cdot r+[\frac{1-\beta}{N}]_N r=βM⋅r+[N1−β]N
Here, matrix M M Mis a sparse matrix, and [ 1 − β N ] N [\frac{1-\beta}{N}]_N [N1−β]Nis a dense matrix. Therefore, in order to facilitate calculations, we calculate in each round of iteration r n e w = β M ⋅ r o l d r^{new}=\beta M \cdot r^{old} rnew=βM⋅rold, and then r n e w r^{new} rnewBased on a constant value ( 1 − β ) / N (1-\beta)/N (1−β)/N。