
Divide and Conquer Approach for Efficient PageRank Computation

Prasanna Desikan, Nishith Pathak, Jaideep Srivastava, Vipin Kumar

Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455 USA

[email protected]

ABSTRACT
PageRank is a popular ranking metric for large graphs such as the World Wide Web. Current research techniques for improving the computational efficiency of PageRank have focused on improving I/O cost, accelerating convergence, and parallelizing the computation process. In this paper, we propose a "divide and conquer" strategy for efficient computation of PageRank. The strategy differs from contemporary improvements in that it can be combined with any existing enhancement to PageRank, giving rise to an entire class of more efficient algorithms. We present a novel graph-partitioning technique for dividing the graph into subgraphs on which computation can be performed independently. This approach has two significant benefits. Firstly, since the approach focuses on work reduction, it can be combined with any existing enhancement to PageRank. Secondly, the proposed approach leads naturally to an incremental approach for computing such ranking metrics, given that these large graphs evolve over a period of time. The partitioning technique is both lossless and independent of the type (variant) of PageRank computation algorithm used. The experimental results for a static single graph (the graph at a single time instance) as well as for incremental computation on evolving graphs illustrate the utility of our novel partitioning approach. The proposed approach can also be applied to the computation of any other metric based on a first order Markov chain model.

Categories and Subject Descriptors G.4. [Mathematical Software]: Efficiency, Algorithm Design and Analysis.

General Terms Algorithms, Performance, Design, Theory

Keywords PageRank, Efficient Computation, Ranking Measures, Graph Partitioning

Copyright is held by the author/owner(s). ICWE'06, July 11-14, 2006, Palo Alto, California, USA. ACM 1-59593-352-2/06/0007.

1. INTRODUCTION

Link analysis techniques have been used widely for developing ranking metrics in large graphs such as the Web. The principal observation is that a hyperlink from a source page to a destination page serves as an endorsement of the destination page by the (author of the) source page on some topic. Link based metrics for Web graphs have been found to provide stable rankings for Web search, avoiding issues related to text spamming. Information on various link based metrics, such as Kleinberg's HITS algorithm [2], is discussed in the survey [3]. Among the different link based metrics on the Web graph, the PageRank metric [1] has gained significant prominence with the success of Google. The primary key to its success has been the dependence of a page's rank on the pages pointing to it, thus reducing the chances of a user biasing the rank of a page he or she created. Secondly, PageRank is precomputed for the whole Web graph and is query independent, making it a faster approach to rank results during a search operation. The popularity and stability of PageRank have led to a variety of modifications to the underlying PageRank model addressing different scenarios, such as topic sensitivity [4], usage analysis [5], and ranking biased among different clusters [6]. The issue of efficient computation of PageRank has also captured the attention of the research community. Haveliwala proposed an efficient computation approach for PageRank [6] using a block based technique with efficient I/O. Improvements to such I/O-efficient methods [7, 8] and accelerated convergence [9, 10] for PageRank have also been well studied. The various issues related to PageRank computation have been extensively covered in recent surveys [11, 12].

In this paper our contribution is twofold. Firstly, we propose work reduction techniques through graph partitioning to break the problem of computation on a large graph into computation on smaller subgraphs. The partitioning approach is not an approximation method and hence does not result in loss of information. Also, this approach of work reduction is complementary to other approaches, and hence can be used in tandem with other efficient computation methods to further improve the efficiency. In the second part of our work, we address the issue of computation on an evolving Web graph. A straightforward approach would be to compute these measures for the whole Web graph at each time instance. However, given the size of the Web graph, this is becoming increasingly infeasible. Furthermore, if the percentage of vertices that change during a typical time interval between search engine crawls is not high, a large portion

of the computation cost may be wasted on re-computing the scores for the unchanged portion. Hence, there is a need to compute metrics incrementally, to save on computation costs. Chien et al. [13] propose an approximation approach to incrementally compute PageRank. Our approach, in contrast, relies on a sound theoretical partitioning criterion that results in a lossless incremental computation of PageRank. Initial work on the incremental approach was presented earlier [18]. Our results indicate that we achieve significant improvement in computation time with such an approach.

This paper is organized as follows. In the next section, we give an overview of the PageRank metric and its underlying model. In Section 3, we describe the theoretical framework of our proposed divide and conquer approach. Section 4 discusses the methodology for using the above mentioned approach to compute PageRank for a large graph at a single time instance, and the extension of this approach to the computation of PageRank on large evolving graphs. Experiments and results supporting the approach are presented in Section 5. Section 6 provides conclusions and discusses possible future work.

2. PAGERANK OVERVIEW

PageRank is a metric for ranking hypertext documents that determines their quality. It was originally developed by Page et al. [1] for the popular search engine, Google [14]. The key idea is that a page has high rank if it is pointed to by many highly ranked pages. Thus, the rank of a page depends upon the ranks of the pages pointing to it. The rank of a page p can be written as:

PR(p) = d/n + (1 − d) · Σ_{(q,p) ∈ G} PR(q) / OutDegree(q)        (1)

Here, n is the number of vertices in the graph and OutDegree(q) is the number of hyperlinks on page q. Intuitively, the approach can be viewed as a stochastic analysis of a random walk on the Web graph. The first term on the right hand side of the equation corresponds to the probability that a random Web surfer arrives at page p from somewhere, i.e. (s)he could arrive at the page by typing the URL or from a bookmark, or may have that particular page as his/her homepage. d would then be the probability that a random surfer chooses a URL directly – i.e. by typing it, using the bookmark list, or by default – rather than traversing a link. Finally, 1/n is the uniform probability that a person chooses page p from the complete set of n pages on the Web. The second term on the right hand side of the equation corresponds to the factor contributed by arriving at a page by traversing a link. 1 − d is the probability that a person arrives at page p by traversing a link. The summation corresponds to the sum of the rank contributions made by all the pages that point to page p. The rank contribution is the PageRank of the page multiplied by the probability that a particular link on the page is traversed. So for any page q pointing to page p, the probability that the link pointing to page p is traversed would be 1/OutDegree(q), assuming all links on the page are chosen with uniform probability. Figure 1 illustrates an example of computing the PageRank of a page P from the pages P1, P2, P3 pointing to it.

[Figure: pages P1, P2 and P3 each point to page P; each edge carries a transition probability 1/OutDeg(Pi), and the random-jump term contributes d/N, so that]

PR(P) = d/N + (1 − d) · ( PR(P1)/OutDeg(P1) + PR(P2)/OutDeg(P2) + PR(P3)/OutDeg(P3) )

Figure 1. Illustrative Example of PageRank

There are other computational challenges that arise in PageRank. Apart from the issue of scalability, the other important computational issues are the convergence of the PageRank iteration and the handling of dangling vertices. The convergence of PageRank is guaranteed only if the Web graph is strongly connected and aperiodic. To ensure the condition of strong connectedness, the dampening factor is introduced, which assigns a uniform probability of jumping to any page. In a graph-theoretic sense it is equivalent to adding an edge between every pair of vertices with a transition probability of d/n. The aperiodicity property is also guaranteed for the Web graph. Another important issue in the computation of PageRank is the handling of dangling vertices. Dangling vertices are vertices with no outgoing edge. These vertices tend to act as rank sinks, as there is no way for their rank to be distributed among the other vertices. The suggestion made initially to address this problem was to iteratively remove all the vertices that have an outdegree of zero, and compute the PageRank on the remaining vertices [1]. The reasoning here was that dangling vertices do not affect the PageRank of other vertices. Another suggested approach was to remove the dangling vertices at the start of the computation and add them back during the final iterations [15]. Other popular approaches to handling dangling vertices are to add self-loops to dangling vertices [16, 17], and to add links from each dangling vertex to all vertices in the graph G, distributing the PageRank of the dangling vertex uniformly among all vertices [1]. In this paper we handle dangling vertices by adding self-loops to all vertices.
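To make the computation concrete, the following is a minimal Python sketch of the iteration in Equation (1), with dangling vertices handled by self-loops as adopted in this paper. The adjacency-list representation and the function name are our own illustration, not from the paper, and every vertex that appears as a link target is assumed to also appear as a key of out_links.

def pagerank(out_links, d=0.1, tol=1e-8, max_iter=100):
    """Naive PageRank following Equation (1):
    PR(p) = d/n + (1 - d) * sum over in-links (q, p) of PR(q) / OutDegree(q)."""
    # Handle dangling vertices by adding self-loops, as done in this paper.
    for v, targets in out_links.items():
        if not targets:
            targets.append(v)
    n = len(out_links)
    pr = {v: 1.0 / n for v in out_links}              # uniform initial scores
    for _ in range(max_iter):
        new_pr = {v: d / n for v in out_links}        # the "random jump" term d/n
        for q, targets in out_links.items():
            share = (1.0 - d) * pr[q] / len(targets)  # rank passed along each out-link
            for p in targets:
                new_pr[p] += share
        if sum(abs(new_pr[v] - pr[v]) for v in out_links) < tol:
            return new_pr                             # converged in L1 norm
        pr = new_pr
    return pr

For the graph of Figure 1, for instance, the call would be pagerank({'P1': ['P'], 'P2': ['P'], 'P3': ['P'], 'P': []}), where P receives a self-loop before the iteration starts. The default d of 0.1 and tolerance of 10^-8 mirror the settings used in the experiments of Section 5.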

3. PROPOSED APPROACH


In the proposed approach we make use of the fact that PageRank is based on a first order Markov model. In such a model, if a vertex belonging to one set cannot be reached from a vertex belonging to any other set, then the score of this vertex depends only on the vertices of the set to which it belongs. This is because in a first order Markov model the present state depends upon only the previous state, and to arrive at the present state we need an incoming link from a previous vertex. This leads to the idea that the PageRank of the vertices belonging to a set A does not depend on the PageRank of the vertices of another set B if there are no incoming links from vertices in set B to vertices in set A. In such a scenario, the PageRank of vertices in set A can be computed independently of the PageRank of vertices in set B. We make use of this criterion to divide the graph into partitions of "red patches" and "yellow patches" such that there is no link that points from any "red patch" to another "red patch" or "yellow patch", and there are no outgoing links from a "yellow patch" to any of the red patches. Once we can partition the graph in such a manner into sets of "red patches" and "yellow patches",

we can then compute the PageRank of vertices in the "red patches" independently, and follow it with the computation of PageRank for vertices in the "yellow" patch. Such an approach has two advantages. Firstly, it reduces the size of the problem by breaking the graph into smaller subgraphs of "red patches" and "yellow patches". Such a reduced problem size helps in fitting the graph in main memory without requiring a machine with a high RAM capacity. Secondly, since the computation for the "red patches" can be carried out independently, this process can be parallelized, leading to further optimization and savings in computation time. However, we do not deal with parallelization issues in this paper.
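Because the criterion above is purely structural, whether one vertex set can be ranked independently of another reduces to a simple edge check. A small illustrative helper (the names are ours, not the paper's) might be:

def is_rank_independent(out_links, set_a, set_b):
    # True if no edge points from a vertex in set_b to a vertex in set_a; under a
    # first order Markov model the scores of set_a then do not depend on set_b.
    return not any(p in set_a for q in set_b for p in out_links.get(q, []))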

Let us consider a graph G = ⟨V, E⟩. The idea is to partition graph G into components G1, G2, …, Gk, such that:

(a) V1 ∪ V2 ∪ … ∪ Vk = V, with Vi ∩ Vj = ∅ for i ≠ j

(b) (E1 ∪ E2 ∪ … ∪ Ek) ∪ Epartition = E, with Ei ∩ Ej = ∅ for i ≠ j

For PageRank, which is based on a first order Markov model, further constraints apply to prevent cyclic flow of information; for a given partition Gi:

¬∃ exy ∈ E such that vx ∈ Gi ∧ vy ∈ G − Gi

Such a partition Gi corresponds to the definition of a "red patch" described earlier. In the figure, the graph G is partitioned into four partitions such that G1, G3 and G4 correspond to the "red patches" discussed earlier, and G2 corresponds to the "yellow" patch. We will now describe the scheme to compute PageRank on such a partitioned graph. Let a graph be partitioned into k "red patches", Gr1 to Grk, and a single yellow patch Gy. The edges from the "red patches" to the "yellow patch" (represented as dotted edges in the figure) form the set Epartition. In a given "red patch" Gri, let the vertices (represented by annular circles in the figure) that are the source end of an edge belonging to Epartition be denoted as Vborder,ri. Let us define a graph G′ such that:

G′ = ⟨V′, E′⟩, where V′ = (Vborder,r1 ∪ … ∪ Vborder,rk) ∪ Vy and E′ = Epartition ∪ Ey

This G′ corresponds to the yellow patch. The PageRank for the whole graph G can now be computed by computing PageRanks independently for all individual "red patches", Gr1 to Grk, and then computing for the graph G′ as defined above. The partitioning scheme presented above is purely based on the fact that the measure of certain nodes is independent of the measure of other nodes that do not point to them, and depends only on the previous nodes pointing to them. Thus this scheme as such is not restricted or specific to PageRank computation, but can be used for the computation of any measure based on such a first order Markov model. In this work, our contribution is the presentation of such a scheme, which can be used for improving computation on a single large graph and, more importantly, for efficient incremental computation on evolving graphs. A detailed overview of the incremental approach is discussed in our previous work [18]. The most significant feature of our approach is that it does not result in loss of rank scores, as we make no approximations such as grouping a set of vertices into a single vertex. In the following section, we present algorithms to make such partitions. However, we do not claim such partitions to be optimal, or the algorithm to be an optimal approach to partitioning the graph for the most efficient computation. We show the feasibility of our approach by presenting naïve algorithms that can still be used to improve computation costs.

4. EFFICIENT COMPUTATION TECHNIQUES

In this section we present efficient computation methods for lossless parallelization of PageRank computation for a static single graph, as well as an approach for lossless incremental computation of PageRank for an evolving graph. Both these approaches are based on the partitioning scheme presented in the previous section.

4.1 Static Single Graph

We present a methodology for parallelizing the PageRank computation for a static single graph, using the partitioning technique presented in the previous section. Assume that we have a graph partitioned into a set of red patches surrounded by a yellow patch. PageRank scores for vertices in the red patches are computed first. The red patches can be treated as independent subgraphs and their PageRank computation is performed in parallel. Next, using the PageRank scores for "peripheral" red colored vertices, i.e. red colored vertices with edges crossing over to the yellow patch, we compute the PageRank scores for the vertices in the yellow patch. For the PageRank computation of the red patches and the yellow patch, the user is free to choose any variant (or the naïve version itself) of the PageRank algorithm, as the partitioning scheme results in nothing but a reduction of the problem size and is independent of the algorithm used for PageRank computation. Thus, we have a lossless parallel PageRank computation method for a static single graph based on our partitioning scheme.
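A minimal sketch of this methodology, reusing the pagerank() sketch from the example in Section 2, is given below. The helper names and the use of ProcessPoolExecutor are our own illustration; the paper's exact treatment of border out-degrees and of the d/n term inside each red sub-problem is glossed over here.

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def dc_pagerank(out_links, red_patches, yellow_vertices, d=0.1, tol=1e-8):
    # Step 1: each red patch is an independent PageRank sub-problem (run in parallel).
    # Edges leaving a patch (Epartition) are dropped from the sub-problem here; an
    # exact implementation would also carry the full-graph n into the d/n term.
    red_subgraphs = [{v: [p for p in out_links[v] if p in patch] for v in patch}
                     for patch in red_patches]
    scores = {}
    with ProcessPoolExecutor() as pool:
        for patch_scores in pool.map(partial(pagerank, d=d, tol=tol), red_subgraphs):
            scores.update(patch_scores)

    # Step 2: iterate only over the yellow vertices; contributions from peripheral
    # red vertices are constants taken from the scores computed in Step 1.
    n = len(out_links)
    in_links = {p: [] for p in yellow_vertices}
    for q, targets in out_links.items():
        for p in targets:
            if p in in_links:
                in_links[p].append(q)
    pr = {v: 1.0 / n for v in yellow_vertices}
    for _ in range(100):
        new = {p: d / n + (1 - d) * sum(
                   (scores[q] if q in scores else pr[q]) / len(out_links[q])
                   for q in in_links[p])
               for p in yellow_vertices}
        if sum(abs(new[v] - pr[v]) for v in yellow_vertices) < tol:
            pr = new
            break
        pr = new
    scores.update(pr)
    return scores

In a script, the ProcessPoolExecutor call would normally sit under an if __name__ == '__main__' guard; a sequential loop over the red patches produces the same scores, only more slowly.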

The user is free to design and use any patch extraction algorithm, provided the patches extracted have the desired properties discussed in Section 3. One can perform the patch extraction during the crawling procedure or as a pre-processing step. For the purpose of showing that such patches indeed exist and can be extracted, we have implemented a naïve algorithm. The following steps briefly describe the algorithm (a short code sketch of Steps 1-5 is given at the end of this subsection). Consider that we have three sets of vertices colored red, yellow and black. Red colored vertices are those which belong to some red patch. Yellow colored vertices belong to the yellow patch. The black colored vertices are the unexplored vertices. Initially, since all the vertices are unexplored, there are no red or yellow colored vertices; all the vertices in the graph are colored black.

Step 1 – Randomly pick a black colored vertex.

Step 2 – Perform a reverse BFS on this vertex, i.e. explore all the ancestors of this vertex by traversing along the incoming links. Color all the black vertices encountered red. The reverse BFS does not stop until no further traversal is possible or a red vertex is encountered.

Step 3 – Label this set of red vertices as a red patch. Select all "peripheral" vertices, i.e. vertices in this red patch that have edges crossing over to any vertex outside this red patch.

Step 4 – Perform a normal BFS on each of the peripheral vertices, coloring all the black vertices encountered yellow. Each of the BFSs continues until no further traversal is possible or a yellow vertex is encountered. It is not possible for any of the BFSs to encounter a red vertex. Thus, we color all the descendants of each of the peripheral vertices yellow.

Step 5 – We now have a red patch surrounded by a yellow patch region. Repeat the entire procedure from Steps 1-4 (illustrated in Figure 3), each time extracting a different red patch, until all the black vertices are exhausted.

[Figure 3. Steps 1 to 4 of the approach]

Note that in Step 2 it is neither possible to encounter a red vertex from a previously extracted red patch nor a yellow colored vertex. Also, in Step 4 it is not possible to encounter a red colored vertex from a previously extracted red patch; however, one may encounter a yellow vertex. It is observed that the red patches extracted in Steps 1-5 contain a small fraction of the vertices in the graph. An ideal situation would be one where the entire graph is partitioned into equal sized red patches, i.e. no yellow patch. This is because we compute PageRank for the red patches in parallel, and the more vertices there are in the red patches, the greater the gain in efficiency. Even though we may not realize the ideal situation, it is still desirable to include as much of the graph as possible in red patches. Therefore, the second part of the algorithm expands the red patches obtained above, i.e. includes as many vertices as possible in them.

Step 6 – Pick a red patch, call it R.

Step 7 – Perform a reverse BFS on each of the yellow colored children of the peripheral vertices of this red patch. If during the process of traversal a red vertex belonging to red patch R is encountered, then all the vertices encountered so far are colored red and included in the red patch R. If a red colored vertex from some other red patch is encountered, then the red patch R cannot be expanded along this path and the reverse BFS is abandoned, leaving all vertices as they were. If one encounters a dead end, i.e. a yellow vertex with no incoming links, then again all the vertices encountered so far are colored red and included in the red patch R.

[Figure 4. Steps 6 to 7 of the approach]

Step 8 – Perform Steps 6 and 7 for each red patch (Steps 6-7 are illustrated in Figure 4).

Step 9 – Repeat Steps 6-8 until no change is observed in the size of any of the red patches, i.e. no red patch can be expanded any more.

Step 10 – Return the list of red patches and the yellow colored vertices. These are then used for parallel PageRank computation as explained earlier.

After patch extraction, the graph can be treated for dangling vertices by adding a self-loop to each dangling vertex, or one may also choose to delete these vertices. The first option will have no effect on the patches extracted. The second option may result in certain vertices being deleted; however, it will not result in a situation where the properties of the patches extracted are violated. One may also choose to perform these operations before the patch extraction process. Adding edges from a dangling vertex to every other vertex, however, has an adverse effect on our partitioning scheme. In such a case, every vertex will have an incoming link from a dangling vertex. Therefore, every red patch will have to contain these dangling vertices. Since all red patches are disjoint, it is not possible to have more than one red patch. Therefore, in spite of its popularity, this technique for handling dangling vertices will not work with our partitioning scheme.

A good partitioning scheme would be one where we get patches of a good profile. By a good profile we mean a set of patches that

maximize the total number of vertices included in red patches, minimize the size of the largest red patch, and minimize the skew in the red patch sizes. Note that the patch extraction algorithm presented (and implemented by us) is quite naïve and not optimal. For instance, we do not take any steps to prevent skew in the red patch sizes. Also note that the profile of the patches (i.e. properties such as the number of patches, the sizes of patches and the skew in patch sizes) that are extracted from a graph depends on how the black colored vertices are picked in Step 1. In our case we choose to pick a vertex randomly; we expect more optimal or "intelligent" picking schemes to come up with a better patch profile when compared to a random one such as ours. In Section 5 we show that, in spite of the sub-optimal patch extraction algorithm used by us, we still get favorable results.
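A sketch of Steps 1-5 in code, assuming both forward (out_links) and reverse (in_links) adjacency lists are available; the expansion phase of Steps 6-9 is omitted for brevity, and the seed is taken in arbitrary rather than random order:

from collections import deque

def extract_patches(out_links, in_links):
    color = {v: 'black' for v in out_links}          # black = unexplored
    red_patches = []
    for seed in out_links:                           # Step 1: pick a black vertex
        if color[seed] != 'black':
            continue
        patch, queue = set(), deque([seed])          # Step 2: reverse BFS over in-links
        while queue:
            v = queue.popleft()
            if color[v] != 'black':
                continue
            color[v] = 'red'
            patch.add(v)
            queue.extend(in_links.get(v, []))
        red_patches.append(patch)                    # Step 3: label the red patch
        peripheral = [v for v in patch               # vertices with edges leaving the patch
                      if any(p not in patch for p in out_links[v])]
        queue = deque(p for v in peripheral          # Step 4: forward BFS from peripherals
                      for p in out_links[v] if p not in patch)
        while queue:
            v = queue.popleft()
            if color[v] != 'black':                  # stop at already colored vertices
                continue
            color[v] = 'yellow'
            queue.extend(out_links[v])
        # Step 5: the outer loop repeats until no black vertices remain.
    yellow = [v for v, c in color.items() if c == 'yellow']
    return red_patches, yellow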

4.2 Evolving Graphs

In this section, we describe the incremental algorithm to compute PageRank. The initial step is to read the graph at a new time instance and determine the vertices that have changed. This does not require additional time, as it can be done while reading the new graph. Thus, after reading the graph, we can assume that we are given two sets of vertices: one containing the vertices which have changed since the previous time instance and the other containing the vertices that have remained unchanged. Hence, the input to the algorithm is the graph G and the two lists Vc and Vu. The outline of the algorithm is shown in Figure 5. We will now briefly describe each step in the algorithm. We start by initializing a list VQ. Recall that a change in a vertex induces a change in the PageRank distribution of all its children, and all such changed vertices are available to us in the queue Vc. A simple traversal method is used to extend this list of "changed vertices" such that it also includes all descendants of the initial list of "changed vertices". All of these vertices are pushed into the list VQ. For the remaining vertices there is no change in their PageRank distribution; their new PageRank is simply obtained by scaling the previous PageRank scores.

IPR(G, Vu, Vc):
Step 1 – Initialize the list VQ.
Step 2 – Pop a vertex N from Vc.
  2.1 – For each child of N: if the child is in the list Vu, remove it from Vu and push it into Vc.
  2.2 – Push N into VQ. Repeat Step 2 until the queue Vc is empty.
Step 3 – For each element in the list Vu:
  3.1 – Scale the element's previous PageRank value to obtain its new PageRank value.
  3.2 – If any child of the element belongs to VQ, remove the element from Vu and copy it into Vb.
Step 4 – Scale the border nodes in Vb to preserve the stochastic property. Perform the original PageRank on VQ ∪ Vb.

Figure 5. Incremental PageRank Algorithm

The scaling factor is simply n(G′)/n(G), i.e. the order of the graph at the previous time instance divided by the order of the graph at the present time instance. Also note that all those vertices from the set of unchanged vertices that point to a changed vertex will influence the PageRank value of that changed vertex; hence these too must be included, as their PageRank scores will be required for computing the PageRank scores of the changed vertices (these form the border list Vb in Figure 5). We then perform the original PageRank computation, along with steps to ensure the stochastic property of the transition matrix, on the vertices in VQ (i.e. the vertices which have changed) to obtain their new PageRank values. Thus, we end up localizing the change to a certain subgraph of the Web which includes all vertices whose PageRank values are affected by the structural changes in the graph, and the basic PageRank algorithm is performed only on this changed subgraph. The PageRank value for the rest of the vertices is simply a matter of scaling the previous values.

Step 2 has a cost of E′, where E′ = E_Q − E_Part is the number of edges in the partition Q (the changed portion). The PageRank values for the partition P (the unchanged portion) are obtained by scaling the PageRank values with respect to the ranks at the previous time instance; this step requires a cost of V′′, where V′′ is the number of vertices in partition P. Using these scaled values and the naïve approach, PageRank for the vertices in partition Q is then calculated. This step (including that required to scale the border vertices) requires a cost of nE + E′ + Vb, where n is the number of iterations required for the PageRank values to converge and E′ is again the number of edges in partition Q. Thus, the total cost for incremental PageRank can be summed up as O(2E′ + V′′ + nE + Vb).
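A sketch of the procedure of Figure 5 in code is given below. The graph representation, the function name and the simple n(G′)/n(G) rescaling of unchanged scores follow the description above; the exact adjustment used to keep the transition matrix stochastic for the border vertices is not spelled out in the paper's outline and is omitted here.

from collections import deque

def incremental_pagerank(out_links, prev_pr, changed, n_prev, d=0.1, tol=1e-8):
    n = len(out_links)
    scale = n_prev / n                               # scaling factor n(G')/n(G)

    # Step 2: extend the changed set Vc to all of its descendants, giving VQ.
    vq, queue = set(), deque(changed)
    while queue:
        v = queue.popleft()
        if v in vq:
            continue
        vq.add(v)
        queue.extend(c for c in out_links.get(v, []) if c not in vq)

    # Step 3: unchanged vertices keep their rescaled previous scores.
    unchanged = set(out_links) - vq
    scores = {v: prev_pr.get(v, 1.0 / n) * scale for v in unchanged}

    # Step 4: run the original PageRank iteration only on VQ; only unchanged
    # vertices pointing into VQ (the border set Vb of Figure 5) are read here,
    # and their scores are treated as fixed contributions.
    in_links = {p: [] for p in vq}
    for q, targets in out_links.items():
        for p in targets:
            if p in in_links:
                in_links[p].append(q)
    pr = {v: 1.0 / n for v in vq}
    for _ in range(100):
        new = {p: d / n + (1 - d) * sum(
                   (pr[q] if q in pr else scores[q]) / len(out_links[q])
                   for q in in_links[p])
               for p in vq}
        if sum(abs(new[v] - pr[v]) for v in vq) < tol:
            pr = new
            break
        pr = new
    scores.update(pr)
    return scores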

5. EXPERIMENTS AND RESULTS

In this section we present results for our graph partitioning scheme for a static single graph as well as for the incremental PageRank computation for an evolving graph. For the graph partitioning scheme we present results on the link graph for the http://www.cs.umn.edu website as crawled on July 19th, 2004. The graph contains a total of 52183 edges and 12209 vertices. Prior to partitioning, dangling vertices were taken care of by adding self-loops to each of them. The various statistics for patch extraction are presented in Table 1. The center column (labeled 'Number') provides the actual number of vertices/edges and the next column (labeled 'Percentage') provides the same information as a percentage of the total number of vertices/edges in the graph. In spite of using a website link graph, which one expects to be denser than the Web graph, and a naïve and suboptimal patch extraction algorithm, we still get a significant portion of the graph (63.5% of the edges) in the form of red patches, with the largest patch containing 16.94% of the total edges. Note that the sizes in 'number of edges' are more important than the ones in 'number of vertices', as the PageRank algorithm is O(E). The actual runtime of the patch extraction algorithm was 37 ms; charging this cost to the divide and conquer approach would reduce its speedup to 1.57 times. However, the patch extraction is not an optimal implementation, and an improved patch extraction could further improve the efficiency. One might also perform patch extraction during the crawling procedure.
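As a quick check of that figure against the runtimes reported in Table 2 below (38 ms for the divide and conquer PageRank, 118 ms for the original PageRank):

# Speedup once the 37 ms patch-extraction overhead is charged to the DC approach.
dc_ms, original_ms, extraction_ms = 38, 118, 37      # values from the text and Table 2
print(original_ms / (dc_ms + extraction_ms))         # ~1.57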

Table 1. Statistics for Patch Extraction (Number of Red Patches = 326)

                                   Number     Percentage
Total Edges in all Red Patches     33144      63.5 %
Total Edges in Yellow Patch        19039      36.5 %
Vertices in Red Patches             8699      71.25 %
Vertices in Yellow Patch            3510      28.75 %
Edges in Largest Red Patch          8839      16.94 %
Vertices in Largest Red Patch        552      4.5 %

The parallel PageRank computations for a static single graph were again run on the link graph for http://www.cs.umn.edu as crawled on July 19th, 2004. Due to the unavailability of a parallel platform for carrying out our experiments, we simulate the results of a parallel execution by using the maximum of the runtimes of the PageRank computations over all red patches in place of the runtime of computing the PageRank of all red patches in parallel. This is a valid assumption as there is no process intercommunication. Any variant of the PageRank algorithm can be used with our partitioning scheme; for our experiments we chose to use the original, naïve PageRank algorithm. Table 2 summarizes the runtime results for the parallel PageRank computation compared with the original PageRank. We used a convergence threshold of 10^-8 and a dampening factor of 0.1 for all PageRank computations. To examine the experimental accuracy we computed the L1-norm of the difference between the PageRank score vectors returned by the two methods, which was found to be 4.4 x 10^-4. The small error could arise due to numerical computing issues, and may also depend on the number of iterations and the convergence rate when computing over the whole graph versus computing to convergence on a small subgraph.

Table 2. Divide and Conquer Approach versus Naïve Approach

Operation                      Average Run Time (in ms)
Largest Red Patch PageRank     20
Yellow Patch PageRank          17
DC PageRank                    38
Original PageRank              118

The DC approach to PageRank ran 3.1 times faster than the original PageRank. From Table 2 we can see that our proposed parallel PageRank method ran about 3.1 times faster than the original PageRank. Again we point out that this is in spite of using a sub-optimal partitioning algorithm. The results clearly indicate that a combination of any PageRank algorithm and our partitioning

scheme is more efficient than that same PageRank algorithm by itself. Along with the actual runtimes of the parallel and original PageRank, Table 2 also shows the actual runtimes for the PageRank computation of the largest red patch as well as the yellow patch.

To test our incremental PageRank approach, we performed experiments on two different websites: the Computer Science website (http://www.cs.umn.edu) and the Institute of Technology website (http://www.it.umn.edu) at the University of Minnesota. We performed the experiments at different time intervals to study the change and the effect of the incremental computation. For the Computer Science website our analysis was done at time intervals of two days, eight days and ten days. A time interval of two days was used for the Institute of Technology website. In our experiments we also simulated focused crawling by excluding Web pages that have very low PageRank from our graph construction and PageRank computation. This was to emulate the real world scenario where not all pages are crawled; we wanted to analyze how the incremental approach performs when pages with low PageRank are not crawled. We used the following approximate measure to compare the computational costs of our method versus the naïve method:

Number of Times Faster = Number of Iterations(PR) / (1 + (fraction of changed portion) x Number of Iterations(IPR))

The intuition behind the measure is how fast the convergence threshold will be reached computing PageRank incrementally versus computing PageRank naïvely for the whole graph. The convergence threshold chosen in our experiments was 1 x 10^-8. The experimental results are presented in Table 3. These results are from actual experiments conducted on the Computer Science and Institute of Technology websites. For the Computer Science website, in the first time interval of eight days, there was a significant change in the structure of the website: about 60% of the pages had changed their link structure. We found that such a sea change occurred because a whole subgraph that contained the documentation for Matlab help was removed. The incremental approach nevertheless performed 1.86 times faster than the naïve PageRank. Similarly, for a period of ten days the incremental approach performed around 1.75 times faster, and for a period of two days the improvement was 8.65 times. These results are for the case of an unfocused crawl. The results for the focused crawl of the CS website were better: when the time interval was eight days the improvement was 1.9 times, and when the time interval was ten days the improvement was 1.76 times. For a period of two days the improvement with focused crawling was 9.88 times. This suggests that focused crawling can also improve the computational costs of the incremental algorithm. The Institute of Technology website typifies a website that does not change often; no change in the Web structure was detected over a period of two days. Since there was no change detected, there was no need to compute the PageRank for the graph at the new time instance, and by our measure the incremental approach was 11 times faster. Since there was no change in the graph structure, the improvements for the cases of focused and unfocused crawling remain the same.
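As a worked example of this measure, take the July 27th vs. July 29th comparison for the unfocused crawl of the Computer Science website in Table 3 (5.56966% change, 9 incremental iterations, 13 naive iterations):

# Number of Times Faster = iterations(PR) / (1 + fraction_changed * iterations(IPR))
pr_iters, ipr_iters, fraction_changed = 13, 9, 0.0556966   # values from Table 3
print(pr_iters / (1 + fraction_changed * ipr_iters))       # ~8.66, matching Table 3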

Table 3. Comparison of results for Incremental PageRank Algorithm versus Naïve PageRank Algorithm

Computer Science Website, Focused Crawl
Comparison            % change    IPR iterations   PR iterations   L1-norm        Times faster
July 19 vs July 27    53.1429     10               12              4.38609e-05    1.900538
July 27 vs July 29     5.25071     6               13              1.60988e-07    9.885481
July 19 vs July 29    58.3493     10               12              4.38692e-05    1.755669

Computer Science Website, Unfocused Crawl
Comparison            % change    IPR iterations   PR iterations   L1-norm        Times faster
July 19 vs July 27    60.2997      9               12              1.70552e-07    1.867123
July 27 vs July 29     5.56966     9               13              1.51747e-07    8.659162
July 19 vs July 29    65.0586      9               12              1.60377e-07    1.749526

Institute of Technology Website, Unfocused/Focused Crawl
Comparison            % change    IPR iterations   PR iterations   L1-norm        Times faster
July 30 vs Aug 1       0           0               11              8.15708e-07    11

6. CONCLUSIONS AND FUTURE DIRECTIONS

In this work, we have followed a 'divide and conquer' approach to partition a graph in a scheme that enables efficient computation of measures and metrics based on a first order Markov model. We present a theoretical framework to show how such a partitioning scheme can be used to divide the problem into smaller problems. We extend this approach to another important dimension, evolving graphs, and show how such an approach can significantly improve the efficiency of computation on large evolving graphs. Our experimental results also show that such an approach, even using naïve algorithms, improves the computation time by a significant margin. This approach also leads to other areas of interest and research directions for further optimizing the computation. As discussed earlier, the two key issues that need to be considered are the definition of an optimal partition and an algorithm to obtain such an optimal partitioning. Another emerging challenge is the class of pages, such as Wikipedia pages and blogs, that change very dynamically. While it is still not clear whether PageRank is the right metric for such changing pages, it should be noted that certain pages change more dynamically than others. Since different portions of the Web change at different rates, it is challenging to keep the PageRank scores updated while optimizing the computation frequency. The study of efficient computation for the changing Web is thus a challenging problem with a large scope for research addressing these various issues.

7. ACKNOWLEDGMENTS

This work was supported by Army High Performance Computing Research Center contract number DAAD19-01-2-0014. The content of the work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

8. REFERENCES

[1] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Technologies, January 1998.
[2] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, 1998.
[3] P. Desikan, J. Srivastava, V. Kumar, P.-N. Tan, "Hyperlink Analysis – Techniques & Applications", Army High Performance Computing Research Center Technical Report, 2002.
[4] T. Haveliwala, "Topic-Sensitive PageRank", in Proceedings of the 11th International WWW Conference, May 2002.
[5] D. Padmanabhan, P. Desikan, J. Srivastava and K. Riaz, "WICER: A Weighted Inter-Cluster Edge Ranking for Clustered Graphs", in Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005) and IAT 2005.
[6] T. Haveliwala, "Efficient Computation of PageRank", Stanford University Technical Report, September 1999.
[7] Y. Chiang, M. Goodrich, E. Grove, R. Tamassia, D. Vengroff, and J. Vitter, "External-Memory Graph Algorithms", in Proc. of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1995.
[8] Y. Chen, Q. Gan, and T. Suel, "I/O-Efficient Techniques for Computing PageRank", in Proc. of the 11th International Conference on Information and Knowledge Management, pages 549-557, November 2002.
[9] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank Computation and the Structure of the Web: Experiments and Algorithms", poster at the 11th International World Wide Web Conference, May 2002.
[10] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, "Extrapolation Methods for Accelerating PageRank Computations", in Proceedings of the 12th International WWW Conference, May 2003.
[11] P. Berkhin, "A Survey on PageRank Computing", Internet Mathematics, Vol. 2, Issue 1, 2005.
[12] A. N. Langville and C. D. Meyer, "Deeper Inside PageRank", Internet Mathematics, 1 (2003-4), 335-380.
[13] S. Chien, C. Dwork, R. Kumar, D. Sivakumar, D. Simon, "Link Evolution: Analysis and Algorithms", First Workshop on Algorithms and Models for the Web-Graph, 2002.
[14] Google, http://www.google.com
[15] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, "Exploiting the Block Structure of the Web for Computing PageRank", Preprint, March 2003.
[16] G. Jeh and J. Widom, "Scaling Personalized Web Search", in Proceedings of the 12th International World Wide Web Conference, 2003.
[17] N. Eiron, K. McCurley, J. Tomlin, "Ranking the Web Frontier", in Proc. of the 13th Conference on World Wide Web, ACM Press (2004), 309-318.
[18] P. Desikan, N. Pathak, J. Srivastava and V. Kumar, "Incremental Page Rank Computation on Evolving Graphs", in Proc. of the 14th International World Wide Web Conference, May 10-14, 2005.
