Saturday, January 23, 2010

Iterative process

Recently, I have been working on some problems that can be solved by iterative processes. Since I am not very familiar with this type of logic, I have run into many difficulties understanding it, and when a problem comes up, I often don't know the key reason behind it, much less the solution.

Actually, several of the iterative processes I am working on these days are very powerful: computing Google's PageRank, community detection in social networks, random walks for generating YouTube's suggestion list, non-negative matrix factorization applied to face recognition, text mining, graph clustering, and so on.

Take PageRank as an example; there is an excellent introduction and explanation of the PageRank technique here.

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That is the equation for computing page A's PageRank. PR(t1) is the PageRank of web page t1, and C(t1) is page t1's total number of outbound links. t1...tn are the pages linking to page A. d is a damping factor, usually set to 0.85.

So, how can we compute the PageRank? Consider two pages A and B that link to each other. Then PR(A) depends on PR(B), and on the other hand, PR(B) depends on PR(A) too.

PR(A) = (1-d) + d(PR(B)/C(B))
PR(B) = (1-d) + d(PR(A)/C(A))

This is a catch-22 situation: the PageRank of A depends on an accurate PageRank of B, and the PageRank of B depends on an accurate PageRank of A, but initially we only have inaccurate values for both. That is why the iterative process comes in: we repeatedly recompute A and B until their values converge. By convergence, I mean that the variation between this iteration's result and the last iteration's result becomes small; we can always define some threshold for it.
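To make the iteration concrete, here is a minimal sketch in Java, assuming the simplest link pattern where A and B link only to each other (so C(A) = C(B) = 1); the starting values and the threshold are arbitrary choices of mine:

public class TwoPagePageRank {
    public static void main(String[] args) {
        final double d = 0.85;         // damping factor
        final double threshold = 1e-6; // convergence threshold (an arbitrary choice)
        // A and B link only to each other, so C(A) = C(B) = 1.
        double prA = 1.0, prB = 1.0;   // inaccurate initial values
        double delta = Double.MAX_VALUE;
        int iterations = 0;
        while (delta > threshold) {
            // This iteration's values are computed from the last iteration's values.
            double newA = (1 - d) + d * (prB / 1.0);
            double newB = (1 - d) + d * (prA / 1.0);
            delta = Math.abs(newA - prA) + Math.abs(newB - prB);
            prA = newA;
            prB = newB;
            iterations++;
        }
        System.out.printf("PR(A)=%.6f, PR(B)=%.6f after %d iterations%n", prA, prB, iterations);
    }
}

In this symmetric case both values converge to 1.0, but the same loop works for any starting values: the differences between successive iterations shrink until they fall below the threshold.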

I am thinking about how to compute PageRank on MapReduce.

For example, here is a simple link graph with the initial PageRank of each node: three pages A, B, and C, where A links to B and C, B links to A, and C links to A and B. First, we use MapReduce to parse the link graph and output the total number of outbound links of each node. Then we compute PageRank iteratively until convergence. The details are as follows; the key/value format is written as <key, value>.

Map1
Input: link graph data
parsing
Output: <source, destination> for each link: <A, B>, <A, C>, <B, A>, <C, A>, <C, B>
Reduce1
Input: Map1's output
for each key, counting outbound links
Output: <source, (C(source), destination list)>: <A, (2, B C)>, <B, (1, A)>, <C, (2, A B)>

repeat:
Map2
Input: <source, (PR(source), C(source), destination list)>: <A, (PR(A), 2, B C)>, <B, (PR(B), 1, A)>, <C, (PR(C), 2, A B)>

combining the information: link destination, link source, source's PageRank, number of source's outbound links
Output: <destination, (source, PR(source), C(source))> for each link, plus <source, destination list> to carry the graph structure into the next iteration

Reduce2
Input: Map2's output
updating the PageRank for each key (damping factor omitted for brevity):
PR(B) = PR(A)/2 + PR(C)/2, PR(C) = PR(A)/2, PR(A) = PR(B)/1 + PR(C)/2
Output: <page, (new PR(page), C(page), destination list)>
until convergence
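Here is a rough sketch of what Map2 and Reduce2 might look like with Hadoop's Java API. The input line format ("page PR C dest1 dest2 ...") and the LINKS marker are assumptions of mine for illustration; the reducer applies the damping factor from the equation above:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One PageRank iteration (Map2/Reduce2). Input lines are assumed to look like
// "A 1.0 2 B C" -- page, current PageRank, outbound-link count, destinations.
public class PageRankIteration {

    public static class Map2 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\s+");
            String page = parts[0];
            double pr = Double.parseDouble(parts[1]);
            int outCount = Integer.parseInt(parts[2]);
            StringBuilder links = new StringBuilder("LINKS");
            for (int i = 3; i < parts.length; i++) {
                // Each destination receives this page's contribution PR/C.
                context.write(new Text(parts[i]), new Text(Double.toString(pr / outCount)));
                links.append(' ').append(parts[i]);
            }
            // Re-emit the graph structure so the next iteration keeps the same format.
            context.write(new Text(page), new Text(links.toString()));
        }
    }

    public static class Reduce2 extends Reducer<Text, Text, Text, Text> {
        private static final double D = 0.85; // damping factor

        @Override
        protected void reduce(Text page, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("LINKS")) {
                    links = v.substring(5).trim(); // recover the destination list
                } else {
                    sum += Double.parseDouble(v);  // PR(source)/C(source)
                }
            }
            double newPr = (1 - D) + D * sum;      // PR = (1-d) + d * (sum of contributions)
            int outCount = links.isEmpty() ? 0 : links.split("\\s+").length;
            context.write(page, new Text(newPr + " " + outCount + " " + links));
        }
    }
}

A driver would run this job repeatedly, feeding each iteration's output back in as the next iteration's input, until the PageRank values stop changing by more than the threshold.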

There is also a MapReduce implementation for computing PageRank; please see here.

Now for something a bit off-topic. Yesterday, my advisor in China wanted to apply MapReduce to a rather simple information management system with really simple computation logic (so simple that all the data can be held on a single machine, and all the computation for a request can be done in milliseconds). Instead of handling jobs with heavy computation, they just want to improve the system's scalability when a huge number of connections arrives.

At the beginning, I thought this was not feasible, and that even if it were possible, the performance would suffer. But another guy gave me a hint: build a relay server and design a scheduler to distribute these connections to other servers. I guess this is MapReduce's power; actually, many problems can be handled by MapReduce.

In this particular case, we can set up 1 mapper and n reducers, with the single mapper acting as the relay server. It parses the incoming connection session information and outputs each request session's initialization time as the key. Then we implement our own Partitioner class to distribute these sessions based on initialization time. Instead of the usual hashing, where elements with the same key go to the same destination, we need a different scheme: elements with the same or nearly the same initialization time should go to different destinations. The reducers then act as the processing workers that handle the requests. In this way, when a large number of requests arrive in a short period, they are distributed across multiple worker nodes, as sketched below.
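A minimal sketch of such a Partitioner, assuming the map output key is the session's initialization time as a LongWritable (the class name and the round-robin rule are hypothetical, just to illustrate the point):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Unlike the default hash partitioner, which sends equal keys to the same
// reducer, this spreads sessions with the same or nearby initialization
// times across different reducers, round-robin over the timestamp.
public class SpreadingPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable initTime, Text session, int numReducers) {
        // Consecutive timestamps land on consecutive reducers, so a burst of
        // near-simultaneous requests is fanned out across all workers.
        return (int) Math.floorMod(initTime.get(), (long) numReducers);
    }
}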

In fact, MapReduce will not improve the overall performance beyond what a simple relay server with a scheduler could achieve, but this widened my understanding of what MapReduce is.


