Wednesday, December 22, 2010
Friday, December 3, 2010
Tuesday, August 10, 2010
P NP, what is that?
Today, a heavyweight news "p vs np" finally been solved. To feel ashamed, as a computer science phd student, I don't even know what is p and np. So, here I would like to review what is p and np. The following is from http://www.matrix67.com/blog/archives/105
什么是P问题、NP问题和NPC问题
这或许是众多OIer最大的误区之一。
你会经常看到网上出现“这怎么做,这不是NP问题吗”、“这个只有搜了,这已经被证明是NP问题了”之类的话。你要知道,大多数人此时所说的NP问题其实都是指的NPC问题。他们没有搞清楚NP问题和NPC问题的概念。NP问题并不是那种“只有搜才行”的问题,NPC问题才是。好,行了,基本上这个误解已经被澄清了。下面的内容都是在讲什么是P问题,什么是NP问题,什么是NPC问题,你如果不是很感兴趣就可以不看了。接下来你可以看到,把NP问题当成是 NPC问题是一个多大的错误。
还是先用几句话简单说明一下时间复杂度。时间复杂度并不是表示一个程序解决问题需要花多少时间,而是当问题规模扩大后,程序需要的时间长度增长得有多快。也就是说,对于高速处理数据的计算机来说,处理某一个特定数据的效率不能衡量一个程序的好坏,而应该看当这个数据的规模变大到数百倍后,程序运行时间是否还是一样,或者也跟着慢了数百倍,或者变慢了数万倍。不管数据有多大,程序处理花的时间始终是那么多的,我们就说这个程序很好,具有O(1)的时间复杂度,也称常数级复杂度;数据规模变得有多大,花的时间也跟着变得有多长,这个程序的时间复杂度就是O(n),比如找n个数中的最大值;而像冒泡排序、插入排序等,数据扩大2倍,时间变慢4倍的,属于O(n^2)的复杂度。还有一些穷举类的算法,所需时间长度成几何阶数上涨,这就是O(a^n)的指数级复杂度,甚至O(n!)的阶乘级复杂度。不会存在O(2*n^2)的复杂度,因为前面的那个“2”是系数,根本不会影响到整个程序的时间增长。同样地,O (n^3+n^2)的复杂度也就是O(n^3)的复杂度。因此,我们会说,一个O(0.01*n^3)的程序的效率比O(100*n^2)的效率低,尽管在n很小的时候,前者优于后者,但后者时间随数据规模增长得慢,最终O(n^3)的复杂度将远远超过O(n^2)。我们也说,O(n^100)的复杂度小于O(1.01^n)的复杂度。
容易看出,前面的几类复杂度被分为两种级别,其中后者的复杂度无论如何都远远大于前者:一种是O(1),O(log(n)),O(n^a)等,我们把它叫做多项式级的复杂度,因为它的规模n出现在底数的位置;另一种是O(a^n)和O(n!)型复杂度,它是非多项式级的,其复杂度计算机往往不能承受。当我们在解决一个问题时,我们选择的算法通常都需要是多项式级的复杂度,非多项式级的复杂度需要的时间太多,往往会超时,除非是数据规模非常小。
自然地,人们会想到一个问题:会不会所有的问题都可以找到复杂度为多项式级的算法呢?很遗憾,答案是否定的。有些问题甚至根本不可能找到一个正确的算法来,这称之为“不可解问题”(Undecidable Decision Problem)。The Halting Problem就是一个著名的不可解问题,在我的Blog上有过专门的介绍和证明。再比如,输出从1到n这n个数的全排列。不管你用什么方法,你的复杂度都是阶乘级,因为你总得用阶乘级的时间打印出结果来。有人说,这样的“问题”不是一个“正规”的问题,正规的问题是让程序解决一个问题,输出一个“YES”或“NO”(这被称为判定性问题),或者一个什么什么的最优值(这被称为最优化问题)。那么,根据这个定义,我也能举出一个不大可能会有多项式级算法的问题来:Hamilton回路。问题是这样的:给你一个图,问你能否找到一条经过每个顶点一次且恰好一次(不遗漏也不重复)最后又走回来的路(满足这个条件的路径叫做Hamilton回路)。这个问题现在还没有找到多项式级的算法。事实上,这个问题就是我们后面要说的NPC问题。
下面引入P类问题的概念:如果一个问题可以找到一个能在多项式的时间里解决它的算法,那么这个问题就属于P问题。P是英文单词多项式的第一个字母。哪些问题是P类问题呢?通常NOI和NOIP不会出不属于P类问题的题目。我们常见到的一些信息奥赛的题目都是P问题。道理很简单,一个用穷举换来的非多项式级时间的超时程序不会涵盖任何有价值的算法。
接下来引入NP问题的概念。这个就有点难理解了,或者说容易理解错误。在这里强调(回到我竭力想澄清的误区上),NP问题不是非P类问题。NP问题是指可以在多项式的时间里验证一个解的问题。NP问题的另一个定义是,可以在多项式的时间里猜出一个解的问题。比方说,我RP很好,在程序中需要枚举时,我可以一猜一个准。现在某人拿到了一个求最短路径的问题,问从起点到终点是否有一条小于100个单位长度的路线。它根据数据画好了图,但怎么也算不出来,于是来问我:你看怎么选条路走得最少?我说,我RP很好,肯定能随便给你指条很短的路出来。然后我就胡乱画了几条线,说就这条吧。那人按我指的这条把权值加起来一看,嘿,神了,路径长度98,比100小。于是答案出来了,存在比100小的路径。别人会问他这题怎么做出来的,他就可以说,因为我找到了一个比100 小的解。在这个题中,找一个解很困难,但验证一个解很容易。验证一个解只需要O(n)的时间复杂度,也就是说我可以花O(n)的时间把我猜的路径的长度加出来。那么,只要我RP好,猜得准,我一定能在多项式的时间里解决这个问题。我猜到的方案总是最优的,不满足题意的方案也不会来骗我去选它。这就是NP问题。当然有不是NP问题的问题,即你猜到了解但是没用,因为你不能在多项式的时间里去验证它。下面我要举的例子是一个经典的例子,它指出了一个目前还没有办法在多项式的时间里验证一个解的问题。很显然,前面所说的Hamilton回路是NP问题,因为验证一条路是否恰好经过了每一个顶点非常容易。但我要把问题换成这样:试问一个图中是否不存在Hamilton回路。这样问题就没法在多项式的时间里进行验证了,因为除非你试过所有的路,否则你不敢断定它“没有Hamilton回路”。
之所以要定义NP问题,是因为通常只有NP问题才可能找到多项式的算法。我们不会指望一个连多项式地验证一个解都不行的问题存在一个解决它的多项式级的算法。相信读者很快明白,信息学中的号称最困难的问题——“NP问题”,实际上是在探讨NP问题与P类问题的关系。
很显然,所有的P类问题都是NP问题。也就是说,能多项式地解决一个问题,必然能多项式地验证一个问题的解——既然正解都出来了,验证任意给定的解也只需要比较一下就可以了。关键是,人们想知道,是否所有的NP问题都是P类问题。我们可以再用集合的观点来说明。如果把所有P类问题归为一个集合P中,把所有 NP问题划进另一个集合NP中,那么,显然有P属于NP。现在,所有对NP问题的研究都集中在一个问题上,即究竟是否有P=NP?通常所谓的“NP问题”,其实就一句话:证明或推翻P=NP。
NP问题一直都是信息学的巅峰。巅峰,意即很引人注目但难以解决。在信息学研究中,这是一个耗费了很多时间和精力也没有解决的终极问题,好比物理学中的大统一和数学中的歌德巴赫猜想等。
目前为止这个问题还“啃不动”。但是,一个总的趋势、一个大方向是有的。人们普遍认为,P=NP不成立,也就是说,多数人相信,存在至少一个不可能有多项式级复杂度的算法的NP问题。人们如此坚信P≠NP是有原因的,就是在研究NP问题的过程中找出了一类非常特殊的NP问题叫做NP-完全问题,也即所谓的 NPC问题。C是英文单词“完全”的第一个字母。正是NPC问题的存在,使人们相信P≠NP。下文将花大量篇幅介绍NPC问题,你从中可以体会到NPC问题使P=NP变得多么不可思议。
为了说明NPC问题,我们先引入一个概念——约化(Reducibility,有的资料上叫“归约”)。
简单地说,一个问题A可以约化为问题B的含义即是,可以用问题B的解法解决问题A,或者说,问题A可以“变成”问题B。《算法导论》上举了这么一个例子。比如说,现在有两个问题:求解一个一元一次方程和求解一个一元二次方程。那么我们说,前者可以约化为后者,意即知道如何解一个一元二次方程那么一定能解出一元一次方程。我们可以写出两个程序分别对应两个问题,那么我们能找到一个“规则”,按照这个规则把解一元一次方程程序的输入数据变一下,用在解一元二次方程的程序上,两个程序总能得到一样的结果。这个规则即是:两个方程的对应项系数不变,一元二次方程的二次项系数为0。按照这个规则把前一个问题转换成后一个问题,两个问题就等价了。同样地,我们可以说,Hamilton回路可以约化为TSP问题(Travelling Salesman Problem,旅行商问题):在Hamilton回路问题中,两点相连即这两点距离为0,两点不直接相连则令其距离为1,于是问题转化为在TSP问题中,是否存在一条长为0的路径。Hamilton回路存在当且仅当TSP问题中存在长为0的回路。
“问题A可约化为问题B”有一个重要的直观意义:B的时间复杂度高于或者等于A的时间复杂度。也就是说,问题A不比问题B难。这很容易理解。既然问题A能用问题B来解决,倘若B的时间复杂度比A的时间复杂度还低了,那A的算法就可以改进为B的算法,两者的时间复杂度还是相同。正如解一元二次方程比解一元一次方程难,因为解决前者的方法可以用来解决后者。
很显然,约化具有一项重要的性质:约化具有传递性。如果问题A可约化为问题B,问题B可约化为问题C,则问题A一定可约化为问题C。这个道理非常简单,就不必阐述了。
现在再来说一下约化的标准概念就不难理解了:如果能找到这样一个变化法则,对任意一个程序A的输入,都能按这个法则变换成程序B的输入,使两程序的输出相同,那么我们说,问题A可约化为问题B。
当然,我们所说的“可约化”是指的可“多项式地”约化(Polynomial-time Reducible),即变换输入的方法是能在多项式的时间里完成的。约化的过程只有用多项式的时间完成才有意义。
好了,从约化的定义中我们看到,一个问题约化为另一个问题,时间复杂度增加了,问题的应用范围也增大了。通过对某些问题的不断约化,我们能够不断寻找复杂度更高,但应用范围更广的算法来代替复杂度虽然低,但只能用于很小的一类问题的算法。再回想前面讲的P和NP问题,联想起约化的传递性,自然地,我们会想问,如果不断地约化上去,不断找到能“通吃”若干小NP问题的一个稍复杂的大NP问题,那么最后是否有可能找到一个时间复杂度最高,并且能“通吃”所有的 NP问题的这样一个超级NP问题?答案居然是肯定的。也就是说,存在这样一个NP问题,所有的NP问题都可以约化成它。换句话说,只要解决了这个问题,那么所有的NP问题都解决了。这种问题的存在难以置信,并且更加不可思议的是,这种问题不只一个,它有很多个,它是一类问题。这一类问题就是传说中的NPC 问题,也就是NP-完全问题。NPC问题的出现使整个NP问题的研究得到了飞跃式的发展。我们有理由相信,NPC问题是最复杂的问题。再次回到全文开头,我们可以看到,人们想表达一个问题不存在多项式的高效算法时应该说它“属于NPC问题”。此时,我的目的终于达到了,我已经把NP问题和NPC问题区别开了。到此为止,本文已经写了近5000字了,我佩服你还能看到这里来,同时也佩服一下自己能写到这里来。
NPC问题的定义非常简单。同时满足下面两个条件的问题就是NPC问题。首先,它得是一个NP问题;然后,所有的NP问题都可以约化到它。证明一个问题是 NPC问题也很简单。先证明它至少是一个NP问题,再证明其中一个已知的NPC问题能约化到它(由约化的传递性,则NPC问题定义的第二条也得以满足;至于第一个NPC问题是怎么来的,下文将介绍),这样就可以说它是NPC问题了。
既然所有的NP问题都能约化成NPC问题,那么只要任意一个NPC问题找到了一个多项式的算法,那么所有的NP问题都能用这个算法解决了,NP也就等于P 了。因此,给NPC找一个多项式算法太不可思议了。因此,前文才说,“正是NPC问题的存在,使人们相信P≠NP”。我们可以就此直观地理解,NPC问题目前没有多项式的有效算法,只能用指数级甚至阶乘级复杂度的搜索。
顺便讲一下NP-Hard问题。NP-Hard问题是这样一种问题,它满足NPC问题定义的第二条但不一定要满足第一条(就是说,NP-Hard问题要比 NPC问题的范围广)。NP-Hard问题同样难以找到多项式的算法,但它不列入我们的研究范围,因为它不一定是NP问题。即使NPC问题发现了多项式级的算法,NP-Hard问题有可能仍然无法得到多项式级的算法。事实上,由于NP-Hard放宽了限定条件,它将有可能比所有的NPC问题的时间复杂度更高从而更难以解决。
不要以为NPC问题是一纸空谈。NPC问题是存在的。确实有这么一个非常具体的问题属于NPC问题。下文即将介绍它。
下文即将介绍逻辑电路问题。这是第一个NPC问题。其它的NPC问题都是由这个问题约化而来的。因此,逻辑电路问题是NPC类问题的“鼻祖”。
逻辑电路问题是指的这样一个问题:给定一个逻辑电路,问是否存在一种输入使输出为True。
什么叫做逻辑电路呢?一个逻辑电路由若干个输入,一个输出,若干“逻辑门”和密密麻麻的线组成。看下面一例,不需要解释你马上就明白了。
┌───┐
│ 输入1├─→┐ ┌──┐
└───┘ └─→┤ │
│ or ├→─┐
┌───┐ ┌─→┤ │ │ ┌──┐
│ 输入2├─→┤ └──┘ └─→┤ │
└───┘ │ ┌─→┤AND ├──→输出
└────────┘┌→┤ │
┌───┐ ┌──┐ │ └──┘
│ 输入3├─→┤ NOT├─→────┘
└───┘ └──┘
这是个较简单的逻辑电路,当输入1、输入2、输入3分别为True、True、False或False、True、False时,输出为True。
有输出无论如何都不可能为True的逻辑电路吗?有。下面就是一个简单的例子。
┌───┐
│输入1 ├→─┐ ┌──┐
└───┘ └─→┤ │
│AND ├─→┐
┌─→┤ │ │
│ └──┘ │ ┌──┐
│ └→┤ │
┌───┐ │ │AND ├─→输出
│输入2 ├→─┤ ┌──┐ ┌→┤ │
└───┘ └→┤NOT ├→──┘ └──┘
└──┘
上面这个逻辑电路中,无论输入是什么,输出都是False。我们就说,这个逻辑电路不存在使输出为True的一组输入。
回到上文,给定一个逻辑电路,问是否存在一种输入使输出为True,这即逻辑电路问题。
逻辑电路问题属于NPC问题。这是有严格证明的。它显然属于NP问题,并且可以直接证明所有的NP问题都可以约化到它(不要以为NP问题有无穷多个将给证明造成不可逾越的困难)。证明过程相当复杂,其大概意思是说任意一个NP问题的输入和输出都可以转换成逻辑电路的输入和输出(想想计算机内部也不过是一些 0和1的运算),因此对于一个NP问题来说,问题转化为了求出满足结果为True的一个输入(即一个可行解)。
有了第一个NPC问题后,一大堆NPC问题就出现了,因为再证明一个新的NPC问题只需要将一个已知的NPC问题约化到它就行了。后来,Hamilton 回路成了NPC问题,TSP问题也成了NPC问题。现在被证明是NPC问题的有很多,任何一个找到了多项式算法的话所有的NP问题都可以完美解决了。因此说,正是因为NPC问题的存在,P=NP变得难以置信。P=NP问题还有许多有趣的东西,有待大家自己进一步的挖掘。攀登这个信息学的巅峰是我们这一代的终极目标。现在我们需要做的,至少是不要把概念弄混淆了。
什么是P问题、NP问题和NPC问题
这或许是众多OIer最大的误区之一。
你会经常看到网上出现“这怎么做,这不是NP问题吗”、“这个只有搜了,这已经被证明是NP问题了”之类的话。你要知道,大多数人此时所说的NP问题其实都是指的NPC问题。他们没有搞清楚NP问题和NPC问题的概念。NP问题并不是那种“只有搜才行”的问题,NPC问题才是。好,行了,基本上这个误解已经被澄清了。下面的内容都是在讲什么是P问题,什么是NP问题,什么是NPC问题,你如果不是很感兴趣就可以不看了。接下来你可以看到,把NP问题当成是 NPC问题是一个多大的错误。
还是先用几句话简单说明一下时间复杂度。时间复杂度并不是表示一个程序解决问题需要花多少时间,而是当问题规模扩大后,程序需要的时间长度增长得有多快。也就是说,对于高速处理数据的计算机来说,处理某一个特定数据的效率不能衡量一个程序的好坏,而应该看当这个数据的规模变大到数百倍后,程序运行时间是否还是一样,或者也跟着慢了数百倍,或者变慢了数万倍。不管数据有多大,程序处理花的时间始终是那么多的,我们就说这个程序很好,具有O(1)的时间复杂度,也称常数级复杂度;数据规模变得有多大,花的时间也跟着变得有多长,这个程序的时间复杂度就是O(n),比如找n个数中的最大值;而像冒泡排序、插入排序等,数据扩大2倍,时间变慢4倍的,属于O(n^2)的复杂度。还有一些穷举类的算法,所需时间长度成几何阶数上涨,这就是O(a^n)的指数级复杂度,甚至O(n!)的阶乘级复杂度。不会存在O(2*n^2)的复杂度,因为前面的那个“2”是系数,根本不会影响到整个程序的时间增长。同样地,O (n^3+n^2)的复杂度也就是O(n^3)的复杂度。因此,我们会说,一个O(0.01*n^3)的程序的效率比O(100*n^2)的效率低,尽管在n很小的时候,前者优于后者,但后者时间随数据规模增长得慢,最终O(n^3)的复杂度将远远超过O(n^2)。我们也说,O(n^100)的复杂度小于O(1.01^n)的复杂度。
容易看出,前面的几类复杂度被分为两种级别,其中后者的复杂度无论如何都远远大于前者:一种是O(1),O(log(n)),O(n^a)等,我们把它叫做多项式级的复杂度,因为它的规模n出现在底数的位置;另一种是O(a^n)和O(n!)型复杂度,它是非多项式级的,其复杂度计算机往往不能承受。当我们在解决一个问题时,我们选择的算法通常都需要是多项式级的复杂度,非多项式级的复杂度需要的时间太多,往往会超时,除非是数据规模非常小。
自然地,人们会想到一个问题:会不会所有的问题都可以找到复杂度为多项式级的算法呢?很遗憾,答案是否定的。有些问题甚至根本不可能找到一个正确的算法来,这称之为“不可解问题”(Undecidable Decision Problem)。The Halting Problem就是一个著名的不可解问题,在我的Blog上有过专门的介绍和证明。再比如,输出从1到n这n个数的全排列。不管你用什么方法,你的复杂度都是阶乘级,因为你总得用阶乘级的时间打印出结果来。有人说,这样的“问题”不是一个“正规”的问题,正规的问题是让程序解决一个问题,输出一个“YES”或“NO”(这被称为判定性问题),或者一个什么什么的最优值(这被称为最优化问题)。那么,根据这个定义,我也能举出一个不大可能会有多项式级算法的问题来:Hamilton回路。问题是这样的:给你一个图,问你能否找到一条经过每个顶点一次且恰好一次(不遗漏也不重复)最后又走回来的路(满足这个条件的路径叫做Hamilton回路)。这个问题现在还没有找到多项式级的算法。事实上,这个问题就是我们后面要说的NPC问题。
下面引入P类问题的概念:如果一个问题可以找到一个能在多项式的时间里解决它的算法,那么这个问题就属于P问题。P是英文单词多项式的第一个字母。哪些问题是P类问题呢?通常NOI和NOIP不会出不属于P类问题的题目。我们常见到的一些信息奥赛的题目都是P问题。道理很简单,一个用穷举换来的非多项式级时间的超时程序不会涵盖任何有价值的算法。
接下来引入NP问题的概念。这个就有点难理解了,或者说容易理解错误。在这里强调(回到我竭力想澄清的误区上),NP问题不是非P类问题。NP问题是指可以在多项式的时间里验证一个解的问题。NP问题的另一个定义是,可以在多项式的时间里猜出一个解的问题。比方说,我RP很好,在程序中需要枚举时,我可以一猜一个准。现在某人拿到了一个求最短路径的问题,问从起点到终点是否有一条小于100个单位长度的路线。它根据数据画好了图,但怎么也算不出来,于是来问我:你看怎么选条路走得最少?我说,我RP很好,肯定能随便给你指条很短的路出来。然后我就胡乱画了几条线,说就这条吧。那人按我指的这条把权值加起来一看,嘿,神了,路径长度98,比100小。于是答案出来了,存在比100小的路径。别人会问他这题怎么做出来的,他就可以说,因为我找到了一个比100 小的解。在这个题中,找一个解很困难,但验证一个解很容易。验证一个解只需要O(n)的时间复杂度,也就是说我可以花O(n)的时间把我猜的路径的长度加出来。那么,只要我RP好,猜得准,我一定能在多项式的时间里解决这个问题。我猜到的方案总是最优的,不满足题意的方案也不会来骗我去选它。这就是NP问题。当然有不是NP问题的问题,即你猜到了解但是没用,因为你不能在多项式的时间里去验证它。下面我要举的例子是一个经典的例子,它指出了一个目前还没有办法在多项式的时间里验证一个解的问题。很显然,前面所说的Hamilton回路是NP问题,因为验证一条路是否恰好经过了每一个顶点非常容易。但我要把问题换成这样:试问一个图中是否不存在Hamilton回路。这样问题就没法在多项式的时间里进行验证了,因为除非你试过所有的路,否则你不敢断定它“没有Hamilton回路”。
之所以要定义NP问题,是因为通常只有NP问题才可能找到多项式的算法。我们不会指望一个连多项式地验证一个解都不行的问题存在一个解决它的多项式级的算法。相信读者很快明白,信息学中的号称最困难的问题——“NP问题”,实际上是在探讨NP问题与P类问题的关系。
很显然,所有的P类问题都是NP问题。也就是说,能多项式地解决一个问题,必然能多项式地验证一个问题的解——既然正解都出来了,验证任意给定的解也只需要比较一下就可以了。关键是,人们想知道,是否所有的NP问题都是P类问题。我们可以再用集合的观点来说明。如果把所有P类问题归为一个集合P中,把所有 NP问题划进另一个集合NP中,那么,显然有P属于NP。现在,所有对NP问题的研究都集中在一个问题上,即究竟是否有P=NP?通常所谓的“NP问题”,其实就一句话:证明或推翻P=NP。
NP问题一直都是信息学的巅峰。巅峰,意即很引人注目但难以解决。在信息学研究中,这是一个耗费了很多时间和精力也没有解决的终极问题,好比物理学中的大统一和数学中的歌德巴赫猜想等。
目前为止这个问题还“啃不动”。但是,一个总的趋势、一个大方向是有的。人们普遍认为,P=NP不成立,也就是说,多数人相信,存在至少一个不可能有多项式级复杂度的算法的NP问题。人们如此坚信P≠NP是有原因的,就是在研究NP问题的过程中找出了一类非常特殊的NP问题叫做NP-完全问题,也即所谓的 NPC问题。C是英文单词“完全”的第一个字母。正是NPC问题的存在,使人们相信P≠NP。下文将花大量篇幅介绍NPC问题,你从中可以体会到NPC问题使P=NP变得多么不可思议。
为了说明NPC问题,我们先引入一个概念——约化(Reducibility,有的资料上叫“归约”)。
简单地说,一个问题A可以约化为问题B的含义即是,可以用问题B的解法解决问题A,或者说,问题A可以“变成”问题B。《算法导论》上举了这么一个例子。比如说,现在有两个问题:求解一个一元一次方程和求解一个一元二次方程。那么我们说,前者可以约化为后者,意即知道如何解一个一元二次方程那么一定能解出一元一次方程。我们可以写出两个程序分别对应两个问题,那么我们能找到一个“规则”,按照这个规则把解一元一次方程程序的输入数据变一下,用在解一元二次方程的程序上,两个程序总能得到一样的结果。这个规则即是:两个方程的对应项系数不变,一元二次方程的二次项系数为0。按照这个规则把前一个问题转换成后一个问题,两个问题就等价了。同样地,我们可以说,Hamilton回路可以约化为TSP问题(Travelling Salesman Problem,旅行商问题):在Hamilton回路问题中,两点相连即这两点距离为0,两点不直接相连则令其距离为1,于是问题转化为在TSP问题中,是否存在一条长为0的路径。Hamilton回路存在当且仅当TSP问题中存在长为0的回路。
“问题A可约化为问题B”有一个重要的直观意义:B的时间复杂度高于或者等于A的时间复杂度。也就是说,问题A不比问题B难。这很容易理解。既然问题A能用问题B来解决,倘若B的时间复杂度比A的时间复杂度还低了,那A的算法就可以改进为B的算法,两者的时间复杂度还是相同。正如解一元二次方程比解一元一次方程难,因为解决前者的方法可以用来解决后者。
很显然,约化具有一项重要的性质:约化具有传递性。如果问题A可约化为问题B,问题B可约化为问题C,则问题A一定可约化为问题C。这个道理非常简单,就不必阐述了。
现在再来说一下约化的标准概念就不难理解了:如果能找到这样一个变化法则,对任意一个程序A的输入,都能按这个法则变换成程序B的输入,使两程序的输出相同,那么我们说,问题A可约化为问题B。
当然,我们所说的“可约化”是指的可“多项式地”约化(Polynomial-time Reducible),即变换输入的方法是能在多项式的时间里完成的。约化的过程只有用多项式的时间完成才有意义。
好了,从约化的定义中我们看到,一个问题约化为另一个问题,时间复杂度增加了,问题的应用范围也增大了。通过对某些问题的不断约化,我们能够不断寻找复杂度更高,但应用范围更广的算法来代替复杂度虽然低,但只能用于很小的一类问题的算法。再回想前面讲的P和NP问题,联想起约化的传递性,自然地,我们会想问,如果不断地约化上去,不断找到能“通吃”若干小NP问题的一个稍复杂的大NP问题,那么最后是否有可能找到一个时间复杂度最高,并且能“通吃”所有的 NP问题的这样一个超级NP问题?答案居然是肯定的。也就是说,存在这样一个NP问题,所有的NP问题都可以约化成它。换句话说,只要解决了这个问题,那么所有的NP问题都解决了。这种问题的存在难以置信,并且更加不可思议的是,这种问题不只一个,它有很多个,它是一类问题。这一类问题就是传说中的NPC 问题,也就是NP-完全问题。NPC问题的出现使整个NP问题的研究得到了飞跃式的发展。我们有理由相信,NPC问题是最复杂的问题。再次回到全文开头,我们可以看到,人们想表达一个问题不存在多项式的高效算法时应该说它“属于NPC问题”。此时,我的目的终于达到了,我已经把NP问题和NPC问题区别开了。到此为止,本文已经写了近5000字了,我佩服你还能看到这里来,同时也佩服一下自己能写到这里来。
NPC问题的定义非常简单。同时满足下面两个条件的问题就是NPC问题。首先,它得是一个NP问题;然后,所有的NP问题都可以约化到它。证明一个问题是 NPC问题也很简单。先证明它至少是一个NP问题,再证明其中一个已知的NPC问题能约化到它(由约化的传递性,则NPC问题定义的第二条也得以满足;至于第一个NPC问题是怎么来的,下文将介绍),这样就可以说它是NPC问题了。
既然所有的NP问题都能约化成NPC问题,那么只要任意一个NPC问题找到了一个多项式的算法,那么所有的NP问题都能用这个算法解决了,NP也就等于P 了。因此,给NPC找一个多项式算法太不可思议了。因此,前文才说,“正是NPC问题的存在,使人们相信P≠NP”。我们可以就此直观地理解,NPC问题目前没有多项式的有效算法,只能用指数级甚至阶乘级复杂度的搜索。
顺便讲一下NP-Hard问题。NP-Hard问题是这样一种问题,它满足NPC问题定义的第二条但不一定要满足第一条(就是说,NP-Hard问题要比 NPC问题的范围广)。NP-Hard问题同样难以找到多项式的算法,但它不列入我们的研究范围,因为它不一定是NP问题。即使NPC问题发现了多项式级的算法,NP-Hard问题有可能仍然无法得到多项式级的算法。事实上,由于NP-Hard放宽了限定条件,它将有可能比所有的NPC问题的时间复杂度更高从而更难以解决。
不要以为NPC问题是一纸空谈。NPC问题是存在的。确实有这么一个非常具体的问题属于NPC问题。下文即将介绍它。
下文即将介绍逻辑电路问题。这是第一个NPC问题。其它的NPC问题都是由这个问题约化而来的。因此,逻辑电路问题是NPC类问题的“鼻祖”。
逻辑电路问题是指的这样一个问题:给定一个逻辑电路,问是否存在一种输入使输出为True。
什么叫做逻辑电路呢?一个逻辑电路由若干个输入,一个输出,若干“逻辑门”和密密麻麻的线组成。看下面一例,不需要解释你马上就明白了。
┌───┐
│ 输入1├─→┐ ┌──┐
└───┘ └─→┤ │
│ or ├→─┐
┌───┐ ┌─→┤ │ │ ┌──┐
│ 输入2├─→┤ └──┘ └─→┤ │
└───┘ │ ┌─→┤AND ├──→输出
└────────┘┌→┤ │
┌───┐ ┌──┐ │ └──┘
│ 输入3├─→┤ NOT├─→────┘
└───┘ └──┘
这是个较简单的逻辑电路,当输入1、输入2、输入3分别为True、True、False或False、True、False时,输出为True。
有输出无论如何都不可能为True的逻辑电路吗?有。下面就是一个简单的例子。
┌───┐
│输入1 ├→─┐ ┌──┐
└───┘ └─→┤ │
│AND ├─→┐
┌─→┤ │ │
│ └──┘ │ ┌──┐
│ └→┤ │
┌───┐ │ │AND ├─→输出
│输入2 ├→─┤ ┌──┐ ┌→┤ │
└───┘ └→┤NOT ├→──┘ └──┘
└──┘
上面这个逻辑电路中,无论输入是什么,输出都是False。我们就说,这个逻辑电路不存在使输出为True的一组输入。
回到上文,给定一个逻辑电路,问是否存在一种输入使输出为True,这即逻辑电路问题。
逻辑电路问题属于NPC问题。这是有严格证明的。它显然属于NP问题,并且可以直接证明所有的NP问题都可以约化到它(不要以为NP问题有无穷多个将给证明造成不可逾越的困难)。证明过程相当复杂,其大概意思是说任意一个NP问题的输入和输出都可以转换成逻辑电路的输入和输出(想想计算机内部也不过是一些 0和1的运算),因此对于一个NP问题来说,问题转化为了求出满足结果为True的一个输入(即一个可行解)。
有了第一个NPC问题后,一大堆NPC问题就出现了,因为再证明一个新的NPC问题只需要将一个已知的NPC问题约化到它就行了。后来,Hamilton 回路成了NPC问题,TSP问题也成了NPC问题。现在被证明是NPC问题的有很多,任何一个找到了多项式算法的话所有的NP问题都可以完美解决了。因此说,正是因为NPC问题的存在,P=NP变得难以置信。P=NP问题还有许多有趣的东西,有待大家自己进一步的挖掘。攀登这个信息学的巅峰是我们这一代的终极目标。现在我们需要做的,至少是不要把概念弄混淆了。
Wednesday, July 21, 2010
Simulating Secondary Sort on Values with Hadoop
from http://sonerbalkir.blogspot.com/2010/01/simulating-secondary-sort-on-values.html
Sorting and grouping the records come free with Hadoop, it's how map reduce works. In some applications however, you may need to get the values at the reduce step in sorted order. This is not a default behavior and it may be a bit tricky at first glance, since the records coming from the mappers do not follow a deterministic order. I'll describe a sample problem and show how to achieve sorting the values using a trick that is already present in the API.
Consider the following problem: We have a corpus with many documents, and we would like to compute the document weight of each word for every document. We can formulate the document weight of a word w in document d as follows:
D(w,d) = (frequency of w in d) / (frequency of w in the corpus)
For simplicity, let's assume each document has a distinct positive document id, specified as doc_id.
How would you compute the document weight of every word for each document in the corpus by a single map reduce job? Would it be enough to just emit pairs of the form (word, doc_id) and group these records inside the reducers? No, observe that there are two things missing in this approach.
First, for a particulcar word w, we need to compute the denominator(frequency of w in the corpus) to perform any of the computations, and we don't know it yet.
Second, we need the keep track of these corpus frequencies - probably in an in memory data structure like an array or a hash table. What if we have billions of documents?
We obviously need a better solution that's scalable.
Let's get a bit more creative and emit two things inside the mappers.
Notice that for each word, we emit two different pairs where each pair has a key created by concatenating two strings with a # character used for simplicity as a delimiter, word # number
The first record always contains a 0 in its number field, we will use it to compute the denominator. The second field contains the corresponding document's id, which is guaranteed to be a positive number according to the input specifications.
It is now a good time to give an example to visualize things and better understand how the intermediate records look like.
Suppose we have two documents with id's 1 and 2, and let's assume the word 'olive' occurs twice in document 1 and once in document 2. Here is how these documents look like:
.. ... .... olive ..... .. ..
----- document 1-------
and
....... .... ..... olive ...
----- document 2-------
The intermediate records from the mappers might then look like:
olive#0, 1
olive#1, 1
olive#0, 1
olive#1, 1
olive#0, 1
olive#2, 1
and since Hadoop would automatically sort them based on their keys - which in this case are strings - before the reduce step, they would look like:
olive#0, 1
olive#0, 1
olive#0, 1
olive#1, 1
olive#1, 1
olive#2, 1
Observe that this automatic sorting facility gave us an ordered collection of records based on their document id's for free! The only problem we have here right now is the keys olive#0, olive#1, and olive#2 being three different strings and hence they will go to different reducers. We want to guarantee that all the records of the form olive#X will go to the same reducer, namely the olive reducer.
In order to accomplish this, we'll change the default partitioner and overwrite it. The default partitioner works by simply hashing the whole key of each record.
We can override it to make sure that any record of the form olive#X will go to the same reducer.
Observe how we take advantage of the two comparators in the framework. We use the sort comparator class to do a total ordering first, and then implement a custom grouping comparator class to simulate a secondary sort on values by changing the way the groups are created. I'm saying "simulate" because we don't actually perform a second sort on values, they come already sorted!
Finally, our reducer class should look like something similar to this:
public class Reducer {
}
Please keep in mind that I concatenated two strings for the purpose of clarity, but in a real application this may result in performance problems. The best way to emit tuples (multiple attributes) as a key or value is to write custom WritableComparable objects containing multiple fields and implement the required methods including write, read, compareToand equals. Also, using a combiner will obviously reduce the network traffic and boost the performance.
Sorting and grouping the records come free with Hadoop, it's how map reduce works. In some applications however, you may need to get the values at the reduce step in sorted order. This is not a default behavior and it may be a bit tricky at first glance, since the records coming from the mappers do not follow a deterministic order. I'll describe a sample problem and show how to achieve sorting the values using a trick that is already present in the API.
Consider the following problem: We have a corpus with many documents, and we would like to compute the document weight of each word for every document. We can formulate the document weight of a word w in document d as follows:
D(w,d) = (frequency of w in d) / (frequency of w in the corpus)
For simplicity, let's assume each document has a distinct positive document id, specified as doc_id.
How would you compute the document weight of every word for each document in the corpus by a single map reduce job? Would it be enough to just emit pairs of the form (word, doc_id) and group these records inside the reducers? No, observe that there are two things missing in this approach.
First, for a particulcar word w, we need to compute the denominator(frequency of w in the corpus) to perform any of the computations, and we don't know it yet.
Second, we need the keep track of these corpus frequencies - probably in an in memory data structure like an array or a hash table. What if we have billions of documents?
We obviously need a better solution that's scalable.
Let's get a bit more creative and emit two things inside the mappers.
map {
for each word w observed in document doc_id {
emit <word#0, 1>
emit <word#doc_id, 1>
}
}Notice that for each word, we emit two different pairs where each pair has a key created by concatenating two strings with a # character used for simplicity as a delimiter, word # number
The first record always contains a 0 in its number field, we will use it to compute the denominator. The second field contains the corresponding document's id, which is guaranteed to be a positive number according to the input specifications.
It is now a good time to give an example to visualize things and better understand how the intermediate records look like.
Suppose we have two documents with id's 1 and 2, and let's assume the word 'olive' occurs twice in document 1 and once in document 2. Here is how these documents look like:
------------------------
.... olive ... .. . ... .... ... ... .... olive ..... .. ..
----- document 1-------
and
------------------------
..... ..... .... . ... .. .. ....... .... ..... olive ...
----- document 2-------
The intermediate records from the mappers might then look like:
olive#0, 1
olive#1, 1
olive#0, 1
olive#1, 1
olive#0, 1
olive#2, 1
and since Hadoop would automatically sort them based on their keys - which in this case are strings - before the reduce step, they would look like:
olive#0, 1
olive#0, 1
olive#0, 1
olive#1, 1
olive#1, 1
olive#2, 1
Observe that this automatic sorting facility gave us an ordered collection of records based on their document id's for free! The only problem we have here right now is the keys olive#0, olive#1, and olive#2 being three different strings and hence they will go to different reducers. We want to guarantee that all the records of the form olive#X will go to the same reducer, namely the olive reducer.
In order to accomplish this, we'll change the default partitioner and overwrite it. The default partitioner works by simply hashing the whole key of each record.
int partition(record ) {
return hash(key) % N ;
return hash(key) % N ;
// where N is the number of reducers
}
}
We can override it to make sure that any record of the form olive#X will go to the same reducer.
int partition(record ) {
string real_key = key.substring(0,last index of '#');
return hash(real_key) % N;
}
return hash(real_key) % N;
}
You can use org.apache.hadoop.mapreduce.Job's setPartitionerClass
Before we go any further, I should probably mention the comparators used by the Hadoop framework for a map/reduce job. There are two comparators used during different stages of the computation:
Sort Comparator Class: Used to control how the keys are sorted before the reduce step.
Grouping Comparator Class: Used to control which keys are a single call to reduce.
The default behavior is to use the same comparator class for both, but this can be customized via Job'ssetSortComparatorClass and setGroupingComparatorClass methods respectively.
Now back to our problem...
Overriding the partitioner ensures that all of the olive#X records will go to the same reducer, but it doesn't specify the order by which the groups are created. For example, we know olive#0, olive#1 and olive#2 will go to the same reducer, but they are still treated as part of different groups, because the default grouping comparator -org.apache.hadoop.io.TextComparator - is used to decide which keys are a single call to reduce.
In order to resolve this tiny glitch, all we have to do is to write our own grouping comparator class which extendsTextComparator and overrides the compare() method so that it only takes the first part of the key (up to #) into consideration. I'll skip the code for this step since it is more or less the same thing we did for the partitioner.
By using a custom partitioner and overriding the default sort comparator class, we guarantee that all of the olive#Xrecords will go to the same reducer group, and they will be ordered by their document id's. There will be one group for each word w, and all of the document id's that contain w will be processed by the same reducer starting from doc_id = 0 (which will be used to compute the denominator).
method to use a custom partitioner with a map reduce job.Before we go any further, I should probably mention the comparators used by the Hadoop framework for a map/reduce job. There are two comparators used during different stages of the computation:
Sort Comparator Class: Used to control how the keys are sorted before the reduce step.
Grouping Comparator Class: Used to control which keys are a single call to reduce.
The default behavior is to use the same comparator class for both, but this can be customized via Job'ssetSortComparatorClass and setGroupingComparatorClass methods respectively.
Now back to our problem...
Overriding the partitioner ensures that all of the olive#X records will go to the same reducer, but it doesn't specify the order by which the groups are created. For example, we know olive#0, olive#1 and olive#2 will go to the same reducer, but they are still treated as part of different groups, because the default grouping comparator -org.apache.hadoop.io.TextComparator - is used to decide which keys are a single call to reduce.
In order to resolve this tiny glitch, all we have to do is to write our own grouping comparator class which extendsTextComparator and overrides the compare() method so that it only takes the first part of the key (up to #) into consideration. I'll skip the code for this step since it is more or less the same thing we did for the partitioner.
Observe how we take advantage of the two comparators in the framework. We use the sort comparator class to do a total ordering first, and then implement a custom grouping comparator class to simulate a secondary sort on values by changing the way the groups are created. I'm saying "simulate" because we don't actually perform a second sort on values, they come already sorted!
Finally, our reducer class should look like something similar to this:
public class Reducer {
long denominator;
public void reduce(key, values) {
string word = key.substring(0, last index of '#');
string doc_id = key.substring(last index of '#' + 1, key.length);
if (doc_id == 0 ) {
// compute the denominator for this word
denominator = 0;
// compute the denominator for this word
denominator = 0;
for each v in values
denominator += v;
} else {
// we already know the denominator for this word, use it
// we already know the denominator for this word, use it
long sum = 0;
for each v in values
sum += v;
emit(key, sum / denominator);
}}
Please keep in mind that I concatenated two strings for the purpose of clarity, but in a real application this may result in performance problems. The best way to emit tuples (multiple attributes) as a key or value is to write custom WritableComparable objects containing multiple fields and implement the required methods including write, read, compareToand equals. Also, using a combiner will obviously reduce the network traffic and boost the performance.
Tuesday, June 29, 2010
eigenvector and Pagerank
The dominant or principal eigenvector of a matrix is an eigenvector corresponding to the eigenvalue of the largest magnitude of that matrix.
Today, I just read a passage by chance talking about using power method to compute eigenvector. Power method is also called vector iteration, which suddenly struck me on pagerank's computation. here is the pagerank formation:
$R^(k+1)=dWR^k+(1-d)E$
and then let's look at how to use power method to compute it. from here.
Eigenvalues can be ordered in terms of their absolute values to find the dominant or largest eigenvalue of a matrix. Thus, if two distinct hypothetical matrices have the following set of eigenvalues
In its basic form the Power Method is applied as follows:
Today, I just read a passage by chance talking about using power method to compute eigenvector. Power method is also called vector iteration, which suddenly struck me on pagerank's computation. here is the pagerank formation:
$R^(k+1)=dWR^k+(1-d)E$
and then let's look at how to use power method to compute it. from here.
Eigenvalues can be ordered in terms of their absolute values to find the dominant or largest eigenvalue of a matrix. Thus, if two distinct hypothetical matrices have the following set of eigenvalues
- 5,8,-7: then |8|>|-7|>|5| and 8 is the dominant eigenvalue.
- 0.2,-1,1: then |1|=|-1|>|0.2| and since |1|=|-1| there is no dominant eigenvalue.
In its basic form the Power Method is applied as follows:
- Asign to the candidate matrix an arbitrary eigenvector with at least one element being nonzero.
- Compute a new eigenvector.
- Normalize the eigenvector, where the normalization scalar is taken for an initial eigenvalue.
- Multiply the original matrix by the normalized eigenvector to calculate a new eigenvector.
- Normalize this eigenvector, where the normalization scalar is taken for a new eigenvalue.
- Repeat the entire process until the absolute relative error between successive eigenvalues satisfies an arbitrary tolerance (threshold) value.
What we have done here is apply repeatedly a matrix to an arbitrarily chosen eigenvector. The result converges nicely to the largest eigenvalue of the matrix; i.e.
Equation 6: AkXi = cik*Xi
Figure 7 provides a visual representation of the iteration process obtained through the Power Method for the matrix given in Figure 3. As expected, for its largest eigenvalue the iterated vector converges to an eigenvector of relative coordinates (1, 0.20).
Figure 7. Visual representation of vector iteration.
It can be demonstrated that guessing an initial eigenvector in which its first element is 1 and all others are zero produces in the next iteration step an eigenvector with elements being the first column of the matrix. Thus, one could simply choose the first column of a matrix as an initial seed.
Whether you want to try a matrix column as an initial seed, keep in mind that the rate of convergence of the power method actually depends on the nature of the eigenvalues. For closely spaced eigenvalues, the rate of convergence can be slow. Several methods for improving the rate of convergence have been proposed (Shifted Iteration, Shifted Inverse Iteration or transformation methods).
For more information about power method, look at here.
This method is so similar to Pagerank. Actually, for pagerank, we compute a web graph's dominant eigenvector, which represents the graph. It reduces the dimension from n to 1, n is the number of web pages. So the dominant eigenvector has n elements, each represents a web page, and the value indicates the importance of that page.
Saturday, June 19, 2010
Hiking around Quabbin Reservoir
A very interesting hiking day spent in Quabbin Reservoir. We started off at 10:30am. Although spending a lot of time on finding the correct entrance, it didn't affect our mood. Actually my muscle is quite tight and sour for yesterday's training, but anyway it is still OK for a strong man.
We also played killer game. A day with great fun.
Monday, June 14, 2010
fwd: Ten Commands Every Linux Developer Should Know
some quite useful commands you should use, Ten Commands Every Linux Developer Should Know
Sunday, June 13, 2010
Yale trip
June 12th-14th, Yale Medical School Campus. Sadly, I am an experiment sample for a physical muscle research. We arrived at New Haven at 9pm, 12th, spent some time on surfing Internet before going to bed. What a tragedy, I lost my sleep in the first night, this is a common case for me when traveling to a new place... Getting up so early at 6am in the second morning, and then I stayed in the MRI hole for another 2 hours. Fortunately, this time I slept well. Followed by another MRI scan for 1 hour, I am done for the first half day, in the afternoon, a training session waiting for me.
1:00pm-2:45pm, I was supposed to be preparing for the later training session. But I asked to Ryan to let me get out and walk around during this time. Oh, year, I was on the way to Yale main campus. The followings are the photos what I took on campus...
This video was taken when I was in the square. I like this place, very peaceful, very quite...
I am on that square. Take a notice the building behind me, it is the symbol building of Yale.
I don't even know this building's name...anyway, just take it, never regret.
It should be auditorium or something...um..probably..
Library, I know this. You may notice that the windows are made by marble instead of glass. it is said that, the sunshine can filtered through marble, which make Yale University's library quite special...
In Memory of the Men of Yale who true to Her Traditions gave their Lives that Freedom might not perish from the Earth. 1914 Anno Domini 1918. ...
I like this style...
It feels like Tsinghua University's lawn square, isn't it?
Famous Yale's water table...
1:00pm-2:45pm, I was supposed to be preparing for the later training session. But I asked to Ryan to let me get out and walk around during this time. Oh, year, I was on the way to Yale main campus. The followings are the photos what I took on campus...
New Haven hotel is just in front of my accommodation.
I don't know what it is, it seems famous...
I am on a large square, this man's status looks very popular...
I don't know what it is, it seems famous...
I am on a large square, this man's status looks very popular...
This video was taken when I was in the square. I like this place, very peaceful, very quite...
I am on that square. Take a notice the building behind me, it is the symbol building of Yale.
I don't even know this building's name...anyway, just take it, never regret.
It should be auditorium or something...um..probably..
Library, I know this. You may notice that the windows are made by marble instead of glass. it is said that, the sunshine can filtered through marble, which make Yale University's library quite special...
In Memory of the Men of Yale who true to Her Traditions gave their Lives that Freedom might not perish from the Earth. 1914 Anno Domini 1918. ...
I like this style...
It feels like Tsinghua University's lawn square, isn't it?
Famous Yale's water table...
nice gallery...
Friday, June 11, 2010
This is Brothers -- Celtics has a brilliant game over Lakers
OH MY GOD, what a great victory!!!
When substitute Celtics rout luxurious Lakers, I saw "this is BASKETBALL". It is five, it is passion, it is combatant spirit, they are fighters. At that time, I think they are the best symbol of basketball, they give us a best explanation of basketball.
From the beginning of fourth quarter to the last 2 minutes, most time of the last quarter, Celtics substitute players (except for Ray Allen), had a brilliant performance, totally over Lakers, over Kobe. Big baby is so passionate that everybody had to love him. Robinson's great job out-played his faults. Tony Allen's impenetrable defense makes Kobe dark. Wallace's passion is no weaker than big baby, and he gave Lakers a unbelievable 3 points shoot. Ray Allen is a calm stone, his experience and coolness was done to a turn.
Beginning with this game, I love Celtics to be a fan of them.
When substitute Celtics rout luxurious Lakers, I saw "this is BASKETBALL". It is five, it is passion, it is combatant spirit, they are fighters. At that time, I think they are the best symbol of basketball, they give us a best explanation of basketball.
From the beginning of fourth quarter to the last 2 minutes, most time of the last quarter, Celtics substitute players (except for Ray Allen), had a brilliant performance, totally over Lakers, over Kobe. Big baby is so passionate that everybody had to love him. Robinson's great job out-played his faults. Tony Allen's impenetrable defense makes Kobe dark. Wallace's passion is no weaker than big baby, and he gave Lakers a unbelievable 3 points shoot. Ray Allen is a calm stone, his experience and coolness was done to a turn.
Beginning with this game, I love Celtics to be a fan of them.
Thursday, June 3, 2010
Finding Memory Leaks in Java Apps
Here is a small HOWTO on how to find memory leaks with Java SE.
I’ve written it while trying to find memory leaks in our testing tools: JTHarness and ME Framework, and then wanted to share the HOWTO with the world, but I didn’t have my blog at that time, so I posted this info as a comment to a relevant entry in the excellent Adam Bien’s blog.
Note: Use the latest JDK 6, because it has the latest tools, with lots of bug fixes and improvements. All the later examples assume that JDK6’s bin directory is in the PATH.
For example, if you open some documents in your app, the memory graph could rapidly go up. If closing the docs and invocation of full garbage collection did not bring the memory back to normal level, there is probably a leak somewhere.You might use
Alternatively, if you started java with hprof agent, you could just use Ctrl-\ on Solaris/Linux or Ctrl-Break on Windows to dump heap into the file, specified in hprof agent arguments.
Jhat allows you to see what objects are present in the heap, who has references to those objects, etc.
Here are some tips:
I’ve written it while trying to find memory leaks in our testing tools: JTHarness and ME Framework, and then wanted to share the HOWTO with the world, but I didn’t have my blog at that time, so I posted this info as a comment to a relevant entry in the excellent Adam Bien’s blog.
Note: Use the latest JDK 6, because it has the latest tools, with lots of bug fixes and improvements. All the later examples assume that JDK6’s bin directory is in the PATH.
Step 1. Start the application.
Start the application as you usually do:java -jar java_app.jarAlternatively, you could start java with hprof agent. Java will run slower, but the huge benefit of this approach is that the stack traces for created objects will be available which improves memory leak analysis greatly:
java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -agentlib:hprof=heap=dump,file=/tmp/hprof.bin, format=b,depth=10 -jar java_app.jarWhen the application is up, perform various actions that you think might lead to memory leaks.
For example, if you open some documents in your app, the memory graph could rapidly go up. If closing the docs and invocation of full garbage collection did not bring the memory back to normal level, there is probably a leak somewhere.You might use
jconsole
from JDK 6 to see the memory consumption graph to have a clue whether memory leak is present or not:jconsoleIt will pop up a dialog with a list of Java applications to connect to. Find the one with java_app.jar and connect. Also, jconsole allows to invoke full GC providing nice button just for that.
Step 2. Find the application pid.
Find out the application’s process id via:jpsIt will print something like:
15976 java_app.jarIn our case the pid is 15976.
7586 startup.jar
22476 Jps
12248 Main
5437 Bootstrap
Step 3. Dump the heap into file.
Dump heap into the file:jmap -dump:format=b,file=/tmp/java_app-heap.bin 15976We just told jmap to dump the heap into /tmp/java_app-heap.bin file, in binary from (which is optimized to work with large heaps). The third parameter is the pid we found in Step 2.
Alternatively, if you started java with hprof agent, you could just use Ctrl-\ on Solaris/Linux or Ctrl-Break on Windows to dump heap into the file, specified in hprof agent arguments.
Step 4. Visualize the heap.
Use jhat tool to visualize the heap:jhat -J-Xmx326m /tmp/java_app-heap.binJhat will parse the heap dump and start a web server at port 7000. Connect to Jhat server by pointing your browser to:
http://localhost:7000And start investigating.
Jhat allows you to see what objects are present in the heap, who has references to those objects, etc.
Here are some tips:
- Investigate _instances_, not _classes_.
- Use the following URL to see the instances: http://localhost:7000/showInstanceCounts/
- Use “Reference Chains from Rootset” (Exclude weak refs!!!) to see who’s holding the instance.
This entry was posted on Monday, April 2nd, 2007 at 1:56 am and is filed under Java. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.
Wednesday, May 26, 2010
Taken on UMass Commencement day
It was ten days ago...
Umass commencement day...since Amherst is a university town, when commencement day comes, it seems more people than ever...This video clip is taken when we was passing Mullins center..a lot of students and their parents outside...
The other video clip is taken on the way to waltmart, it is very common American scene, just because it is common, I took it...maybe several years later when I am back China, I will be really like this scene...
Wednesday, May 12, 2010
Seasons - almost a year in Amherst
Since last August, I have been Amherst for almost a full year. These photos were taken during the last months. From summer, to autumn, to winter, to current spring, if I never take these photos, I even don't notice how time flies. If I never take these photos, I even can not remember these changing weathers, and beautiful memories. Anyway, wish the following days of the first year at Amherst, I can have a research harvest. Hope my research result would turn to a paper published... So I can enjoy American summer happily.
Subscribe to:
Posts (Atom)