----- Original Message -----
> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> To: cfan at redhat.com
> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com, kexec at lists.infradead.org
> Sent: Thursday, December 24, 2015 11:50:08 AM
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>
> From: Chao Fan <cfan at redhat.com>
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> Date: Wed, 23 Dec 2015 22:31:37 -0500
>
> >
> >
> > ----- Original Message -----
> >> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> >> To: cfan at redhat.com
> >> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com,
> >> kexec at lists.infradead.org
> >> Sent: Thursday, December 24, 2015 11:22:28 AM
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >>
> >> From: Chao Fan <cfan at redhat.com>
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> Date: Wed, 23 Dec 2015 21:20:48 -0500
> >>
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> >> >> To: cfan at redhat.com
> >> >> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com,
> >> >> kexec at lists.infradead.org
> >> >> Sent: Tuesday, December 22, 2015 4:32:25 PM
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >>
> >> >> Chao,
> >> >>
> >> >> From: Chao Fan <cfan at redhat.com>
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> Date: Thu, 10 Dec 2015 05:54:28 -0500
> >> >>
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> >> From: "Wenjian Zhou" <zhouwj-fnst at cn.fujitsu.com>
> >> >> >> To: "Chao Fan" <cfan at redhat.com>
> >> >> >> Cc: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>,
> >> >> >> kexec at lists.infradead.org
> >> >> >> Sent: Thursday, December 10, 2015 6:32:32 PM
> >> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> >>
> >> >> >> On 12/10/2015 05:58 PM, Chao Fan wrote:
> >> >> >> >
> >> >> >> >
> >> >> >> > ----- Original Message -----
> >> >> >> >> From: "Wenjian Zhou" <zhouwj-fnst at cn.fujitsu.com>
> >> >> >> >> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
> >> >> >> >> Cc: kexec at lists.infradead.org
> >> >> >> >> Sent: Thursday, December 10, 2015 5:36:47 PM
> >> >> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> >> >>
> >> >> >> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
> >> >> >> >>>> Hello Kumagai,
> >> >> >> >>>>
> >> >> >> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
> >> >> >> >>>>> Hello, Zhou
> >> >> >> >>>>>
> >> >> >> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
> >> >> >> >>>>>>> Hi,
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian" wrote:
> >> >> >> >>>>>>>> I think there is no problem if the other test results are as
> >> >> >> >>>>>>>> expected.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> --num-threads mainly reduces the time spent on compressing.
> >> >> >> >>>>>>>> So for lzo, it can't help much most of the time.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Seems the help text of --num-threads does not say it exactly:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>   [--num-threads THREADNUM]:
> >> >> >> >>>>>>>       Using multiple threads to read and compress data of each page
> >> >> >> >>>>>>>       in parallel.
> >> >> >> >>>>>>>       And it will reduces time for saving DUMPFILE.
> >> >> >> >>>>>>>       This feature only supports creating DUMPFILE in
> >> >> >> >>>>>>>       kdump-comressed format from VMCORE in kdump-compressed
> >> >> >> >>>>>>>       format or elf format.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Lzo is also a compression method; it should be mentioned that
> >> >> >> >>>>>>> --num-threads only supports zlib-compressed vmcores.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> Sorry, it seems that something I said was not so clear.
> >> >> >> >>>>>> lzo is also supported. Since lzo compresses data at a high
> >> >> >> >>>>>> speed, the performance improvement is not so obvious most of
> >> >> >> >>>>>> the time.
> >> >> >> >>>>>>
> >> >> >> >>>>>>> Also worth mentioning the recommended -d value for this
> >> >> >> >>>>>>> feature.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> Yes, I think it's worth mentioning. I forgot it.
> >> >> >> >>>>>
> >> >> >> >>>>> I saw your patch, but I think I should confirm what the
> >> >> >> >>>>> problem is first.
> >> >> >> >>>>>
> >> >> >> >>>>>> However, when "-d 31" is specified, it will be worse.
> >> >> >> >>>>>> Fewer than 50 buffers are used to cache the compressed pages.
> >> >> >> >>>>>> And even if a page has been filtered, it will still take a
> >> >> >> >>>>>> buffer. So if "-d 31" is specified, the filtered pages will
> >> >> >> >>>>>> use a lot of buffers. Then the pages which need to be
> >> >> >> >>>>>> compressed can't be compressed in parallel.
> >> >> >> >>>>>
> >> >> >> >>>>> Could you explain in more detail why compression will not be
> >> >> >> >>>>> parallel? Using the buffers also for filtered pages sounds
> >> >> >> >>>>> inefficient, but I don't understand why it prevents parallel
> >> >> >> >>>>> compression.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Think about this: with a huge memory, most of the pages will be
> >> >> >> >>>> filtered, and we have 5 buffers.
> >> >> >> >>>>
> >> >> >> >>>> page1       page2     page3     page4     page5     page6       page7     .....
> >> >> >> >>>> [buffer1]   [2]       [3]       [4]       [5]
> >> >> >> >>>> unfiltered  filtered  filtered  filtered  filtered  unfiltered  filtered
> >> >> >> >>>>
> >> >> >> >>>> Since a filtered page will also take a buffer, page6 can't be
> >> >> >> >>>> compressed at the same time as page1.
> >> >> >> >>>> That's why it prevents parallel compression.
> >> >> >> >>>
> >> >> >> >>> Thanks for your explanation, I understand.
> >> >> >> >>> This is just an issue of the current implementation; there is no
> >> >> >> >>> reason to accept this restriction.
> >> >> >> >>>
> >> >> >> >>>>> Further, according to Chao's benchmark, there is a big
> >> >> >> >>>>> performance degradation even if the number of threads is 1
> >> >> >> >>>>> (58s vs 240s).
> >> >> >> >>>>> The current implementation seems to have some problems; we
> >> >> >> >>>>> should solve them.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> If "-d 31" is specified, on the one hand we can't save time by
> >> >> >> >>>> compressing in parallel, and on the other hand we introduce
> >> >> >> >>>> some extra work by adding "--num-threads". So it is obvious
> >> >> >> >>>> that there will be a performance degradation.
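To make the 5-buffer example above concrete, here is a rough, self-contained sketch in plain C. It is not makedumpfile's actual code; the buffer count, the "every 6th page is unfiltered" ratio, and all names are made up for illustration only.

#include <stdio.h>
#include <stdbool.h>

#define NUM_BUFFERS 5    /* small fixed pool, as in the example above */
#define NUM_PAGES   20

/* Hypothetical per-buffer state, just for this sketch. */
struct page_buffer {
    int  page_idx;
    bool needs_compression;   /* false for filtered pages */
};

int main(void)
{
    struct page_buffer ring[NUM_BUFFERS];
    int page = 0;

    while (page < NUM_PAGES) {
        int used = 0;
        int compressible = 0;

        /* Fill the pool in page order; filtered pages still take a slot. */
        for (int i = 0; i < NUM_BUFFERS && page < NUM_PAGES; i++, page++) {
            ring[i].page_idx = page;
            /* Pretend only every 6th page survives filtering (like -d 31). */
            ring[i].needs_compression = (page % 6 == 0);
            if (ring[i].needs_compression)
                compressible++;
            used++;
        }

        /* Only 'compressible' buffers give the worker threads real work. */
        printf("batch of %d buffers ending at page %2d: only %d need compression\n",
               used, page - 1, compressible);
    }
    return 0;
}

With memory that is mostly filtered, each refill of the pool contains at most one page that really needs compression, so even with many compressing threads there is almost nothing to run in parallel; letting filtered pages skip the buffers (or using many more buffers) would avoid this.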
> >> >> >> >>>
> >> >> >> >>> Sure, there must be some overhead due to "some extra work"
> >> >> >> >>> (e.g. the exclusive lock), but "--num-threads=1 is 4 times
> >> >> >> >>> slower than --num-threads=0" still sounds too slow; the
> >> >> >> >>> degradation is too big to be called "some extra work".
> >> >> >> >>>
> >> >> >> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
> >> >> >> >>> so the above "buffer fairness issue" should not be related to
> >> >> >> >>> this degradation.
> >> >> >> >>> What do you think causes this degradation?
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> I can't reproduce such a result at this moment, so I can't do any
> >> >> >> >> further investigation right now. I guess it may be caused by the
> >> >> >> >> underlying implementation of pthread.
> >> >> >> >> I reviewed the test results of patch v2 and found that the
> >> >> >> >> results are quite different on different machines.
> >> >> >> >
> >> >> >> > Hi Zhou Wenjian,
> >> >> >> >
> >> >> >> > I have done more tests on another machine with 128G of memory, and
> >> >> >> > got these results:
> >> >> >> >
> >> >> >> > The size of the vmcore is 300M with "-d 31":
> >> >> >> >
> >> >> >> > makedumpfile -l --message-level 1 -d 31:
> >> >> >> >     time: 8.6s     page-faults: 2272
> >> >> >> >
> >> >> >> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
> >> >> >> >     time: 28.1s    page-faults: 2359
> >> >> >> >
> >> >> >> > And the size of the vmcore is 2.6G with "-d 0".
> >> >> >> > On this machine, I get the same result as yours:
> >> >> >> >
> >> >> >> > makedumpfile -c --message-level 1 -d 0:
> >> >> >> >     time: 597s     page-faults: 2287
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
> >> >> >> >     time: 602s     page-faults: 2361
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
> >> >> >> >     time: 337s     page-faults: 2397
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
> >> >> >> >     time: 175s     page-faults: 2461
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
> >> >> >> >     time: 103s     page-faults: 2611
> >> >> >> >
> >> >> >> > But the machine of my first test is not under my control. Should I
> >> >> >> > wait for the first machine to do more tests?
> >> >> >> > If there are still some problems in my tests, please tell me.
> >> >> >> >
> >> >> >>
> >> >> >> Thanks a lot for your tests; it seems that there is nothing wrong.
> >> >> >> And I haven't got any ideas for more tests...
> >> >> >>
> >> >> >> Could you provide the information about your CPU?
> >> >> >> I will do some further investigation later.
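For what it's worth, taking the 597s serial "-c -d 0" run above as the baseline, the speedups work out to roughly:

    597 / 602 ≈ 0.99  (--num-threads 1)
    597 / 337 ≈ 1.77  (--num-threads 2)
    597 / 175 ≈ 3.41  (--num-threads 4)
    597 / 103 ≈ 5.80  (--num-threads 8)

So with zlib and -d 0 the threads scale reasonably well, while the -d 31 case (8.6s serial vs 28.1s with --num-threads 1) is the one that regresses.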
> >> >> >>
> >> >> > OK, of course, here is the CPU information:
> >> >> >
> >> >> > # lscpu
> >> >> > Architecture:          x86_64
> >> >> > CPU op-mode(s):        32-bit, 64-bit
> >> >> > Byte Order:            Little Endian
> >> >> > CPU(s):                48
> >> >> > On-line CPU(s) list:   0-47
> >> >> > Thread(s) per core:    1
> >> >> > Core(s) per socket:    6
> >> >> > Socket(s):             8
> >> >> > NUMA node(s):          8
> >> >> > Vendor ID:             AuthenticAMD
> >> >> > CPU family:            16
> >> >> > Model:                 8
> >> >> > Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
> >> >> > Stepping:              0
> >> >> > CPU MHz:               2793.040
> >> >> > BogoMIPS:              5586.22
> >> >> > Virtualization:        AMD-V
> >> >> > L1d cache:             64K
> >> >> > L1i cache:             64K
> >> >> > L2 cache:              512K
> >> >> > L3 cache:              5118K
> >> >> > NUMA node0 CPU(s):     0,8,16,24,32,40
> >> >> > NUMA node1 CPU(s):     1,9,17,25,33,41
> >> >> > NUMA node2 CPU(s):     2,10,18,26,34,42
> >> >> > NUMA node3 CPU(s):     3,11,19,27,35,43
> >> >> > NUMA node4 CPU(s):     4,12,20,28,36,44
> >> >> > NUMA node5 CPU(s):     5,13,21,29,37,45
> >> >> > NUMA node6 CPU(s):     6,14,22,30,38,46
> >> >> > NUMA node7 CPU(s):     7,15,23,31,39,47
> >> >>
> >> >> This CPU assignment on NUMA nodes looks interesting. Is it possible
> >> >> that this affects the performance of makedumpfile? This is just a guess.
> >> >>
> >> >> Could you check whether the performance gets improved if you run each
> >> >> thread on the same NUMA node? For example:
> >> >>
> >> >> # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> >>
> >> > Hi HATAYAMA,
> >> >
> >> > I think your guess is right, but maybe your command has a little problem.
> >> >
> >> > From my tests, NUMA did affect the performance, but not too much.
> >> > The average time with CPUs in the same NUMA node:
> >> > # taskset -c 0,8,16,24,32 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> > is 314s.
> >> > The average time with CPUs in different NUMA nodes:
> >> > # taskset -c 2,3,5,6,7 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> > is 354s.
> >> >
> >>
> >> Hmm, according to some previous discussion, what we should see here is
> >> whether it affects the performance of makedumpfile with --num-threads 1
> >> and -d 31. So you need to compare:
> >>
> >> # taskset 0,8 makedumpfile --num-threads 1 -c -d 31 vmcore vmcore-d31
> >>
> >> with:
> >>
> >> # taskset 0 makedumpfile -c -d 0 vmcore vmcore-d31
>
> I removed the -c option wrongly. What I wanted to write is:
>
> # taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
>
> and:
>
> # taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31
>
> just in case...
>

Hi HATAYAMA,

The average time of
# taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
is 33s.

The average time of
# taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31
is 18s.

My test steps:
1. Change /etc/kdump.conf to use "core_collector makedumpfile -l --message-level 1 -d 31".
2. Trigger a crash.
3. cd into the directory of the vmcore made by kdump.
4. In that directory, run
   # taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
   or
   # taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31

If there are any problems, please tell me.

Thanks,
Chao Fan

> >>
> >> Also, I'm assuming that you've done these benchmarks in the kdump 1st
> >> kernel, not the kdump 2nd kernel. Is this correct?
> >>
> > Hi HATAYAMA,
> >
> > I tested in the first kernel, not in the kdump second kernel.
> >
>
> I see.
>
> --
> Thanks.
> HATAYAMA, Daisuke
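As a side note on the affinity experiments: taskset pins the whole process from outside. If someone wanted to try pinning only the worker threads from inside the program, a minimal sketch using the GNU pthread_setaffinity_np() extension could look like the following. The CPU list mirrors NUMA node0 from the lscpu output above; the thread function and all names are placeholders, not makedumpfile code. Build with "gcc -pthread".

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* CPUs of NUMA node0 from the lscpu output above (example values). */
static const int node0_cpus[] = { 0, 8, 16, 24, 32, 40 };

/* Placeholder worker; the real work (reading/compressing pages) would go here. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    cpu_set_t set;

    /* Build a CPU mask containing only node0's CPUs. */
    CPU_ZERO(&set);
    for (size_t i = 0; i < sizeof(node0_cpus) / sizeof(node0_cpus[0]); i++)
        CPU_SET(node0_cpus[i], &set);

    for (int i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        /*
         * Restrict the worker to node0's CPUs.  (Applied right after
         * creation; pthread_attr_setaffinity_np() could instead set the
         * mask before the thread starts running.)
         */
        if (pthread_setaffinity_np(tid[i], sizeof(set), &set) != 0)
            fprintf(stderr, "failed to set affinity for thread %d\n", i);
    }

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}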