----- Original Message -----
> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> To: cfan at redhat.com
> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com, kexec at lists.infradead.org
> Sent: Thursday, December 24, 2015 11:50:08 AM
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>
> From: Chao Fan <cfan at redhat.com>
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> Date: Wed, 23 Dec 2015 22:31:37 -0500
>
> >
> >
> > ----- Original Message -----
> >> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> >> To: cfan at redhat.com
> >> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com,
> >> kexec at lists.infradead.org
> >> Sent: Thursday, December 24, 2015 11:22:28 AM
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >>
> >> From: Chao Fan <cfan at redhat.com>
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> Date: Wed, 23 Dec 2015 21:20:48 -0500
> >>
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
> >> >> To: cfan at redhat.com
> >> >> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com,
> >> >> kexec at lists.infradead.org
> >> >> Sent: Tuesday, December 22, 2015 4:32:25 PM
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >>
> >> >> Chao,
> >> >>
> >> >> From: Chao Fan <cfan at redhat.com>
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> Date: Thu, 10 Dec 2015 05:54:28 -0500
> >> >>
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> >> From: "Wenjian Zhou" <zhouwj-fnst at cn.fujitsu.com>
> >> >> >> To: "Chao Fan" <cfan at redhat.com>
> >> >> >> Cc: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>,
> >> >> >> kexec at lists.infradead.org
> >> >> >> Sent: Thursday, December 10, 2015 6:32:32 PM
> >> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> >>
> >> >> >> On 12/10/2015 05:58 PM, Chao Fan wrote:
> >> >> >> >
> >> >> >> >
> >> >> >> > ----- Original Message -----
> >> >> >> >> From: "Wenjian Zhou" <zhouwj-fnst at cn.fujitsu.com>
> >> >> >> >> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
> >> >> >> >> Cc: kexec at lists.infradead.org
> >> >> >> >> Sent: Thursday, December 10, 2015 5:36:47 PM
> >> >> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >> >> >>
> >> >> >> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
> >> >> >> >>>> Hello Kumagai,
> >> >> >> >>>>
> >> >> >> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
> >> >> >> >>>>> Hello, Zhou
> >> >> >> >>>>>
> >> >> >> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
> >> >> >> >>>>>>> Hi,
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian" wrote:
> >> >> >> >>>>>>>> I think there is no problem if the other test results are as
> >> >> >> >>>>>>>> expected.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> --num-threads mainly reduces the time spent on compressing.
> >> >> >> >>>>>>>> So for lzo, it can't help much most of the time.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Seems the help text of --num-threads does not say it exactly:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>   [--num-threads THREADNUM]:
> >> >> >> >>>>>>>       Using multiple threads to read and compress data of each page
> >> >> >> >>>>>>>       in parallel.
> >> >> >> >>>>>>>       And it will reduces time for saving DUMPFILE.
> >> >> >> >>>>>>>       This feature only supports creating DUMPFILE in
> >> >> >> >>>>>>>       kdump-comressed format from VMCORE in kdump-compressed
> >> >> >> >>>>>>>       format or elf format.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Lzo is also a compression method; it should be mentioned that
> >> >> >> >>>>>>> --num-threads only supports zlib-compressed vmcores.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> Sorry, it seems that something I said was not so clear.
> >> >> >> >>>>>> lzo is also supported. Since lzo compresses data at a high
> >> >> >> >>>>>> speed, the performance improvement is not so obvious most of
> >> >> >> >>>>>> the time.
> >> >> >> >>>>>>
> >> >> >> >>>>>>> Also worth mentioning the recommended -d value for this
> >> >> >> >>>>>>> feature.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> Yes, I think it's worth mentioning. I forgot it.
> >> >> >> >>>>>
> >> >> >> >>>>> I saw your patch, but I think I should confirm what the
> >> >> >> >>>>> problem is first.
> >> >> >> >>>>>
> >> >> >> >>>>>> However, when "-d 31" is specified, it will be worse.
> >> >> >> >>>>>> Fewer than 50 buffers are used to cache the compressed pages.
> >> >> >> >>>>>> And even if a page has been filtered, it will still take a
> >> >> >> >>>>>> buffer. So if "-d 31" is specified, the filtered pages will
> >> >> >> >>>>>> use a lot of buffers. Then the pages which need to be
> >> >> >> >>>>>> compressed can't be compressed in parallel.
> >> >> >> >>>>>
> >> >> >> >>>>> Could you explain in more detail why compression will not be
> >> >> >> >>>>> parallel? Using the buffers also for filtered pages sounds
> >> >> >> >>>>> inefficient, but I don't understand why it prevents parallel
> >> >> >> >>>>> compression.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Think about this: with a huge memory, most of the pages will be
> >> >> >> >>>> filtered, and we have 5 buffers.
> >> >> >> >>>>
> >> >> >> >>>> page1       page2     page3     page4     page5     page6       page7     .....
> >> >> >> >>>> [buffer1]   [2]       [3]       [4]       [5]
> >> >> >> >>>> unfiltered  filtered  filtered  filtered  filtered  unfiltered  filtered
> >> >> >> >>>>
> >> >> >> >>>> Since a filtered page will also take a buffer, page6 can't be
> >> >> >> >>>> compressed at the same time as page1.
> >> >> >> >>>> That's why it prevents parallel compression.
> >> >> >> >>>
> >> >> >> >>> Thanks for your explanation, I understand.
> >> >> >> >>> This is just an issue of the current implementation; there is no
> >> >> >> >>> reason to accept this restriction.
> >> >> >> >>>
> >> >> >> >>>>> Further, according to Chao's benchmark, there is a big
> >> >> >> >>>>> performance degradation even if the number of threads is 1
> >> >> >> >>>>> (58s vs 240s).
> >> >> >> >>>>> The current implementation seems to have some problems; we
> >> >> >> >>>>> should solve them.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> If "-d 31" is specified, on the one hand we can't save time by
> >> >> >> >>>> compressing in parallel, and on the other hand we introduce
> >> >> >> >>>> some extra work by adding "--num-threads". So it is obvious
> >> >> >> >>>> that there will be a performance degradation.
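To make the 5-buffer example above concrete, here is a rough, self-contained sketch in plain C. It is not makedumpfile's actual code; the buffer count, the "every 6th page is unfiltered" ratio, and all names are made up for illustration only.

#include <stdio.h>
#include <stdbool.h>

#define NUM_BUFFERS 5    /* small fixed pool, as in the example above */
#define NUM_PAGES   20

/* Hypothetical per-buffer state, just for this sketch. */
struct page_buffer {
    int  page_idx;
    bool needs_compression;   /* false for filtered pages */
};

int main(void)
{
    struct page_buffer ring[NUM_BUFFERS];
    int page = 0;

    while (page < NUM_PAGES) {
        int used = 0;
        int compressible = 0;

        /* Fill the pool in page order; filtered pages still take a slot. */
        for (int i = 0; i < NUM_BUFFERS && page < NUM_PAGES; i++, page++) {
            ring[i].page_idx = page;
            /* Pretend only every 6th page survives filtering (like -d 31). */
            ring[i].needs_compression = (page % 6 == 0);
            if (ring[i].needs_compression)
                compressible++;
            used++;
        }

        /* Only 'compressible' buffers give the worker threads real work. */
        printf("batch of %d buffers ending at page %2d: only %d need compression\n",
               used, page - 1, compressible);
    }
    return 0;
}

With memory that is mostly filtered, each refill of the pool contains at most one page that really needs compression, so even with many compressing threads there is almost nothing to run in parallel; letting filtered pages skip the buffers (or using many more buffers) would avoid this.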
> >> >> >> >>>
> >> >> >> >>> Sure, there must be some overhead due to "some extra work"
> >> >> >> >>> (e.g. the exclusive lock), but "--num-threads=1 is 4 times
> >> >> >> >>> slower than --num-threads=0" still sounds too slow; the
> >> >> >> >>> degradation is too big to be called "some extra work".
> >> >> >> >>>
> >> >> >> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
> >> >> >> >>> so the above "buffer fairness issue" should not be related to
> >> >> >> >>> this degradation.
> >> >> >> >>> What do you think causes this degradation?
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> I can't reproduce such a result at this moment, so I can't do any
> >> >> >> >> further investigation right now. I guess it may be caused by the
> >> >> >> >> underlying implementation of pthread.
> >> >> >> >> I reviewed the test results of patch v2 and found that the
> >> >> >> >> results are quite different on different machines.
> >> >> >> >
> >> >> >> > Hi Zhou Wenjian,
> >> >> >> >
> >> >> >> > I have done more tests on another machine with 128G of memory, and
> >> >> >> > got these results:
> >> >> >> >
> >> >> >> > The size of the vmcore is 300M with "-d 31":
> >> >> >> >
> >> >> >> > makedumpfile -l --message-level 1 -d 31:
> >> >> >> >     time: 8.6s     page-faults: 2272
> >> >> >> >
> >> >> >> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
> >> >> >> >     time: 28.1s    page-faults: 2359
> >> >> >> >
> >> >> >> > And the size of the vmcore is 2.6G with "-d 0".
> >> >> >> > On this machine, I get the same result as yours:
> >> >> >> >
> >> >> >> > makedumpfile -c --message-level 1 -d 0:
> >> >> >> >     time: 597s     page-faults: 2287
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
> >> >> >> >     time: 602s     page-faults: 2361
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
> >> >> >> >     time: 337s     page-faults: 2397
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
> >> >> >> >     time: 175s     page-faults: 2461
> >> >> >> >
> >> >> >> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
> >> >> >> >     time: 103s     page-faults: 2611
> >> >> >> >
> >> >> >> > But the machine of my first test is not under my control. Should I
> >> >> >> > wait for the first machine to do more tests?
> >> >> >> > If there are still some problems in my tests, please tell me.
> >> >> >> >
> >> >> >>
> >> >> >> Thanks a lot for your tests; it seems that there is nothing wrong.
> >> >> >> And I haven't got any ideas for more tests...
> >> >> >>
> >> >> >> Could you provide the information about your CPU?
> >> >> >> I will do some further investigation later.
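For what it's worth, taking the 597s serial "-c -d 0" run above as the baseline, the speedups work out to roughly:

    597 / 602 ≈ 0.99  (--num-threads 1)
    597 / 337 ≈ 1.77  (--num-threads 2)
    597 / 175 ≈ 3.41  (--num-threads 4)
    597 / 103 ≈ 5.80  (--num-threads 8)

So with zlib and -d 0 the threads scale reasonably well, while the -d 31 case (8.6s serial vs 28.1s with --num-threads 1) is the one that regresses.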
> >> >> >>
> >> >> > OK, of course, here is the CPU information:
> >> >> >
> >> >> > # lscpu
> >> >> > Architecture:          x86_64
> >> >> > CPU op-mode(s):        32-bit, 64-bit
> >> >> > Byte Order:            Little Endian
> >> >> > CPU(s):                48
> >> >> > On-line CPU(s) list:   0-47
> >> >> > Thread(s) per core:    1
> >> >> > Core(s) per socket:    6
> >> >> > Socket(s):             8
> >> >> > NUMA node(s):          8
> >> >> > Vendor ID:             AuthenticAMD
> >> >> > CPU family:            16
> >> >> > Model:                 8
> >> >> > Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
> >> >> > Stepping:              0
> >> >> > CPU MHz:               2793.040
> >> >> > BogoMIPS:              5586.22
> >> >> > Virtualization:        AMD-V
> >> >> > L1d cache:             64K
> >> >> > L1i cache:             64K
> >> >> > L2 cache:              512K
> >> >> > L3 cache:              5118K
> >> >> > NUMA node0 CPU(s):     0,8,16,24,32,40
> >> >> > NUMA node1 CPU(s):     1,9,17,25,33,41
> >> >> > NUMA node2 CPU(s):     2,10,18,26,34,42
> >> >> > NUMA node3 CPU(s):     3,11,19,27,35,43
> >> >> > NUMA node4 CPU(s):     4,12,20,28,36,44
> >> >> > NUMA node5 CPU(s):     5,13,21,29,37,45
> >> >> > NUMA node6 CPU(s):     6,14,22,30,38,46
> >> >> > NUMA node7 CPU(s):     7,15,23,31,39,47
> >> >>
> >> >> This CPU assignment on NUMA nodes looks interesting. Is it possible
> >> >> that this affects the performance of makedumpfile? This is just a guess.
> >> >>
> >> >> Could you check whether the performance gets improved if you run each
> >> >> thread on the same NUMA node? For example:
> >> >>
> >> >> # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> >>
> >> > Hi HATAYAMA,
> >> >
> >> > I think your guess is right, but maybe your command has a little problem.
> >> >
> >> > From my tests, NUMA did affect the performance, but not too much.
> >> > The average time with CPUs in the same NUMA node:
> >> > # taskset -c 0,8,16,24,32 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> > is 314s.
> >> > The average time with CPUs in different NUMA nodes:
> >> > # taskset -c 2,3,5,6,7 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> >> > is 354s.
> >> >
> >>
> >> Hmm, according to some previous discussion, what we should see here is
> >> whether it affects the performance of makedumpfile with --num-threads 1
> >> and -d 31. So you need to compare:
> >>
> >> # taskset 0,8 makedumpfile --num-threads 1 -c -d 31 vmcore vmcore-d31
> >>
> >> with:
> >>
> >> # taskset 0 makedumpfile -c -d 0 vmcore vmcore-d31
>
> I removed the -c option wrongly. What I wanted to write is:
>
> # taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
>
> and:
>
> # taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31
>
> just in case...
>

Hi HATAYAMA,

The average time of
# taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
is 33s.

The average time of
# taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31
is 18s.

My test steps:
1. Change /etc/kdump.conf to use "core_collector makedumpfile -l --message-level 1 -d 31".
2. Trigger a crash.
3. cd into the directory of the vmcore made by kdump.
4. In that directory, run
   # taskset -c 0,8 makedumpfile --num-threads 1 -d 31 vmcore vmcore-d31
   or
   # taskset -c 0 makedumpfile -d 31 vmcore vmcore-d31

If there are any problems, please tell me.

Thanks,
Chao Fan

> >>
> >> Also, I'm assuming that you've done these benchmarks in the kdump 1st
> >> kernel, not the kdump 2nd kernel. Is this correct?
> >>
> > Hi HATAYAMA,
> >
> > I tested in the first kernel, not in the kdump second kernel.
> >
>
> I see.
>
> --
> Thanks.
> HATAYAMA, Daisuke
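As a side note on the affinity experiments: taskset pins the whole process from outside. If someone wanted to try pinning only the worker threads from inside the program, a minimal sketch using the GNU pthread_setaffinity_np() extension could look like the following. The CPU list mirrors NUMA node0 from the lscpu output above; the thread function and all names are placeholders, not makedumpfile code. Build with "gcc -pthread".

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* CPUs of NUMA node0 from the lscpu output above (example values). */
static const int node0_cpus[] = { 0, 8, 16, 24, 32, 40 };

/* Placeholder worker; the real work (reading/compressing pages) would go here. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    cpu_set_t set;

    /* Build a CPU mask containing only node0's CPUs. */
    CPU_ZERO(&set);
    for (size_t i = 0; i < sizeof(node0_cpus) / sizeof(node0_cpus[0]); i++)
        CPU_SET(node0_cpus[i], &set);

    for (int i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        /*
         * Restrict the worker to node0's CPUs.  (Applied right after
         * creation; pthread_attr_setaffinity_np() could instead set the
         * mask before the thread starts running.)
         */
        if (pthread_setaffinity_np(tid[i], sizeof(set), &set) != 0)
            fprintf(stderr, "failed to set affinity for thread %d\n", i);
    }

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}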