Chao,

From: Chao Fan <cfan@xxxxxxxxxx>
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Thu, 10 Dec 2015 05:54:28 -0500

>
>
> ----- Original Message -----
>> From: "Wenjian Zhou/???" <zhouwj-fnst at cn.fujitsu.com>
>> To: "Chao Fan" <cfan at redhat.com>
>> Cc: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>, kexec at lists.infradead.org
>> Sent: Thursday, December 10, 2015 6:32:32 PM
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>>
>> On 12/10/2015 05:58 PM, Chao Fan wrote:
>> >
>> >
>> > ----- Original Message -----
>> >> From: "Wenjian Zhou/???" <zhouwj-fnst at cn.fujitsu.com>
>> >> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
>> >> Cc: kexec at lists.infradead.org
>> >> Sent: Thursday, December 10, 2015 5:36:47 PM
>> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> >>
>> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>> >>>> Hello Kumagai,
>> >>>>
>> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>> >>>>> Hello, Zhou
>> >>>>>
>> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/???" wrote:
>> >>>>>>>> I think there is no problem if other test results are as expected.
>> >>>>>>>>
>> >>>>>>>> --num-threads mainly reduces the time of compressing.
>> >>>>>>>> So for lzo, it can't do much help at most of time.
>> >>>>>>>
>> >>>>>>> Seems the help of --num-threads does not say it exactly:
>> >>>>>>>
>> >>>>>>> [--num-threads THREADNUM]:
>> >>>>>>>     Using multiple threads to read and compress data of each page
>> >>>>>>>     in parallel.
>> >>>>>>>     And it will reduces time for saving DUMPFILE.
>> >>>>>>>     This feature only supports creating DUMPFILE in
>> >>>>>>>     kdump-comressed format from
>> >>>>>>>     VMCORE in kdump-compressed format or elf format.
>> >>>>>>>
>> >>>>>>> Lzo is also a compress method, it should be mentioned that
>> >>>>>>> --num-threads only supports zlib compressed vmcore.
>> >>>>>>>
>> >>>>>>
>> >>>>>> Sorry, it seems that something I said is not so clear.
>> >>>>>> lzo is also supported. Since lzo compresses data at a high speed, the
>> >>>>>> improving of the performance is not so obvious at most of time.
>> >>>>>>
>> >>>>>>> Also worth to mention about the recommended -d value for this
>> >>>>>>> feature.
>> >>>>>>>
>> >>>>>>
>> >>>>>> Yes, I think it's worth. I forgot it.
>> >>>>>
>> >>>>> I saw your patch, but I think I should confirm what is the problem
>> >>>>> first.
>> >>>>>
>> >>>>>> However, when "-d 31" is specified, it will be worse.
>> >>>>>> Less than 50 buffers are used to cache the compressed page.
>> >>>>>> And even the page has been filtered, it will also take a buffer.
>> >>>>>> So if "-d 31" is specified, the filtered page will use a lot
>> >>>>>> of buffers. Then the page which needs to be compressed can't
>> >>>>>> be compressed parallel.
>> >>>>>
>> >>>>> Could you explain why compression will not be parallel in more detail ?
>> >>>>> Actually the buffers are used also for filtered pages, it sounds
>> >>>>> inefficient.
>> >>>>> However, I don't understand why it prevents parallel compression.
>> >>>>>
>> >>>>
>> >>>> Think about this, in a huge memory, most of the page will be filtered,
>> >>>> and we have 5 buffers.
>> >>>>
>> >>>> page1       page2     page3     page4     page5     page6       page7     .....
>> >>>> [buffer1]   [2]       [3]       [4]       [5]
>> >>>> unfiltered  filtered  filtered  filtered  filtered  unfiltered  filtered
>> >>>>
>> >>>> Since filtered page will take a buffer, when compressing page1,
>> >>>> page6 can't be compressed at the same time.
>> >>>> That why it will prevent parallel compression.
>> >>>
>> >>> Thanks for your explanation, I understand.
>> >>> This is just an issue of the current implementation, there is no
>> >>> reason to stand this restriction.
>> >>>
>> >>>>> Further, according to Chao's benchmark, there is a big performance
>> >>>>> degradation even if the number of thread is 1.
>> >>>>> (58s vs 240s)
>> >>>>> The current implementation seems to have some problems, we should
>> >>>>> solve them.
>> >>>>>
>> >>>>
>> >>>> If "-d 31" is specified, on the one hand we can't save time by
>> >>>> compressing parallel, on the other hand we will introduce some extra
>> >>>> work by adding "--num-threads". So it is obvious that it will have a
>> >>>> performance degradation.
>> >>>
>> >>> Sure, there must be some overhead due to "some extra work" (e.g.
>> >>> exclusive lock),
>> >>> but "--num-threads=1 is 4 times slower than --num-threads=0" still sounds
>> >>> too slow, the degradation is too big to be called "some extra work".
>> >>>
>> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
>> >>> the above "buffer fairness issue" will not be related to this
>> >>> degradation.
>> >>> What do you think what make this degradation ?
>> >>>
>> >>
>> >> I can't get such result at this moment, so I can't do some further
>> >> investigation right now. I guess it may be caused by the underlying
>> >> implementation of pthread.
>> >> I reviewed the test result of the patch v2 and found in different
>> >> machines, the results are quite different.
>> >
>> > Hi Zhou Wenjian,
>> >
>> > I have done more tests in another machine with 128G memory, and get the
>> > result:
>> >
>> > the size of vmcore is 300M in "-d 31"
>> > makedumpfile -l --message-level 1 -d 31:
>> > time: 8.6s      page-faults: 2272
>> >
>> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
>> > time: 28.1s     page-faults: 2359
>> >
>> >
>> > and the size of vmcore is 2.6G in "-d 0".
>> > In this machine, I get the same result as yours:
>> >
>> >
>> > makedumpfile -c --message-level 1 -d 0:
>> > time: 597s      page-faults: 2287
>> >
>> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
>> > time: 602s      page-faults: 2361
>> >
>> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
>> > time: 337s      page-faults: 2397
>> >
>> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
>> > time: 175s      page-faults: 2461
>> >
>> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
>> > time: 103s      page-faults: 2611
>> >
>> >
>> > But the machine of my first test is not under my control, should I wait for
>> > the first machine to do more tests?
>> > If there are still some problems in my tests, please tell me.
>> >
>>
>> Thanks a lot for your test, it seems that there is nothing wrong.
>> And I haven't got any idea about more tests...
>>
>> Could you provide the information of your cpu ?
>> I will do some further investigation later.
>>
>
> OK, of course, here is the information of cpu:
>
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                48
> On-line CPU(s) list:   0-47
> Thread(s) per core:    1
> Core(s) per socket:    6
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             AuthenticAMD
> CPU family:            16
> Model:                 8
> Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
> Stepping:              0
> CPU MHz:               2793.040
> BogoMIPS:              5586.22
> Virtualization:        AMD-V
> L1d cache:             64K
> L1i cache:             64K
> L2 cache:              512K
> L3 cache:              5118K
> NUMA node0 CPU(s):     0,8,16,24,32,40
> NUMA node1 CPU(s):     1,9,17,25,33,41
> NUMA node2 CPU(s):     2,10,18,26,34,42
> NUMA node3 CPU(s):     3,11,19,27,35,43
> NUMA node4 CPU(s):     4,12,20,28,36,44
> NUMA node5 CPU(s):     5,13,21,29,37,45
> NUMA node6 CPU(s):     6,14,22,30,38,46
> NUMA node7 CPU(s):     7,15,23,31,39,47

This CPU assignment on NUMA nodes looks interesting. Is it possible that
this affects performance of makedumpfile? This is just a guess.
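
To make the layout concrete, here is a minimal Python sketch. It is only an
illustration and not part of makedumpfile; the helper names parse_numa_nodes
and node_of are made up for this mail. It parses the "NUMA nodeN CPU(s)" lines
of the lscpu output quoted above and shows that consecutive CPU numbers fall
on different nodes, while the CPUs of a single node are 8 apart.

```python
import re

# NUMA lines copied from the lscpu output quoted above.
LSCPU_NUMA = """\
NUMA node0 CPU(s):     0,8,16,24,32,40
NUMA node1 CPU(s):     1,9,17,25,33,41
NUMA node2 CPU(s):     2,10,18,26,34,42
NUMA node3 CPU(s):     3,11,19,27,35,43
NUMA node4 CPU(s):     4,12,20,28,36,44
NUMA node5 CPU(s):     5,13,21,29,37,45
NUMA node6 CPU(s):     6,14,22,30,38,46
NUMA node7 CPU(s):     7,15,23,31,39,47
"""


def parse_numa_nodes(text):
    """Return {node_id: [cpu, ...]} from 'NUMA nodeN CPU(s):' lines.

    Assumption: CPU lists are comma-separated numbers, as in the output
    above. Ranges such as '0-5' are not handled by this sketch.
    """
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"NUMA node(\d+) CPU\(s\):\s*([\d,]+)", line.strip())
        if m:
            nodes[int(m.group(1))] = [int(c) for c in m.group(2).split(",")]
    return nodes


def node_of(cpu, nodes):
    """Return the NUMA node that hosts the given CPU number."""
    for node, cpus in nodes.items():
        if cpu in cpus:
            return node
    raise ValueError("CPU %d not found in any NUMA node" % cpu)


nodes = parse_numa_nodes(LSCPU_NUMA)

# CPUs 0..3 each sit on a different node.
print([node_of(c, nodes) for c in range(4)])   # [0, 1, 2, 3]

# The first four CPUs of node 0, in the form "taskset -c" accepts.
print(",".join(str(c) for c in nodes[0][:4]))  # 0,8,16,24
```

So if the threads happened to run on CPUs 0-3, they would be spread over four
different nodes, whereas CPUs 0,8,16,24 all belong to node 0. Whether this
placement actually explains the slowdown is, again, only a guess.
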
Could you check whether the performance improves if you run all the
threads on the same NUMA node? For example:

  # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0

Also, if this were the cause of this performance degradation, we might
need to extend the nr_cpus= kernel option to choose which NUMA nodes we
use; though, we might already be able to do that in combination with
other kernel features.

> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
>
>> But I still believe it's better not to use "-l -d 31" and "--num-threads"
>> at the same time, though it's very strange that the performance
>> degradation is so big.
>>
>> --
>> Thanks
>> Zhou
>>
>> > Thanks,
>> > Chao Fan
>> >
>> >
>> >>
>> >> It seems that I can get almost the same result of Chao from
>> >> "PRIMEQUEST 1800E".
>> >>
>> >> ###################################
>> >> - System: PRIMERGY RX300 S6
>> >> - CPU: Intel(R) Xeon(R) CPU x5660
>> >> - memory: 16GB
>> >> ###################################
>> >> ************ makedumpfile -d 7 ******************
>> >>                 core-data       0       256
>> >>         threads-num
>> >> -l
>> >>                 0               10      144
>> >>                 4               5       110
>> >>                 8               5       111
>> >>                 12              6       111
>> >>
>> >> ************ makedumpfile -d 31 ******************
>> >>                 core-data       0       256
>> >>         threads-num
>> >> -l
>> >>                 0               0       0
>> >>                 4               2       2
>> >>                 8               2       3
>> >>                 12              2       3
>> >>
>> >> ###################################
>> >> - System: PRIMEQUEST 1800E
>> >> - CPU: Intel(R) Xeon(R) CPU E7540
>> >> - memory: 32GB
>> >> ###################################
>> >> ************ makedumpfile -d 7 ******************
>> >>                 core-data       0       256
>> >>         threads-num
>> >> -l
>> >>                 0               34      270
>> >>                 4               63      154
>> >>                 8               64      131
>> >>                 12              65      159
>> >>
>> >> ************ makedumpfile -d 31 ******************
>> >>                 core-data       0       256
>> >>         threads-num
>> >> -l
>> >>                 0               2       1
>> >>                 4               48      48
>> >>                 8               48      49
>> >>                 12              49      50
>> >>
>> >>>> I'm not so sure if it is a problem that the performance degradation
>> >>>> is so big.
>> >>>> But I think if in other cases, it works as expected, this won't be a
>> >>>> problem (or a problem needs to be fixed), for the performance
>> >>>> degradation existing in theory.
>> >>>>
>> >>>> Or the current implementation should be replaced by a new arithmetic.
>> >>>> For example:
>> >>>> We can add an array to record whether the page is filtered or not.
>> >>>> And only the unfiltered page will take the buffer.
>> >>>
>> >>> We should discuss how to implement new mechanism, I'll mention this
>> >>> later.
>> >>>
>> >>>> But I'm not sure if it is worth.
>> >>>> For "-l -d 31" is fast enough, the new arithmetic also can't do much
>> >>>> help.
>> >>>
>> >>> Basically the faster, the better. There is no obvious target time.
>> >>> If there is room for improvement, we should do it.
>> >>>
>> >>
>> >> Maybe we can improve the performance of "-c -d 31" in some case.
>> >>
>> >> BTW, we can easily get the theoretical performance by using the "--split".
>> >>
>> >> --
>> >> Thanks
>> >> Zhou
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> kexec mailing list
>> >> kexec at lists.infradead.org
>> >> http://lists.infradead.org/mailman/listinfo/kexec
>> >>

--
Thanks.
HATAYAMA, Daisuke