On 12/10/2015 05:58 PM, Chao Fan wrote:
>
> ----- Original Message -----
>> From: "Wenjian Zhou/???" <zhouwj-fnst at cn.fujitsu.com>
>> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
>> Cc: kexec at lists.infradead.org
>> Sent: Thursday, December 10, 2015 5:36:47 PM
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>>
>> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>>>> Hello Kumagai,
>>>>
>>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>>>>> Hello, Zhou
>>>>>
>>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/???" wrote:
>>>>>>>> I think there is no problem if the other test results are as
>>>>>>>> expected.
>>>>>>>>
>>>>>>>> --num-threads mainly reduces the time spent compressing,
>>>>>>>> so for lzo it can't help much most of the time.
>>>>>>>
>>>>>>> It seems the help text for --num-threads does not say that exactly:
>>>>>>>
>>>>>>>   [--num-threads THREADNUM]:
>>>>>>>       Using multiple threads to read and compress data of each page
>>>>>>>       in parallel.
>>>>>>>       And it will reduces time for saving DUMPFILE.
>>>>>>>       This feature only supports creating DUMPFILE in
>>>>>>>       kdump-comressed format from
>>>>>>>       VMCORE in kdump-compressed format or elf format.
>>>>>>>
>>>>>>> Lzo is also a compression method; it should be mentioned that
>>>>>>> --num-threads only supports zlib-compressed vmcores.
>>>>>>>
>>>>>>
>>>>>> Sorry, it seems that what I said was not clear.
>>>>>> lzo is also supported. But since lzo compresses data at high speed,
>>>>>> the performance improvement is usually not very noticeable.
>>>>>>
>>>>>>> It is also worth mentioning the recommended -d value for this
>>>>>>> feature.
>>>>>>>
>>>>>>
>>>>>> Yes, I think it's worth mentioning. I forgot it.
>>>>>
>>>>> I saw your patch, but I think I should first confirm what the
>>>>> problem is.
>>>>>
>>>>>> However, when "-d 31" is specified, it will be worse.
>>>>>> Fewer than 50 buffers are used to cache the compressed pages,
>>>>>> and even a page that has been filtered takes a buffer.
>>>>>> So if "-d 31" is specified, the filtered pages will occupy most of
>>>>>> the buffers, and the pages which actually need to be compressed
>>>>>> can't be compressed in parallel.
>>>>>
>>>>> Could you explain in more detail why compression will not be
>>>>> parallel? Using the buffers for filtered pages as well does sound
>>>>> inefficient, but I don't understand why it prevents parallel
>>>>> compression.
>>>>>
>>>>
>>>> Consider this: with a huge memory, most of the pages will be
>>>> filtered, and we have 5 buffers.
>>>>
>>>> page1      page2    page3    page4    page5    page6      page7 .....
>>>> [buffer1]  [2]      [3]      [4]      [5]
>>>> unfiltered filtered filtered filtered filtered unfiltered filtered
>>>>
>>>> Since a filtered page also takes a buffer, page6 can't be compressed
>>>> while page1 is being compressed (see the sketch below).
>>>> That's why it prevents parallel compression.
>>>
>>> Thanks for your explanation, I understand now.
>>> This is just an issue of the current implementation; there is no
>>> reason to keep this restriction.
>>>
>>>>> Further, according to Chao's benchmark, there is a big performance
>>>>> degradation even when the number of threads is 1 (58s vs 240s).
>>>>> The current implementation seems to have some problems; we should
>>>>> solve them.
>>>>>
>>>>
>>>> If "-d 31" is specified, then on the one hand we can't save time by
>>>> compressing in parallel, and on the other hand "--num-threads"
>>>> introduces some extra work. So it is obvious that there will be some
>>>> performance degradation.
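To make the buffer contention in the diagram above concrete, here is a
minimal C sketch (not makedumpfile code; the buffer count and page
layout are taken from the diagram, all names are invented). It models a
scheme where every page, filtered or not, claims one of NR_BUFFERS
slots in page order, and shows that page1 and page6 are never
compressed at the same time:

    #include <stdio.h>

    #define NR_BUFFERS 5
    #define NR_PAGES   7

    int main(void)
    {
        /* 1 = page survives filtering and must be compressed
         * (page1 and page6 from the diagram; 0-based here). */
        int unfiltered[NR_PAGES] = { 1, 0, 0, 0, 0, 1, 0 };
        int in_flight = 0, max_parallel = 0;

        for (int page = 0; page < NR_PAGES; page++) {
            /* The ring holds NR_BUFFERS pages; before this page can
             * take a slot, the slot of (page - NR_BUFFERS) must have
             * been drained, i.e. its compression already finished. */
            if (page >= NR_BUFFERS && unfiltered[page - NR_BUFFERS])
                in_flight--;
            if (unfiltered[page])
                in_flight++;
            if (in_flight > max_parallel)
                max_parallel = in_flight;
        }
        /* Prints 1: page1 and page6 are never in flight together. */
        printf("max pages compressed in parallel: %d\n", max_parallel);
        return 0;
    }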
>>>
>>> Sure, there must be some overhead due to "some extra work" (e.g. an
>>> exclusive lock), but "--num-threads=1 is 4 times slower than
>>> --num-threads=0" still sounds too slow; the degradation is too big to
>>> be called "some extra work".
>>>
>>> Both --num-threads=0 and --num-threads=1 are serial processing, so the
>>> "buffer fairness issue" above cannot be related to this degradation.
>>> What do you think causes it?
>>>
>>
>> I can't reproduce that result at the moment, so I can't investigate
>> further right now. I guess it may be caused by the underlying pthread
>> implementation.
>> I reviewed the test results of patch v2 and found that the results are
>> quite different on different machines.

> Hi Zhou Wenjian,
>
> I have done more tests on another machine with 128G of memory, and got
> these results.
>
> The size of the vmcore is 300M with "-d 31":
>
> makedumpfile -l --message-level 1 -d 31:
>     time: 8.6s     page-faults: 2272
>
> makedumpfile -l --num-threads 1 --message-level 1 -d 31:
>     time: 28.1s    page-faults: 2359
>
> And the size of the vmcore is 2.6G with "-d 0".
> On this machine, I get the same results as yours:
>
> makedumpfile -c --message-level 1 -d 0:
>     time: 597s     page-faults: 2287
>
> makedumpfile -c --num-threads 1 --message-level 1 -d 0:
>     time: 602s     page-faults: 2361
>
> makedumpfile -c --num-threads 2 --message-level 1 -d 0:
>     time: 337s     page-faults: 2397
>
> makedumpfile -c --num-threads 4 --message-level 1 -d 0:
>     time: 175s     page-faults: 2461
>
> makedumpfile -c --num-threads 8 --message-level 1 -d 0:
>     time: 103s     page-faults: 2611
>
> But the machine of my first test is not under my control; should I wait
> for that machine to do more tests?
> If there are still problems in my tests, please tell me.

Thanks a lot for your tests; it seems there is nothing wrong with them.
And I haven't got any ideas for further tests yet...
Could you provide the information about your CPU? I will do some further
investigation later.

But I still believe it's better not to use "-l -d 31" and "--num-threads"
at the same time, though it's very strange that the performance
degradation is so big.

--
Thanks
Zhou

> Thanks,
> Chao Fan
>
>
>>
>> It seems that I can get almost the same results as Chao's from the
>> "PRIMEQUEST 1800E".
>>
>> ###################################
>> - System: PRIMERGY RX300 S6
>> - CPU: Intel(R) Xeon(R) CPU X5660
>> - memory: 16GB
>> ###################################
>> ************ makedumpfile -d 7 ****************** (-l, time in seconds)
>> threads-num \ core-data      0     256
>>  0                          10     144
>>  4                           5     110
>>  8                           5     111
>> 12                           6     111
>>
>> ************ makedumpfile -d 31 ******************
>> threads-num \ core-data      0     256
>>  0                           0       0
>>  4                           2       2
>>  8                           2       3
>> 12                           2       3
>>
>> ###################################
>> - System: PRIMEQUEST 1800E
>> - CPU: Intel(R) Xeon(R) CPU E7540
>> - memory: 32GB
>> ###################################
>> ************ makedumpfile -d 7 ****************** (-l, time in seconds)
>> threads-num \ core-data      0     256
>>  0                          34     270
>>  4                          63     154
>>  8                          64     131
>> 12                          65     159
>>
>> ************ makedumpfile -d 31 ******************
>> threads-num \ core-data      0     256
>>  0                           2       1
>>  4                          48      48
>>  8                          48      49
>> 12                          49      50
>>
>>>> I'm not so sure whether such a big performance degradation is really
>>>> a problem. I think that if it works as expected in the other cases,
>>>> this won't be a problem (or a problem that needs to be fixed), since
>>>> the performance degradation exists in theory.
>>>>
>>>> Or the current implementation could be replaced by a new algorithm.
>>>> For example:
>>>> We could add an array to record whether each page is filtered or not,
>>>> so that only unfiltered pages take a buffer (see the sketch below).
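As a rough illustration of that idea, here is a minimal C sketch
(hypothetical names, not the actual patch): a separate per-page flag
array records the filtering result, and the page distributor skips
filtered pages so that every buffer slot holds a page that really needs
compression:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_BUFFERS 5
    #define NR_PAGES   7

    /* Filled in during the filtering pass. */
    static bool page_is_filtered[NR_PAGES];

    int main(void)
    {
        /* Same layout as the earlier diagram: only page1 and page6
         * survive filtering (0-based indices 0 and 5). */
        for (int i = 0; i < NR_PAGES; i++)
            page_is_filtered[i] = true;
        page_is_filtered[0] = page_is_filtered[5] = false;

        int used = 0;
        for (int page = 0; page < NR_PAGES && used < NR_BUFFERS; page++) {
            if (page_is_filtered[page])
                continue;               /* consumes no buffer */
            used++;                     /* this page gets a slot */
            printf("page%d -> buffer%d (compressible in parallel)\n",
                   page + 1, used);
        }
        return 0;
    }

With this scheme, page1 and page6 occupy buffer1 and buffer2 at the
same time, so both can be handed to compression threads concurrently;
the writer would still emit pages in page order, skipping the filtered
ones.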
>>>
>>> We should discuss how to implement the new mechanism; I'll get to
>>> this later.
>>>
>>>> But I'm not sure whether it is worth it.
>>>> Since "-l -d 31" is already fast enough, the new algorithm can't
>>>> help much there either.
>>>
>>> Basically, the faster, the better; there is no particular target
>>> time. If there is room for improvement, we should do it.
>>>
>>
>> Maybe we can improve the performance of "-c -d 31" in some cases.
>>
>> BTW, we can easily get the theoretical performance by using "--split".
>>
>> --
>> Thanks
>> Zhou
>>
>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>>
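On the "--split" remark above: --split writes the dump with multiple
cooperating makedumpfile processes (one per output file), which gives a
rough upper bound on what thread-level parallelism could achieve. A
hedged example, assuming the usual --split syntax and four output files:

    # one process per output file, compressing in parallel
    makedumpfile -c -d 0 --split /proc/vmcore dump1 dump2 dump3 dump4

    # the pieces can later be merged into a single dumpfile
    makedumpfile --reassemble dump1 dump2 dump3 dump4 dumpfile

Comparing the wall-clock time of the --split run against a
single-process run approximates the "theoretical performance"
mentioned above.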