----- Original Message -----
> From: "Wenjian Zhou" <zhouwj-fnst at cn.fujitsu.com>
> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
> Cc: kexec at lists.infradead.org
> Sent: Thursday, December 10, 2015 5:36:47 PM
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>
> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
> >> Hello Kumagai,
> >>
> >> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
> >>> Hello, Zhou
> >>>
> >>>> On 12/02/2015 03:24 PM, Dave Young wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian" wrote:
> >>>>>> I think there is no problem if the other test results are as expected.
> >>>>>>
> >>>>>> --num-threads mainly reduces the time spent compressing.
> >>>>>> So for lzo, it can't help much most of the time.
> >>>>>
> >>>>> Seems the help of --num-threads does not say it exactly:
> >>>>>
> >>>>>   [--num-threads THREADNUM]:
> >>>>>       Using multiple threads to read and compress data of each page
> >>>>>       in parallel.
> >>>>>       And it will reduces time for saving DUMPFILE.
> >>>>>       This feature only supports creating DUMPFILE in
> >>>>>       kdump-comressed format from
> >>>>>       VMCORE in kdump-compressed format or elf format.
> >>>>>
> >>>>> Lzo is also a compression method; it should be mentioned that
> >>>>> --num-threads only supports zlib-compressed vmcores.
> >>>>>
> >>>>
> >>>> Sorry, it seems that something I said was not so clear.
> >>>> lzo is also supported. But since lzo compresses data at high speed,
> >>>> the performance improvement is not so obvious most of the time.
> >>>>
> >>>>> It is also worth mentioning the recommended -d value for this feature.
> >>>>>
> >>>>
> >>>> Yes, I think it's worth mentioning. I forgot it.
> >>>
> >>> I saw your patch, but I think I should confirm what the problem is first.
> >>>
> >>>> However, when "-d 31" is specified, it will be worse.
> >>>> Fewer than 50 buffers are used to cache the compressed pages.
> >>>> And even if a page has been filtered, it will still take a buffer.
> >>>> So if "-d 31" is specified, the filtered pages will use a lot
> >>>> of buffers. Then the pages which need to be compressed can't
> >>>> be compressed in parallel.
> >>>
> >>> Could you explain in more detail why compression will not be parallel ?
> >>> Using the buffers also for filtered pages does sound
> >>> inefficient.
> >>> However, I don't understand why it prevents parallel compression.
> >>>
> >>
> >> Think about this: in a huge memory, most of the pages will be filtered,
> >> and we have 5 buffers.
> >>
> >>   page1        page2      page3      page4      page5      page6        page7   .....
> >>   [buffer1]    [2]        [3]        [4]        [5]
> >>   unfiltered   filtered   filtered   filtered   filtered   unfiltered   filtered
> >>
> >> Since a filtered page also takes a buffer, when compressing page1,
> >> page6 can't be compressed at the same time.
> >> That's why it prevents parallel compression.
> >
> > Thanks for your explanation, I understand.
> > This is just an issue of the current implementation; there is no
> > reason to accept this restriction.
> >
> >>> Further, according to Chao's benchmark, there is a big performance
> >>> degradation even if the number of threads is 1. (58s vs 240s)
> >>> The current implementation seems to have some problems, we should
> >>> solve them.
> >>>
> >>
> >> If "-d 31" is specified, on the one hand we can't save time by compressing
> >> in parallel, and on the other hand we introduce some extra work by adding
> >> "--num-threads". So it is obvious that there will be a performance
> >> degradation.
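
To make the buffer issue Zhou describes above more concrete, here is a
minimal C sketch. It is not makedumpfile's actual code: the structure
and names are invented, and the 5-slot pool only mirrors the example
above. It shows that when every page, filtered or not, occupies a slot
of a small, strictly ordered buffer pool, a mostly-filtered "-d 31"
dump leaves at most one compressible page in flight at a time.

/*
 * Minimal sketch, NOT makedumpfile code: all names and sizes are
 * invented; the 5-slot pool only mirrors the example above.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_BUFFERS 5                    /* small pool, as in the example */

struct page_buf {
    unsigned long pfn;
    bool filtered;                      /* filtered pages need no compression,
                                           but still hold a slot */
};

static struct page_buf ring[NR_BUFFERS];

int main(void)
{
    /* 1 = unfiltered (must be compressed), 0 = filtered out by -d 31 */
    int pages[] = { 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 };
    int nr_pages = sizeof(pages) / sizeof(pages[0]);
    int head = 0, tail = 0;             /* producer / consumer cursors */

    while (tail < nr_pages) {
        /* Producer: fill every free slot, filtered pages included. */
        while (head < nr_pages && head - tail < NR_BUFFERS) {
            ring[head % NR_BUFFERS].pfn = (unsigned long)head;
            ring[head % NR_BUFFERS].filtered = (pages[head] == 0);
            head++;
        }

        /* Count how many in-flight slots hold compressible work. */
        int compressible = 0;
        for (int i = tail; i < head; i++)
            if (!ring[i % NR_BUFFERS].filtered)
                compressible++;
        printf("pages %d..%d in flight: %d of %d buffers need compression\n",
               tail, head - 1, compressible, head - tail);

        /* Consumer: slots are written out and freed strictly in page order. */
        tail = head;
    }
    return 0;
}

With the mostly-filtered pattern above it reports only 1 compressible
page per 5 in-flight buffers, so even with many threads compression
effectively proceeds one page at a time.
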
> >
> > Sure, there must be some overhead due to "some extra work" (e.g. the
> > exclusive lock), but "--num-threads=1 is 4 times slower than
> > --num-threads=0" still sounds too slow; the degradation is too big to
> > be called "some extra work".
> >
> > Both --num-threads=0 and --num-threads=1 are serial processing, so
> > the above "buffer fairness issue" cannot be related to this degradation.
> > What do you think causes this degradation ?
> >
>
> I can't reproduce such a result at the moment, so I can't do further
> investigation right now. I guess it may be caused by the underlying
> implementation of pthread.
> I reviewed the test results of patch v2 and found that the results are
> quite different on different machines.

Hi Zhou Wenjian,

I have done more tests on another machine with 128G memory, and got
these results.

The size of the vmcore is 300M with "-d 31":

makedumpfile -l --message-level 1 -d 31:
    time: 8.6s      page-faults: 2272
makedumpfile -l --num-threads 1 --message-level 1 -d 31:
    time: 28.1s     page-faults: 2359

And the size of the vmcore is 2.6G with "-d 0". On this machine, I get
the same result as yours:

makedumpfile -c --message-level 1 -d 0:
    time: 597s      page-faults: 2287
makedumpfile -c --num-threads 1 --message-level 1 -d 0:
    time: 602s      page-faults: 2361
makedumpfile -c --num-threads 2 --message-level 1 -d 0:
    time: 337s      page-faults: 2397
makedumpfile -c --num-threads 4 --message-level 1 -d 0:
    time: 175s      page-faults: 2461
makedumpfile -c --num-threads 8 --message-level 1 -d 0:
    time: 103s      page-faults: 2611

But the machine from my first test is not under my control. Should I
wait for that machine to do more tests? If there are still some
problems in my tests, please tell me.

Thanks,
Chao Fan

> It seems that I can get almost the same result as Chao on the
> "PRIMEQUEST 1800E".
>
> ###################################
> - System: PRIMERGY RX300 S6
> - CPU:    Intel(R) Xeon(R) CPU x5660
> - memory: 16GB
> ###################################
> ************ makedumpfile -d 7 ******************
> core-data        0     256
> threads-num
> -l
>  0              10     144
>  4               5     110
>  8               5     111
> 12               6     111
>
> ************ makedumpfile -d 31 ******************
> core-data        0     256
> threads-num
> -l
>  0               0       0
>  4               2       2
>  8               2       3
> 12               2       3
>
> ###################################
> - System: PRIMEQUEST 1800E
> - CPU:    Intel(R) Xeon(R) CPU E7540
> - memory: 32GB
> ###################################
> ************ makedumpfile -d 7 ******************
> core-data        0     256
> threads-num
> -l
>  0              34     270
>  4              63     154
>  8              64     131
> 12              65     159
>
> ************ makedumpfile -d 31 ******************
> core-data        0     256
> threads-num
> -l
>  0               2       1
>  4              48      48
>  8              48      49
> 12              49      50
>
> >> I'm not so sure whether it is a problem that the performance
> >> degradation is so big.
> >> But I think that if it works as expected in the other cases, this
> >> won't be a problem (or a problem that needs to be fixed), since the
> >> performance degradation exists in theory.
> >>
> >> Or the current implementation could be replaced by a new algorithm.
> >> For example:
> >> We can add an array to record whether each page is filtered or not.
> >> And only the unfiltered pages will take a buffer.
> >
> > We should discuss how to implement the new mechanism; I'll mention this later.
> >
> >> But I'm not sure if it is worth it.
> >> Since "-l -d 31" is already fast enough, the new algorithm can't help
> >> much there either.
> >
> > Basically, the faster the better. There is no obvious target time.
> > If there is room for improvement, we should do it.
> >
> > Maybe we can improve the performance of "-c -d 31" in some cases.
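
One way to picture the alternative Zhou sketches above (record the
filter result per page in an array, and let only unfiltered pages take
a buffer) is the following minimal C sketch. It is not a patch and the
names are invented; it only illustrates the idea under that assumption.

/*
 * Minimal sketch, NOT a patch: invented names, following the idea
 * quoted above -- record the filter result per page first, and hand
 * only unfiltered pages to the (few) compression buffers.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES   20
#define NR_BUFFERS 5

struct page_buf {
    unsigned long pfn;                  /* compressed data would live here */
};

static bool is_filtered(unsigned long pfn)
{
    /* Stand-in for the real "-d 31" filtering decision. */
    return pfn % 5 != 0;
}

int main(void)
{
    bool filtered[NR_PAGES];
    struct page_buf ring[NR_BUFFERS];
    int queued = 0;

    /* Pass 1: remember the filter result instead of taking a buffer. */
    for (unsigned long pfn = 0; pfn < NR_PAGES; pfn++)
        filtered[pfn] = is_filtered(pfn);

    /* Pass 2: only unfiltered pages occupy buffer slots, so every slot
     * holds real work for the compression threads; filtered pages never
     * need a slot at all. */
    for (unsigned long pfn = 0; pfn < NR_PAGES; pfn++) {
        if (filtered[pfn])
            continue;
        ring[queued % NR_BUFFERS].pfn = pfn;
        queued++;
    }

    printf("%d unfiltered pages queued across %d buffers\n",
           queued, NR_BUFFERS);
    return 0;
}

With this arrangement, how many pages can be compressed in parallel is
limited by the pool size and thread count rather than by how many
neighbouring pages happen to be filtered.
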
> > BTW, we can easily get the theoretical performance by using the "--split".
>
> --
> Thanks
> Zhou
>
>
> _______________________________________________
> kexec mailing list
> kexec at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>