On 12/10/2015 05:58 PM, Chao Fan wrote:
>
> ----- Original Message -----
>> From: "Wenjian Zhou/???" <zhouwj-fnst at cn.fujitsu.com>
>> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
>> Cc: kexec at lists.infradead.org
>> Sent: Thursday, December 10, 2015 5:36:47 PM
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>>
>> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>>>> Hello Kumagai,
>>>>
>>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>>>>> Hello, Zhou
>>>>>
>>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/???" wrote:
>>>>>>>> I think there is no problem if the other test results are as
>>>>>>>> expected.
>>>>>>>>
>>>>>>>> --num-threads mainly reduces the time spent compressing,
>>>>>>>> so for lzo it can't help much most of the time.
>>>>>>>
>>>>>>> It seems the help text for --num-threads does not say that exactly:
>>>>>>>
>>>>>>>   [--num-threads THREADNUM]:
>>>>>>>       Using multiple threads to read and compress data of each page
>>>>>>>       in parallel.
>>>>>>>       And it will reduces time for saving DUMPFILE.
>>>>>>>       This feature only supports creating DUMPFILE in
>>>>>>>       kdump-comressed format from
>>>>>>>       VMCORE in kdump-compressed format or elf format.
>>>>>>>
>>>>>>> Lzo is also a compression method; it should be mentioned that
>>>>>>> --num-threads only supports zlib-compressed vmcores.
>>>>>>>
>>>>>>
>>>>>> Sorry, it seems that what I said was not clear.
>>>>>> lzo is also supported. But since lzo compresses data at high speed,
>>>>>> the performance improvement is usually not very noticeable.
>>>>>>
>>>>>>> It is also worth mentioning the recommended -d value for this
>>>>>>> feature.
>>>>>>>
>>>>>>
>>>>>> Yes, I think it's worth mentioning. I forgot it.
>>>>>
>>>>> I saw your patch, but I think I should first confirm what the
>>>>> problem is.
>>>>>
>>>>>> However, when "-d 31" is specified, it will be worse.
>>>>>> Fewer than 50 buffers are used to cache the compressed pages,
>>>>>> and even a page that has been filtered takes a buffer.
>>>>>> So if "-d 31" is specified, the filtered pages will occupy most of
>>>>>> the buffers, and the pages which actually need to be compressed
>>>>>> can't be compressed in parallel.
>>>>>
>>>>> Could you explain in more detail why compression will not be
>>>>> parallel? Using the buffers for filtered pages as well does sound
>>>>> inefficient, but I don't understand why it prevents parallel
>>>>> compression.
>>>>>
>>>>
>>>> Consider this: with a huge memory, most of the pages will be
>>>> filtered, and we have 5 buffers.
>>>>
>>>> page1      page2    page3    page4    page5    page6      page7 .....
>>>> [buffer1]  [2]      [3]      [4]      [5]
>>>> unfiltered filtered filtered filtered filtered unfiltered filtered
>>>>
>>>> Since a filtered page also takes a buffer, page6 can't be compressed
>>>> while page1 is being compressed (see the sketch below).
>>>> That's why it prevents parallel compression.
>>>
>>> Thanks for your explanation, I understand now.
>>> This is just an issue of the current implementation; there is no
>>> reason to keep this restriction.
>>>
>>>>> Further, according to Chao's benchmark, there is a big performance
>>>>> degradation even when the number of threads is 1 (58s vs 240s).
>>>>> The current implementation seems to have some problems; we should
>>>>> solve them.
>>>>>
>>>>
>>>> If "-d 31" is specified, then on the one hand we can't save time by
>>>> compressing in parallel, and on the other hand "--num-threads"
>>>> introduces some extra work. So it is obvious that there will be some
>>>> performance degradation.
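To make the buffer contention in the diagram above concrete, here is a
minimal C sketch (not makedumpfile code; the buffer count and page
layout are taken from the diagram, all names are invented). It models a
scheme where every page, filtered or not, claims one of NR_BUFFERS
slots in page order, and shows that page1 and page6 are never
compressed at the same time:

    #include <stdio.h>

    #define NR_BUFFERS 5
    #define NR_PAGES   7

    int main(void)
    {
        /* 1 = page survives filtering and must be compressed
         * (page1 and page6 from the diagram; 0-based here). */
        int unfiltered[NR_PAGES] = { 1, 0, 0, 0, 0, 1, 0 };
        int in_flight = 0, max_parallel = 0;

        for (int page = 0; page < NR_PAGES; page++) {
            /* The ring holds NR_BUFFERS pages; before this page can
             * take a slot, the slot of (page - NR_BUFFERS) must have
             * been drained, i.e. its compression already finished. */
            if (page >= NR_BUFFERS && unfiltered[page - NR_BUFFERS])
                in_flight--;
            if (unfiltered[page])
                in_flight++;
            if (in_flight > max_parallel)
                max_parallel = in_flight;
        }
        /* Prints 1: page1 and page6 are never in flight together. */
        printf("max pages compressed in parallel: %d\n", max_parallel);
        return 0;
    }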
>>>
>>> Sure, there must be some overhead due to "some extra work" (e.g. an
>>> exclusive lock), but "--num-threads=1 is 4 times slower than
>>> --num-threads=0" still sounds too slow; the degradation is too big to
>>> be called "some extra work".
>>>
>>> Both --num-threads=0 and --num-threads=1 are serial processing, so the
>>> "buffer fairness issue" above cannot be related to this degradation.
>>> What do you think causes it?
>>>
>>
>> I can't reproduce that result at the moment, so I can't investigate
>> further right now. I guess it may be caused by the underlying pthread
>> implementation.
>> I reviewed the test results of patch v2 and found that the results are
>> quite different on different machines.

> Hi Zhou Wenjian,
>
> I have done more tests on another machine with 128G of memory, and got
> these results.
>
> The size of the vmcore is 300M with "-d 31":
>
> makedumpfile -l --message-level 1 -d 31:
>     time: 8.6s     page-faults: 2272
>
> makedumpfile -l --num-threads 1 --message-level 1 -d 31:
>     time: 28.1s    page-faults: 2359
>
> And the size of the vmcore is 2.6G with "-d 0".
> On this machine, I get the same results as yours:
>
> makedumpfile -c --message-level 1 -d 0:
>     time: 597s     page-faults: 2287
>
> makedumpfile -c --num-threads 1 --message-level 1 -d 0:
>     time: 602s     page-faults: 2361
>
> makedumpfile -c --num-threads 2 --message-level 1 -d 0:
>     time: 337s     page-faults: 2397
>
> makedumpfile -c --num-threads 4 --message-level 1 -d 0:
>     time: 175s     page-faults: 2461
>
> makedumpfile -c --num-threads 8 --message-level 1 -d 0:
>     time: 103s     page-faults: 2611
>
> But the machine of my first test is not under my control; should I wait
> for that machine to do more tests?
> If there are still problems in my tests, please tell me.

Thanks a lot for your tests; it seems there is nothing wrong with them.
And I haven't got any ideas for further tests yet...
Could you provide the information about your CPU? I will do some further
investigation later.

But I still believe it's better not to use "-l -d 31" and "--num-threads"
at the same time, though it's very strange that the performance
degradation is so big.

--
Thanks
Zhou

> Thanks,
> Chao Fan
>
>
>>
>> It seems that I can get almost the same results as Chao's from the
>> "PRIMEQUEST 1800E".
>>
>> ###################################
>> - System: PRIMERGY RX300 S6
>> - CPU: Intel(R) Xeon(R) CPU X5660
>> - memory: 16GB
>> ###################################
>> ************ makedumpfile -d 7 ****************** (-l, time in seconds)
>> threads-num \ core-data      0     256
>>  0                          10     144
>>  4                           5     110
>>  8                           5     111
>> 12                           6     111
>>
>> ************ makedumpfile -d 31 ******************
>> threads-num \ core-data      0     256
>>  0                           0       0
>>  4                           2       2
>>  8                           2       3
>> 12                           2       3
>>
>> ###################################
>> - System: PRIMEQUEST 1800E
>> - CPU: Intel(R) Xeon(R) CPU E7540
>> - memory: 32GB
>> ###################################
>> ************ makedumpfile -d 7 ****************** (-l, time in seconds)
>> threads-num \ core-data      0     256
>>  0                          34     270
>>  4                          63     154
>>  8                          64     131
>> 12                          65     159
>>
>> ************ makedumpfile -d 31 ******************
>> threads-num \ core-data      0     256
>>  0                           2       1
>>  4                          48      48
>>  8                          48      49
>> 12                          49      50
>>
>>>> I'm not so sure whether such a big performance degradation is really
>>>> a problem. I think that if it works as expected in the other cases,
>>>> this won't be a problem (or a problem that needs to be fixed), since
>>>> the performance degradation exists in theory.
>>>>
>>>> Or the current implementation could be replaced by a new algorithm.
>>>> For example:
>>>> We could add an array to record whether each page is filtered or not,
>>>> so that only unfiltered pages take a buffer (see the sketch below).
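As a rough illustration of that idea, here is a minimal C sketch
(hypothetical names, not the actual patch): a separate per-page flag
array records the filtering result, and the page distributor skips
filtered pages so that every buffer slot holds a page that really needs
compression:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_BUFFERS 5
    #define NR_PAGES   7

    /* Filled in during the filtering pass. */
    static bool page_is_filtered[NR_PAGES];

    int main(void)
    {
        /* Same layout as the earlier diagram: only page1 and page6
         * survive filtering (0-based indices 0 and 5). */
        for (int i = 0; i < NR_PAGES; i++)
            page_is_filtered[i] = true;
        page_is_filtered[0] = page_is_filtered[5] = false;

        int used = 0;
        for (int page = 0; page < NR_PAGES && used < NR_BUFFERS; page++) {
            if (page_is_filtered[page])
                continue;               /* consumes no buffer */
            used++;                     /* this page gets a slot */
            printf("page%d -> buffer%d (compressible in parallel)\n",
                   page + 1, used);
        }
        return 0;
    }

With this scheme, page1 and page6 occupy buffer1 and buffer2 at the
same time, so both can be handed to compression threads concurrently;
the writer would still emit pages in page order, skipping the filtered
ones.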
>>>
>>> We should discuss how to implement the new mechanism; I'll get to
>>> this later.
>>>
>>>> But I'm not sure whether it is worth it.
>>>> Since "-l -d 31" is already fast enough, the new algorithm can't
>>>> help much there either.
>>>
>>> Basically, the faster, the better; there is no particular target
>>> time. If there is room for improvement, we should do it.
>>>
>>
>> Maybe we can improve the performance of "-c -d 31" in some cases.
>>
>> BTW, we can easily get the theoretical performance by using "--split".
>>
>> --
>> Thanks
>> Zhou
>>
>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>>
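On the "--split" remark above: --split writes the dump with multiple
cooperating makedumpfile processes (one per output file), which gives a
rough upper bound on what thread-level parallelism could achieve. A
hedged example, assuming the usual --split syntax and four output files:

    # one process per output file, compressing in parallel
    makedumpfile -c -d 0 --split /proc/vmcore dump1 dump2 dump3 dump4

    # the pieces can later be merged into a single dumpfile
    makedumpfile --reassemble dump1 dump2 dump3 dump4 dumpfile

Comparing the wall-clock time of the --split run against a
single-process run approximates the "theoretical performance"
mentioned above.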