On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>> Hello Kumagai,
>>
>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>>> Hello, Zhou
>>>
>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>>>>> Hi,
>>>>>
>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/???" wrote:
>>>>>> I think there is no problem if the other test results are as expected.
>>>>>>
>>>>>> --num-threads mainly reduces the compression time,
>>>>>> so for lzo it can't help much most of the time.
>>>>>
>>>>> It seems the help text of --num-threads does not say that exactly:
>>>>>
>>>>>   [--num-threads THREADNUM]:
>>>>>     Using multiple threads to read and compress data of each page in parallel.
>>>>>     And it will reduces time for saving DUMPFILE.
>>>>>     This feature only supports creating DUMPFILE in kdump-comressed format from
>>>>>     VMCORE in kdump-compressed format or elf format.
>>>>>
>>>>> Lzo is also a compression method; it should be mentioned that --num-threads
>>>>> only supports zlib-compressed vmcores.
>>>>>
>>>>
>>>> Sorry, it seems that something I said was not clear.
>>>> lzo is also supported. Since lzo compresses data at high speed,
>>>> the performance improvement is usually not so obvious.
>>>>
>>>>> It is also worth mentioning the recommended -d value for this feature.
>>>>>
>>>>
>>>> Yes, I think it's worth mentioning. I forgot it.
>>>
>>> I saw your patch, but I think I should confirm what the problem is first.
>>>
>>>> However, when "-d 31" is specified, it will be worse.
>>>> Fewer than 50 buffers are used to cache the compressed pages,
>>>> and even a page that has been filtered out still takes a buffer.
>>>> So if "-d 31" is specified, the filtered pages will use a lot
>>>> of buffers, and the pages which actually need to be compressed
>>>> can't be compressed in parallel.
>>>
>>> Could you explain in more detail why compression will not be parallel?
>>> Certainly, using buffers also for filtered pages sounds inefficient,
>>> but I don't understand why it prevents parallel compression.
>>>
>>
>> Think about this: with a huge memory, most of the pages will be filtered,
>> and we have 5 buffers.
>>
>> page1        page2     page3     page4     page5     page6       page7    .....
>> [buffer1]    [2]       [3]       [4]       [5]
>> unfiltered   filtered  filtered  filtered  filtered  unfiltered  filtered
>>
>> Since a filtered page also takes a buffer, page6 can't be compressed
>> while page1 is being compressed.
>> That's why it prevents parallel compression.
>
> Thanks for your explanation, I understand.
> This is just an issue of the current implementation; there is no
> reason to accept this restriction.
>
>>> Further, according to Chao's benchmark, there is a big performance
>>> degradation even if the number of threads is 1 (58s vs. 240s).
>>> The current implementation seems to have some problems, and we should
>>> solve them.
>>>
>>
>> If "-d 31" is specified, on the one hand we can't save time by compressing
>> in parallel, and on the other hand we introduce some extra work with
>> "--num-threads". So it is obvious that there will be some performance degradation.
>
> Sure, there must be some overhead due to "some extra work" (e.g. exclusive locking),
> but "--num-threads=1 is 4 times slower than --num-threads=0" still sounds
> too slow; the degradation is too big to be called "some extra work".
>
> Both --num-threads=0 and --num-threads=1 are serial processing,
> so the above "buffer fairness issue" should not be related to this degradation.
> What do you think causes this degradation?
> I can't get such a result at the moment, so I can't do any further investigation right now.
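
(Just to restate the buffer issue from the quoted discussion in code form, here is a toy
example. The names and numbers are invented and it is not the real makedumpfile code; it
only mimics the page1..page7 situation above, and it also shows why the idea further below,
letting only unfiltered pages take a buffer, should help.)

/* Toy illustration only -- not makedumpfile code. */
#include <stdio.h>

#define NR_BUFFERS 5
#define NR_PAGES   7

int main(void)
{
        /* Only page1 and page6 survive filtering, as in the example above. */
        int filtered[NR_PAGES] = { 0, 1, 1, 1, 1, 0, 1 };
        int queued = 0, compressible = 0;

        /*
         * Behaviour described above: pages are handed to the buffer pool
         * strictly in order, and a filtered page occupies a slot just
         * like an unfiltered one.
         */
        for (int page = 0; page < NR_PAGES && queued < NR_BUFFERS; page++) {
                queued++;
                if (!filtered[page])
                        compressible++;
        }

        printf("%d buffer slots, %d pages queued, %d compressible\n",
               NR_BUFFERS, queued, compressible);
        /*
         * Prints "5 buffer slots, 5 pages queued, 1 compressible":
         * page6 cannot enter the pool while page1..page5 hold every slot,
         * so compression of page1 and page6 never overlaps.
         */
        return 0;
}
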
As for the cause of the degradation: I guess it may be caused by the underlying
implementation of pthread.
I reviewed the test results of patch v2 and found that the results are quite
different on different machines. It seems that I can get almost the same result
as Chao's on the "PRIMEQUEST 1800E".

###################################
- System: PRIMERGY RX300 S6
- CPU: Intel(R) Xeon(R) CPU X5660
- memory: 16GB
###################################

************ makedumpfile -d 7 ******************
                 core-data
                 0        256
threads-num -l
0                10       144
4                5        110
8                5        111
12               6        111

************ makedumpfile -d 31 *****************
                 core-data
                 0        256
threads-num -l
0                0        0
4                2        2
8                2        3
12               2        3

###################################
- System: PRIMEQUEST 1800E
- CPU: Intel(R) Xeon(R) CPU E7540
- memory: 32GB
###################################

************ makedumpfile -d 7 ******************
                 core-data
                 0        256
threads-num -l
0                34       270
4                63       154
8                64       131
12               65       159

************ makedumpfile -d 31 *****************
                 core-data
                 0        256
threads-num -l
0                2        1
4                48       48
8                48       49
12               49       50

>> I'm not so sure whether such a big performance degradation is a problem.
>> But I think that if it works as expected in the other cases, this won't be a problem
>> (or a problem that needs to be fixed), since the performance degradation exists
>> in theory.
>>
>> Or the current implementation could be replaced by a new algorithm.
>> For example:
>> We can add an array to record whether each page is filtered or not.
>> And only the unfiltered pages will take the buffers.
>
> We should discuss how to implement the new mechanism; I'll mention this later.
>
>> But I'm not sure whether it is worth it.
>> Since "-l -d 31" is already fast enough, the new algorithm can't help much there.
>
> Basically, the faster, the better. There is no obvious target time.
> If there is room for improvement, we should do it.
> Maybe we can improve the performance of "-c -d 31" in some case.

BTW, we can easily get the theoretical performance by using "--split".

--
Thanks
Zhou
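
P.S. Here is a rough sketch of the "record filtered pages first" idea quoted above, in
case it helps the later discussion. All names are invented; it is not meant as a patch,
only to show the direction (consult a per-cycle filter map and hand buffer slots only to
unfiltered pages).

/* Rough sketch only -- invented names, not makedumpfile code. */
#include <stdbool.h>

struct cycle_filter {
        unsigned long start_pfn;   /* first pfn covered by the current cycle */
        unsigned long nr_pages;    /* number of pages in the cycle           */
        bool *is_filtered;         /* filled in one pass over the bitmap     */
};

/*
 * Find the next page in the cycle that actually needs to be compressed.
 * Returns false when the cycle is exhausted.
 */
static bool next_unfiltered(struct cycle_filter *c, unsigned long *pfn)
{
        for (unsigned long p = *pfn; p < c->start_pfn + c->nr_pages; p++) {
                if (!c->is_filtered[p - c->start_pfn]) {
                        *pfn = p;
                        return true;
                }
        }
        return false;
}

/*
 * The producer would loop with next_unfiltered() and queue only those
 * pages into the buffer pool, while filtered pages are handled directly
 * and never take a slot.  Then even with -d 31 every buffer holds real
 * compression work and the threads can run in parallel; buffer demand
 * scales with the number of unfiltered pages, not with all pages.
 */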