>>> Think about this: in a huge memory, most of the pages will be filtered,
>>> and we have 5 buffers.
>>>
>>>    page1      page2     page3     page4     page5      page6     page7 .....
>>>  [buffer1]     [2]       [3]       [4]       [5]
>>> unfiltered  filtered  filtered  filtered  filtered  unfiltered  filtered
>>>
>>> Since a filtered page also takes a buffer, when compressing page1,
>>> page6 can't be compressed at the same time.
>>> That's why it prevents parallel compression.
>>
>> Thanks for your explanation, I understand.
>> This is just an issue of the current implementation, there is no
>> reason to accept this restriction.
>>
>>>> Further, according to Chao's benchmark, there is a big performance
>>>> degradation even if the number of threads is 1. (58s vs 240s)
>>>> The current implementation seems to have some problems, we should
>>>> solve them.
>>>>
>>>
>>> If "-d 31" is specified, on the one hand we can't save time by compressing
>>> in parallel, on the other hand we introduce some extra work by adding
>>> "--num-threads". So it is obvious that there will be a performance
>>> degradation.
>>
>> Sure, there must be some overhead due to "some extra work" (e.g. exclusive
>> locking), but "--num-threads=1 is 4 times slower than --num-threads=0"
>> still sounds too slow, the degradation is too big to be called "some
>> extra work".
>>
>> Both --num-threads=0 and --num-threads=1 are serial processing, so the
>> above "buffer fairness issue" cannot be related to this degradation.
>> What do you think causes this degradation?
>>
>
> I can't get such a result at this moment, so I can't do any further
> investigation right now. I guess it may be caused by the underlying
> implementation of pthreads. I reviewed the test results of the patch v2
> and found that the results differ greatly between machines.

Unluckily, I also can't reproduce such a big degradation.
According to Chao's verification, this issue seems different from the
"too many page faults issue" that we solved.
I have no ideas, but at least I want to confirm whether this issue is
avoidable or not.

> It seems that I can get almost the same result as Chao from the
> "PRIMEQUEST 1800E".
>
> ###################################
> - System: PRIMERGY RX300 S6
> - CPU: Intel(R) Xeon(R) CPU X5660
> - memory: 16GB
> ###################################
> ************ makedumpfile -d 7 ******************
>   core-data        0      256
> threads-num
> -l
>   0               10      144
>   4                5      110
>   8                5      111
>  12                6      111
>
> ************ makedumpfile -d 31 ******************
>   core-data        0      256
> threads-num
> -l
>   0                0        0
>   4                2        2
>   8                2        3
>  12                2        3
>
> ###################################
> - System: PRIMEQUEST 1800E
> - CPU: Intel(R) Xeon(R) CPU E7540
> - memory: 32GB
> ###################################
> ************ makedumpfile -d 7 ******************
>   core-data        0      256
> threads-num
> -l
>   0               34      270
>   4               63      154
>   8               64      131
>  12               65      159
>
> ************ makedumpfile -d 31 ******************
>   core-data        0      256
> threads-num
> -l
>   0                2        1
>   4               48       48
>   8               48       49
>  12               49       50

>>> I'm not so sure if the big performance degradation is really a problem.
>>> I think that if it works as expected in the other cases, this won't be
>>> a problem (or a problem that needs to be fixed), since the performance
>>> degradation exists in theory.
>>>
>>> Or the current implementation could be replaced by a new algorithm.
>>> For example:
>>> We can add an array to record whether each page is filtered or not,
>>> and only unfiltered pages will take a buffer.
>>
>> We should discuss how to implement the new mechanism, I'll mention
>> this later.
>>
>>> But I'm not sure if it is worth it.
>>> Since "-l -d 31" is fast enough, the new algorithm also can't help much.
>>
>> Basically, the faster, the better. There is no specific target time.
>> If there is room for improvement, we should do it.
>>
>
> Maybe we can improve the performance of "-c -d 31" in some cases.

Yes, the buffer is used for -c, -l and -p, not only for -l.
It would be useful to improve that.
> BTW, we can easily get the theoretical performance by using "--split".

Are you sure? You persuaded me in the thread below:

http://lists.infradead.org/pipermail/kexec/2015-June/013881.html

--num-threads is orthogonal to --split, it's better to use both options
since they try to solve different bottlenecks. That's why I decided to
merge your multi-thread feature.

However, what you said sounds like --split is a superset of --num-threads.
Don't you need the multi-thread feature anymore?

Thanks,
Atsushi Kumagai