On 12/14/2015 04:26 PM, Atsushi Kumagai wrote:
>>>> Think about this: in a huge memory, most of the pages will be filtered,
>>>> and we have 5 buffers.
>>>>
>>>> page1       page2       page3       page4       page5       page6       page7 .....
>>>> [buffer1]   [2]         [3]         [4]         [5]
>>>> unfiltered  filtered    filtered    filtered    filtered    unfiltered  filtered
>>>>
>>>> Since a filtered page still takes a buffer, page6 can't be compressed
>>>> at the same time as page1 is being compressed.
>>>> That's why it prevents parallel compression.
>>>
>>> Thanks for your explanation, I understand.
>>> This is just an issue of the current implementation; there is no
>>> reason to accept this restriction.
>>>
>>>>> Further, according to Chao's benchmark, there is a big performance
>>>>> degradation even if the number of threads is 1 (58s vs 240s).
>>>>> The current implementation seems to have some problems; we should
>>>>> solve them.
>>>>>
>>>>
>>>> If "-d 31" is specified, on the one hand we can't save time by
>>>> compressing in parallel, and on the other hand we introduce some extra
>>>> work by adding "--num-threads". So it is obvious that there will be a
>>>> performance degradation.
>>>
>>> Sure, there must be some overhead due to that "extra work" (e.g.
>>> exclusive locking), but "--num-threads=1 is 4 times slower than
>>> --num-threads=0" still sounds too slow; the degradation is too big to
>>> be called "some extra work".
>>>
>>> Both --num-threads=0 and --num-threads=1 are serial processing, so the
>>> "buffer fairness issue" above cannot be related to this degradation.
>>> What do you think causes this degradation?
>>>
>>
>> I can't reproduce such a result at this moment, so I can't do any further
>> investigation right now. I guess it may be caused by the underlying
>> implementation of pthreads.
>> I reviewed the test results of the patch v2 and found that the results
>> are quite different on different machines.
>
> Unluckily, I also can't reproduce such a big degradation.
> According to Chao's verification, this issue seems different from
> the "too many page faults" issue that we solved before.
> I have no idea, but at least I want to confirm whether this issue
> is avoidable or not.
>
>> It seems that I can get almost the same results as Chao on the
>> "PRIMEQUEST 1800E".
>>
>> ###################################
>> - System: PRIMERGY RX300 S6
>> - CPU:    Intel(R) Xeon(R) CPU x5660
>> - memory: 16GB
>> ###################################
>> ************ makedumpfile -d 7 ******************
>>               core-data
>> threads-num   0        256
>> -l
>> 0             10       144
>> 4             5        110
>> 8             5        111
>> 12            6        111
>>
>> ************ makedumpfile -d 31 ******************
>>               core-data
>> threads-num   0        256
>> -l
>> 0             0        0
>> 4             2        2
>> 8             2        3
>> 12            2        3
>>
>> ###################################
>> - System: PRIMEQUEST 1800E
>> - CPU:    Intel(R) Xeon(R) CPU E7540
>> - memory: 32GB
>> ###################################
>> ************ makedumpfile -d 7 ******************
>>               core-data
>> threads-num   0        256
>> -l
>> 0             34       270
>> 4             63       154
>> 8             64       131
>> 12            65       159
>>
>> ************ makedumpfile -d 31 ******************
>>               core-data
>> threads-num   0        256
>> -l
>> 0             2        1
>> 4             48       48
>> 8             48       49
>> 12            49       50
>>
>>>> I'm not so sure whether such a big performance degradation is really a
>>>> problem. I think that if it works as expected in the other cases, this
>>>> won't be a problem (or a problem that needs to be fixed), since the
>>>> performance degradation exists in theory.
>>>>
>>>> Or the current implementation could be replaced by a new algorithm.
>>>> For example:
>>>> we can add an array to record whether each page is filtered or not,
>>>> and only unfiltered pages will take a buffer.
>>>
>>> We should discuss how to implement the new mechanism; I'll mention this
>>> later.
>>>
>>>> But I'm not sure whether it is worth it.
>>>> Since "-l -d 31" is fast enough, the new algorithm can't help much
>>>> either.
>>>
>>> Basically, the faster the better. There is no specific target time.
>>> If there is room for improvement, we should do it.
>>>
>>
>> Maybe we can improve the performance of "-c -d 31" in some cases.
>
> Yes, the buffer is used for -c, -l and -p, not only for -l.
> It would be useful to improve that.
>
>> BTW, we can easily get the theoretical performance by using "--split".
>
> Are you sure? You persuaded me in the thread below:
>
> http://lists.infradead.org/pipermail/kexec/2015-June/013881.html
>
>   --num-threads is orthogonal to --split, it's better to use both
>   options since they try to solve different bottlenecks.
>
> That's why I decided to merge your multi-thread feature.
>
> However, what you said makes it sound like --split is a superset of
> --num-threads. Don't you need the multi-thread feature?
>

I just meant the performance. There is no doubt that we will use multiple
threads with --split in the future. But as we all know, threads and
processes have much in common, and in makedumpfile, if we use
"--split core1 core2 core3 core4" and "--num-threads 4" separately, the
time spent should not be very different.
Since the logic of "--split" is simpler, if we can't improve the
performance of "-l -d 31" with "--split", we also don't have much chance
to do it with "--num-threads". That is all I meant. Of course, --split is
not a superset of --num-threads.

--
Thanks
Zhou