ping...

--
Thanks
Zhou Wenjian

On 07/31/2015 05:35 PM, "Zhou, Wenjian/周文剑" wrote:
> On 07/31/2015 04:27 PM, Atsushi Kumagai wrote:
>>> On 07/23/2015 02:20 PM, Atsushi Kumagai wrote:
>>>>> Hello Kumagai,
>>>>>
>>>>> The PATCH v3 has improved the performance.
>>>>> The performance degradation in PATCH v2 was mainly caused by the page
>>>>> faults produced by compress2().
>>>>>
>>>>> I wrote some code to test the performance of compress2(). It takes almost
>>>>> the same time and produces the same number of page faults as executing
>>>>> compress2() in a thread.
>>>>>
>>>>> To reduce page faults, I have to do the following in kdump_thread_function_cyclic():
>>>>>
>>>>> +        /*
>>>>> +         * lock memory to reduce page faults caused by compress2()
>>>>> +         */
>>>>> +        void *temp = malloc(1);
>>>>> +        memset(temp, 0, 1);
>>>>> +        mlockall(MCL_CURRENT);
>>>>> +        free(temp);
>>>>> +
>>>>>
>>>>> With this, the performance is almost the same whether a thread is used or not.
>>>>
>>>> Hmm... I can't get good results with this patch, many page faults still
>>>> occur. I guess mlock will change when the page faults occur, but will not
>>>> change their total number.
>>>> Could you explain why compress2() causes many page faults only in a thread?
>>>> Then I may understand why this patch is meaningful.
>>>>
>>>
>>> Actually, it also causes many page faults even outside a thread, if
>>> info->bitmap2 is not freed in makedumpfile.
>>>
>>> I wrote some code to test the performance of compress2():
>>>
>>> <cut>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1) {
>>>         size_out = SIZE_OUT;  /* compress2() updates size_out, so reset it */
>>>         compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> }
>>> <cut>
>>>
>>> The code looks roughly like this; it causes many page faults.
>>>
>>> But if the code is changed to the following, the result is much better:
>>>
>>> <cut>
>>> temp = malloc(TEMP_SIZE);
>>> memset(temp, 0, TEMP_SIZE);
>>> free(temp);
>>>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1) {
>>>         size_out = SIZE_OUT;
>>>         compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> }
>>> <cut>
>>>
>>> TEMP_SIZE must be large enough
>>> (larger than 135097 works on my machine).
>>>
>>>
>>> If running in a thread, the following code reduces the page faults:
>>>
>>> <cut>
>>> temp = malloc(1);
>>> memset(temp, 0, 1);
>>> mlockall(MCL_CURRENT);
>>> free(temp);
>>>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1) {
>>>         size_out = SIZE_OUT;
>>>         compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> }
>>> <cut>
>>>
>>> I haven't figured out why.
>>
>> I assume that we are facing a known glibc issue:
>>
>> https://sourceware.org/ml/libc-alpha/2015-03/msg00270.html
>>
>> According to the thread above, a per-thread arena is grown and trimmed
>> more readily than the main arena.
>> Actually, compress2() calls malloc() and free() internally each time it
>> is called, so every compression pass causes page faults.
>> Moreover, I confirmed that many madvise(MADV_DONTNEED) calls are invoked
>> only when compress2() is called in a thread.
>>
>> OTOH, in the lzo case, the temporary work buffer is allocated on the
>> caller side, which reduces the number of malloc()/free() pairs.
>> (But I'm not sure why snappy doesn't hit this issue. Its buffer size
>> for compression may be smaller than the trim threshold.)
>>
>> Anyway, it's basically hard to avoid this issue on the application side
>> with zlib, so it seems we have to accept the performance degradation it
>> causes.
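>>
>> (One untested idea for the application side: since compress2() sets up
>> and tears down its deflate state internally on every call, keeping one
>> z_stream per thread and calling deflateReset() for each page would do
>> the internal malloc()/free() only once. A rough sketch, reusing the
>> buf/bufout/size_out names from your test:)
>>
>> <cut>
>> z_stream strm;
>>
>> memset(&strm, 0, sizeof(strm));
>> deflateInit(&strm, Z_BEST_SPEED);       /* allocates the zlib state once */
>>
>> while (1) {
>>         deflateReset(&strm);            /* reuses the state, no malloc()/free() */
>>         strm.next_in = (Bytef *)buf;
>>         strm.avail_in = PAGE_SIZE;
>>         strm.next_out = (Bytef *)bufout;
>>         strm.avail_out = SIZE_OUT;
>>         deflate(&strm, Z_FINISH);       /* Z_STREAM_END if bufout is large enough */
>>         size_out = strm.total_out;
>> }
>>
>> deflateEnd(&strm);
>> <cut>
>>
>> (This assumes SIZE_OUT >= compressBound(PAGE_SIZE) so a single Z_FINISH
>> call completes; I haven't measured whether it actually avoids the
>> madvise() churn.)
>>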
>> Unfortunately, the main target of this multi-thread feature is zlib, as
>> you measured, so we should resolve this issue somehow.
>>
>> Nevertheless, even now we can get some benefit from parallel processing,
>> so let's start discussing the implementation of the parallel processing
>> feature in order to accept this patch. I have some comments:
>>
>> - read_pfn_parallel() doesn't use the cache feature (cache.c); is that
>>   intentional?
>>
>
> Yes. Since each page is read only once here, the cache feature doesn't
> seem to be needed.
>
>> - Now --num-buffers is tunable, but neither the man page description nor
>>   your benchmark mentions the benefit of this parameter.
>>
>
> The default value of num-buffers is 50. Originally the value had a great
> influence on the performance, but since we changed the logic in the 2nd
> version of the patch set, more buffers bring little improvement (1000
> buffers may give about a 1% improvement).
> I'm considering whether the option should be removed. What do you think?
>
> BTW, the code (mlockall) added in the 3rd version works well on several
> machines here. Should I keep it?
> With that code, madvise(MADV_DONTNEED) fails inside compress2() and the
> performance is as expected on those machines.
>
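> In case it helps with judging the mlockall hack, a minimal standalone
> harness for comparing per-thread minor fault counts might look like the
> sketch below (untested here; PAGE_SIZE/SIZE_OUT match the test above,
> while LOOPS and worker() are just illustrative names). It uses
> getrusage(RUSAGE_THREAD) to count minor faults around the compress2() loop:
>
> <cut>
> #define _GNU_SOURCE             /* for RUSAGE_THREAD */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <pthread.h>
> #include <sys/mman.h>
> #include <sys/resource.h>
> #include <zlib.h>
>
> #define PAGE_SIZE 4096
> #define SIZE_OUT  (PAGE_SIZE * 2)
> #define LOOPS     100000
>
> static void *worker(void *arg)
> {
>         int lock = (arg != NULL);
>         struct rusage before, after;
>         unsigned char *buf, *bufout;
>         long i;
>
>         if (lock) {
>                 /* same trick as in the patch: touch the per-thread
>                  * arena first, then lock it */
>                 void *temp = malloc(1);
>                 memset(temp, 0, 1);
>                 if (mlockall(MCL_CURRENT) != 0)
>                         perror("mlockall");
>                 free(temp);
>         }
>
>         buf = calloc(1, PAGE_SIZE);
>         bufout = malloc(SIZE_OUT);
>         memset(buf, 1, PAGE_SIZE / 2);
>
>         getrusage(RUSAGE_THREAD, &before);
>         for (i = 0; i < LOOPS; i++) {
>                 uLongf size_out = SIZE_OUT;
>                 compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>         }
>         getrusage(RUSAGE_THREAD, &after);
>
>         printf("mlockall=%d: %ld minor faults\n", lock,
>                after.ru_minflt - before.ru_minflt);
>         free(buf);
>         free(bufout);
>         return NULL;
> }
>
> int main(void)
> {
>         pthread_t t;
>
>         pthread_create(&t, NULL, worker, NULL);       /* without mlockall */
>         pthread_join(t, NULL);
>         pthread_create(&t, NULL, worker, (void *)1);  /* with mlockall */
>         pthread_join(t, NULL);
>         return 0;
> }
> <cut>
>
> Built with something like "gcc -O2 test.c -lpthread -lz", comparing the
> two printed fault counts should show whether the mlockall trick takes
> effect on a given machine/glibc.
>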