>On Mon, 16 Mar 2015 05:14:25 +0000
>Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>
>> >On Fri, 13 Mar 2015 04:10:22 +0000
>> >Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>>[...]
>> >> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
>> >> you don't plan to update it.
>> >
>> >Since v2 already brings some performance gain, I would appreciate it
>> >if you could adopt it for v1.5.8.
>>
>> OK, but unfortunately I got some error logs during my test, like below:
>>
>> $ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
>> Excluding free pages               : [  0.0 %] /
>> reset_bitmap_of_free_pages: The free list is broken.
>> reset_bitmap_of_free_pages: The free list is broken.
>>
>> makedumpfile Failed.
>> $
>>
>> All of the errors are the same as the above, at least in my test.
>> I confirmed with git bisect that [PATCH v2 7/8] causes this,
>> but the root cause is still under investigation.
>
>The only change I can think of is the removal of page_is_fractional.
>Originally, LOADs that do not start on a page boundary were never
>mmapped. With this patch, this check is removed.
>
>Can you try adding the following check to mappage_elf (and dropping
>patch 8/8)?
>
>	if (page_is_fractional(offset))
>		return NULL;

It worked, thanks!
Additionally, I've remembered that we should keep page_is_fractional()
for old kernels.

https://lkml.org/lkml/2013/11/13/439


Thanks,
Atsushi Kumagai

>Petr T
>
>P.S. This reminds me I should try to get some kernel dumps with
>fractional pages for regression testing...
>
>> >Thank you very much,
>> >Petr Tesarik
>> >
>> >> Thanks,
>> >> Atsushi Kumagai
>> >>
>> >> >
>> >> >> Here are the actual results I got with "perf record":
>> >> >>
>> >> >> $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
>> >> >>
>> >> >> Output of "perf report" for the mmap case:
>> >> >>
>> >> >> /* Most time spent for unmap in kernel */
>> >> >> 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
>> >> >>  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
>> >> >>  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
>> >> >>
>> >> >> /* Still some mmap overhead in makedumpfile readmem() */
>> >> >> 21.56%  makedumpfile  makedumpfile       [.] readmem
>> >> >
>> >> >This number is interesting. Did you compile makedumpfile with
>> >> >optimizations? If yes, then this number probably includes some
>> >> >functions which were inlined.
>> >> >
>> >> >>  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >> >>
>> >> >> Output of "perf report" for the non-mmap case:
>> >> >>
>> >> >> /* Time spent for sys_read (that also needs two copy
>> >> >>    operations on s390 :() */
>> >> >> 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
>> >> >> 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
>> >> >>
>> >> >> /* readmem() for the read path is cheaper? */
>> >> >> 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >> >>  4.53%  makedumpfile  makedumpfile       [.] readmem
>> >> >
>> >> >Yes, the much lower overhead of readmem is strange. For a moment I
>> >> >suspected wrong accounting of the page fault handler, but then I
>> >> >realized that for /proc/vmcore, all page table entries are created
>> >> >with the present bit already set, so there are no page faults...
>> >> >
>> >> >I haven't had time yet to set up a system for reproduction, but
>> >> >I'll try to identify what's eating up the CPU time in readmem().
>> >> >
>> >> >>[...]
>> >> >> I hope this analysis helps more than it confuses :-)
>> >> >>
>> >> >> As a conclusion, we could think of mapping larger chunks
>> >> >> also for the fragmented case of -d 31 to reduce the number
>> >> >> of mmap/munmap calls.
>> >> >
>> >> >I agree in general. Memory mapped through /proc/vmcore does not
>> >> >increase run-time memory requirements, because it only adds a
>> >> >mapping to the old kernel's memory. The only limiting factor is
>> >> >the virtual address space. On many architectures, this is no
>> >> >issue at all, and we could simply map the whole file at the
>> >> >beginning. On some architectures, the virtual address space is
>> >> >smaller than possible physical RAM, so this approach would not
>> >> >work for them.
>> >> >
>> >> >> Another open question was why the mmap case consumes more CPU
>> >> >> time in readmem() than the read case. Our theory is that the
>> >> >> first memory access is slower because it is not in the HW
>> >> >> cache. For the mmap case, userspace issues the first access
>> >> >> (the copy to the makedumpfile cache), and for the read case,
>> >> >> the kernel issues the first access (memcpy_real/copy_to_user).
>> >> >> Therefore the cache miss is accounted to userspace for mmap
>> >> >> and to the kernel for read.
>> >> >
>> >> >I have no idea how to measure this on s390. On x86_64 I would add
>> >> >some asm code to read the TSC before and after the memory access
>> >> >instruction. I guess there is a similar counter on s390.
>> >> >Suggestions?
>> >> >
>> >> >> And last but not least, perhaps on s390 we could replace
>> >> >> the bounce buffer used for memcpy_real()/copy_to_user() with
>> >> >> some more intelligent solution.
>> >> >
>> >> >Which would then improve the non-mmap times even more, right?
>> >> >
>> >> >Petr T
>>
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec