>On Mon, 16 Mar 2015 05:14:25 +0000
>Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>
>> >On Fri, 13 Mar 2015 04:10:22 +0000
>> >Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>>[...]
>> >> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
>> >> you don't plan to update it.
>> >
>> >Since v2 already brings some performance gain, I would appreciate it
>> >if you could adopt it for v1.5.8.
>>
>> OK, but unfortunately I got some error logs during my test, like below:
>>
>> $ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
>> Excluding free pages               : [  0.0 %] /
>> reset_bitmap_of_free_pages: The free list is broken.
>> reset_bitmap_of_free_pages: The free list is broken.
>>
>> makedumpfile Failed.
>> $
>>
>> All of the errors are the same as the above, at least in my test.
>> I confirmed with git bisect that [PATCH v2 7/8] causes this,
>> but the root cause is still under investigation.
>
>The only change I can think of is the removal of page_is_fractional.
>Originally, LOADs that do not start on a page boundary were never
>mmapped. With this patch, this check is removed.
>
>Can you try adding the following check to mappage_elf (and dropping
>patch 8/8)?
>
>	if (page_is_fractional(offset))
>		return NULL;

It worked, thanks!
Additionally, I've remembered that we should keep page_is_fractional()
for old kernels.

https://lkml.org/lkml/2013/11/13/439


Thanks,
Atsushi Kumagai

>Petr T
>
>P.S. This reminds me I should try to get some kernel dumps with
>fractional pages for regression testing...
>
>> >Thank you very much,
>> >Petr Tesarik
>> >
>> >> Thanks,
>> >> Atsushi Kumagai
>> >>
>> >> >
>> >> >> Here are the actual results I got with "perf record":
>> >> >>
>> >> >> $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
>> >> >>
>> >> >> Output of "perf report" for the mmap case:
>> >> >>
>> >> >> /* Most time spent for unmap in kernel */
>> >> >> 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
>> >> >>  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
>> >> >>  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
>> >> >>
>> >> >> /* Still some mmap overhead in makedumpfile readmem() */
>> >> >> 21.56%  makedumpfile  makedumpfile       [.] readmem
>> >> >
>> >> >This number is interesting. Did you compile makedumpfile with
>> >> >optimizations? If yes, then this number probably includes some
>> >> >functions which were inlined.
>> >> >
>> >> >>  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >> >>
>> >> >> Output of "perf report" for the non-mmap case:
>> >> >>
>> >> >> /* Time spent for sys_read (that also needs two copy
>> >> >>    operations on s390 :() */
>> >> >> 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
>> >> >> 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
>> >> >>
>> >> >> /* readmem() for the read path is cheaper? */
>> >> >> 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >> >>  4.53%  makedumpfile  makedumpfile       [.] readmem
>> >> >
>> >> >Yes, the much lower overhead of readmem is strange. For a moment I
>> >> >suspected wrong accounting of the page fault handler, but then I
>> >> >realized that for /proc/vmcore, all page table entries are created
>> >> >with the present bit already set, so there are no page faults...
>> >> >
>> >> >I haven't had time yet to set up a system for reproduction, but
>> >> >I'll try to identify what's eating up the CPU time in readmem().
>> >> >
>> >> >>[...]
>> >> >> I hope this analysis helps more than it confuses :-)
>> >> >>
>> >> >> As a conclusion, we could think of mapping larger chunks
>> >> >> also for the fragmented case of -d 31 to reduce the number
>> >> >> of mmap/munmap calls.
>> >> >
>> >> >I agree in general. Memory mapped through /proc/vmcore does not
>> >> >increase run-time memory requirements, because it only adds a
>> >> >mapping to the old kernel's memory. The only limiting factor is
>> >> >the virtual address space. On many architectures, this is no
>> >> >issue at all, and we could simply map the whole file at the
>> >> >beginning. On some architectures, the virtual address space is
>> >> >smaller than possible physical RAM, so this approach would not
>> >> >work for them.
>> >> >
>> >> >> Another open question was why the mmap case consumes more CPU
>> >> >> time in readmem() than the read case. Our theory is that the
>> >> >> first memory access is slower because it is not in the HW
>> >> >> cache. For the mmap case, userspace issues the first access
>> >> >> (the copy to the makedumpfile cache), and for the read case,
>> >> >> the kernel issues the first access (memcpy_real/copy_to_user).
>> >> >> Therefore the cache miss is accounted to userspace for mmap
>> >> >> and to the kernel for read.
>> >> >
>> >> >I have no idea how to measure this on s390. On x86_64 I would add
>> >> >some asm code to read the TSC before and after the memory access
>> >> >instruction. I guess there is a similar counter on s390.
>> >> >Suggestions?
>> >> >
>> >> >> And last but not least, perhaps on s390 we could replace
>> >> >> the bounce buffer used for memcpy_real()/copy_to_user() with
>> >> >> some more intelligent solution.
>> >> >
>> >> >Which would then improve the non-mmap times even more, right?
>> >> >
>> >> >Petr T
>>
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec