On Mon, 16 Mar 2015 05:14:25 +0000
Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:

> >On Fri, 13 Mar 2015 04:10:22 +0000
> >Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>[...]
> >> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
> >> you don't plan to update it.
> >
> >Since v2 already brings some performance gain, I'd appreciate it if
> >you could adopt it for v1.5.8.
>
> Ok, but unfortunately I got some error logs during my tests, like
> the one below:
>
>  $ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
>  Excluding free pages               : [  0.0 %] /
>  reset_bitmap_of_free_pages: The free list is broken.
>  reset_bitmap_of_free_pages: The free list is broken.
>
>  makedumpfile Failed.
>  $
>
> All of the errors are the same as the one above, at least in my
> tests. I determined with git bisect that [PATCH v2 7/8] causes this,
> but the root cause is still under investigation.

The only change I can think of is the removal of page_is_fractional.
Originally, LOADs that do not start on a page boundary were never
mmapped; with this patch, that check is removed.

Can you try adding the following check to mappage_elf (and dropping
patch 8/8)?

	if (page_is_fractional(offset))
		return NULL;
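In context, the check might sit roughly like this. The surrounding
lines are only a sketch from memory, not the exact makedumpfile
source; page_is_fractional() is essentially just a page-alignment
test on the segment's file offset:

	static char *
	mappage_elf(unsigned long long paddr)
	{
		off_t offset = paddr_to_offset(paddr);

		if (!offset)
			return NULL;

		/*
		 * Fall back to read() for LOAD segments whose file
		 * offset is not page-aligned; mmapping those is my
		 * suspect for the broken free list above.
		 */
		if (page_is_fractional(offset))
			return NULL;

		/* ... existing mmap lookup/creation continues ... */
	}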
Petr T

P.S. This reminds me that I should try to get some kernel dumps with
fractional pages for regression testing...

> >Thank you very much,
> >Petr Tesarik
> >
> >> Thanks
> >> Atsushi Kumagai
> >>
> >> >
> >> >> Here are the actual results I got with "perf record":
> >> >>
> >> >>  $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
> >> >>
> >> >> Output of "perf report" for the mmap case:
> >> >>
> >> >>  /* Most time spent for unmap in the kernel */
> >> >>  29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
> >> >>   9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
> >> >>   8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
> >> >>
> >> >>  /* Still some mmap overhead in makedumpfile readmem() */
> >> >>  21.56%  makedumpfile  makedumpfile       [.] readmem
> >> >
> >> >This number is interesting. Did you compile makedumpfile with
> >> >optimizations? If yes, then this number probably includes some
> >> >functions which were inlined.
> >> >
> >> >>   8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >> >>
> >> >> Output of "perf report" for the non-mmap case:
> >> >>
> >> >>  /* Time spent in sys_read (which also needs two copy
> >> >>     operations on s390 :() */
> >> >>  25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
> >> >>  22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
> >> >>
> >> >>  /* readmem() for the read path is cheaper? */
> >> >>  13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >> >>   4.53%  makedumpfile  makedumpfile       [.] readmem
> >> >
> >> >Yes, the much lower overhead of readmem is strange. For a moment
> >> >I suspected wrong accounting of the page fault handler, but then
> >> >I realized that for /proc/vmcore, all page table entries are
> >> >created with the present bit already set, so there are no page
> >> >faults...
> >> >
> >> >I haven't had time yet to set up a system for reproduction, but
> >> >I'll try to identify what's eating up the CPU time in readmem().
> >> >
> >> >>[...]
> >> >> I hope this analysis helps more than it confuses :-)
> >> >>
> >> >> As a conclusion, we could think of mapping larger chunks also
> >> >> for the fragmented case of -d 31 to reduce the number of
> >> >> mmap/munmap calls.
> >> >
> >> >I agree in general. Memory mapped through /proc/vmcore does not
> >> >increase run-time memory requirements, because it only adds a
> >> >mapping to the old kernel's memory. The only limiting factor is
> >> >the virtual address space. On many architectures, this is no
> >> >issue at all, and we could simply map the whole file at the
> >> >beginning. On some architectures, the virtual address space is
> >> >smaller than the possible physical RAM, so this approach would
> >> >not work for them.
> >> >
> >> >> Another open question was why the mmap case consumes more CPU
> >> >> time in readmem() than the read case. Our theory is that the
> >> >> first memory access is slower because it is not in the HW
> >> >> cache. For the mmap case, userspace issues the first access
> >> >> (the copy to the makedumpfile cache), and for the read case
> >> >> the kernel issues the first access (memcpy_real/copy_to_user).
> >> >> Therefore the cache miss is accounted to userspace for mmap
> >> >> and to the kernel for read.
> >> >
> >> >I have no idea how to measure this on s390. On x86_64, I would
> >> >add some asm code to read the TSC before and after the memory
> >> >access instruction. I guess there is a similar counter on s390.
> >> >Suggestions?
> >> >
> >> >> And last but not least, perhaps on s390 we could replace the
> >> >> bounce buffer used for memcpy_real()/copy_to_user() with some
> >> >> more intelligent solution.
> >> >
> >> >Which would then improve the non-mmap times even more, right?
> >> >
> >> >Petr T

_______________________________________________
kexec mailing list
kexec at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
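As for the TSC-based measurement quoted above, a minimal x86_64
sketch could look like this (GCC/Clang with x86intrin.h; the
time_first_access() helper and its naming are illustrative only,
not existing makedumpfile code):

	#include <stdint.h>
	#include <x86intrin.h>	/* __rdtscp(), GCC/Clang on x86_64 */

	/*
	 * Count TSC ticks for the very first access to a page, e.g. a
	 * freshly mmapped page of /proc/vmcore. Comparing this against
	 * a second access to the same page would show how much of the
	 * mmap-case readmem() time is really just the initial cache
	 * miss. __rdtscp() is only partially serializing, so treat the
	 * result as a rough estimate, not a calibrated benchmark.
	 */
	static uint64_t
	time_first_access(volatile unsigned char *page)
	{
		unsigned int aux;
		uint64_t t0, t1;
		unsigned char c;

		t0 = __rdtscp(&aux);	/* TSC read after prior insns retire */
		c = page[0];		/* the memory access under test */
		t1 = __rdtscp(&aux);
		(void)c;

		return t1 - t0;
	}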