On Fri, 13 Mar 2015 04:10:22 +0000
Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:

> Hello,
>
> (Note: my email address has changed.)
>
> In x86_64, calling ioremap/iounmap per page in copy_oldmem_page()
> causes a big performance degradation, so mmap() was introduced on
> /proc/vmcore. However, there is no big difference between read() and
> mmap() on s390, since it doesn't need ioremap/iounmap in
> copy_oldmem_page(), so other issues have been revealed, right?
>
> [...]
>
> >> I counted the mmap and read system calls with "perf stat":
> >>
> >>                  mmap    unmap     read    = sum
> >> ================================================
> >> mmap -d0          482      443      165     1090
> >> mmap -d31       13454    13414      165    27033
> >> non-mmap -d0       34        3   458917   458954
> >> non-mmap -d31      34        3    74273    74310
> >
> >If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
> >reasonable. For -d31, we should be able to do better than this
> >by allocating more cache slots and improving the algorithm.
> >I originally didn't deem it worth the effort, but seeing almost
> >30 times more mmaps than with -d0 may change my mind.
>
> Are you going to do it as a v3 patch?

No. Tuning the caching algorithm requires a lot of research. I plan to
do it, but testing it with all scenarios (and tuning the algorithm
based on the results) will probably take weeks. I don't think it makes
sense to wait for it.

> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
> you don't plan to update it.

Since v2 already brings some performance gain, I would appreciate it
if you could adopt it for v1.5.8.

Thank you very much,
Petr Tesarik

> Thanks
> Atsushi Kumagai
>
> >
> >> Here are the actual results I got with "perf record":
> >>
> >> $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
> >>
> >> Output of "perf report" for the mmap case:
> >>
> >> /* Most time spent for unmap in kernel */
> >> 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
> >>  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
> >>  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
> >>
> >> /* Still some mmap overhead in makedumpfile readmem() */
> >> 21.56%  makedumpfile  makedumpfile       [.] readmem
> >
> >This number is interesting. Did you compile makedumpfile with
> >optimizations? If yes, then this number probably includes some
> >functions which were inlined.
> >
> >>  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >>
> >> Output of "perf report" for the non-mmap case:
> >>
> >> /* Time spent for sys_read (which also needs two copy operations
> >> on s390 :() */
> >> 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
> >> 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
> >>
> >> /* readmem() for the read path is cheaper? */
> >> 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >>  4.53%  makedumpfile  makedumpfile       [.] readmem
> >
> >Yes, the much lower overhead of readmem is strange. For a moment I
> >suspected wrong accounting of the page fault handler, but then I
> >realized that for /proc/vmcore, all page table entries are created
> >with the present bit set already, so there are no page faults...
> >
> >I haven't had time yet to set up a system for reproduction, but I'll
> >try to identify what's eating up the CPU time in readmem().
> >
> >> [...]
> >> I hope this analysis helps more than it confuses :-)
> >>
> >> As a conclusion, we could think of mapping larger chunks
> >> also for the fragmented case of -d 31 to reduce the amount
> >> of mmap/munmap calls.
> >
> >I agree in general. Memory mapped through /proc/vmcore does not
> >increase run-time memory requirements, because it only adds a mapping
> >to the old kernel's memory. The only limiting factor is the virtual
> >address space. On many architectures, this is no issue at all, and we
> >could simply map the whole file at the beginning. On some
> >architectures, the virtual address space is smaller than the possible
> >physical RAM, so this approach would not work for them.
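
To illustrate the "larger chunks" idea, here is a minimal sketch that
maps one big window of /proc/vmcore and serves many reads from it
without further syscalls. The 64 MiB window size and the plain
memcpy() calls are only placeholders, not makedumpfile's actual
readmem()/cache logic:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_WINDOW_SIZE (64UL << 20)   /* 64 MiB per mapping -- arbitrary */

int main(void)
{
	int fd = open("/proc/vmcore", O_RDONLY);
	if (fd < 0) {
		perror("open /proc/vmcore");
		return 1;
	}

	/* One mmap() covers many page-sized reads; the file offset must
	 * be page-aligned (0 here for simplicity). */
	char *window = mmap(NULL, MAP_WINDOW_SIZE, PROT_READ, MAP_PRIVATE,
			    fd, 0);
	if (window == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* All accesses inside the window need no additional syscalls. */
	char page[4096];
	memcpy(page, window, sizeof(page));
	memcpy(page, window + MAP_WINDOW_SIZE - sizeof(page), sizeof(page));

	munmap(window, MAP_WINDOW_SIZE);
	close(fd);
	return 0;
}

The point is only that a single mmap()/munmap() pair can stand in for
thousands of per-chunk calls, which is what the -d31 numbers above
suggest.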
> >> Another open question was why the mmap case consumes more CPU
> >> time in readmem() than the read case. Our theory is that the
> >> first memory access is slower because it is not in the HW
> >> cache. For the mmap case, userspace issues the first access (the
> >> copy into the makedumpfile cache), and for the read case, the
> >> kernel issues the first access (memcpy_real/copy_to_user).
> >> Therefore the cache miss is accounted to userspace for mmap and
> >> to the kernel for read.
> >
> >I have no idea how to measure this on s390. On x86_64 I would add
> >some asm code to read the TSC before and after the memory access
> >instruction. I guess there is a similar counter on s390. Suggestions?
> >
> >> And last but not least, perhaps on s390 we could replace
> >> the bounce buffer used for memcpy_real()/copy_to_user() with
> >> some more intelligent solution.
> >
> >Which would then improve the non-mmap times even more, right?
> >
> >Petr T
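
Regarding the TSC idea above, a minimal sketch of what "read the TSC
before and after the memory access" could look like on x86_64. The
cpuid/rdtscp serialization and the measured buffer are just one way to
do it; an s390 equivalent would need a different counter:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t tsc_before(void)
{
	uint32_t lo, hi;
	/* cpuid serializes, so the measured access cannot start early */
	asm volatile("cpuid\n\trdtsc"
		     : "=a" (lo), "=d" (hi) : "a" (0) : "%rbx", "%rcx");
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t tsc_after(void)
{
	uint32_t lo, hi;
	/* rdtscp waits for earlier loads to finish before reading the TSC */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi) : : "%rcx");
	return ((uint64_t)hi << 32) | lo;
}

/* TSC ticks spent on a single read of *p */
static uint64_t time_one_access(volatile char *p)
{
	uint64_t t0 = tsc_before();
	(void)*p;                       /* the memory access being measured */
	uint64_t t1 = tsc_after();
	return t1 - t0;
}

int main(void)
{
	static char buf[4096];
	/* Compare a first access with a repeated access to the same line. */
	printf("first:  %llu ticks\n",
	       (unsigned long long)time_one_access(buf));
	printf("second: %llu ticks\n",
	       (unsigned long long)time_one_access(buf));
	return 0;
}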