>On Fri, 13 Mar 2015 04:10:22 +0000
>Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>
>> Hello,
>>
>> (Note: my email address has changed.)
>>
>> On x86_64, calling ioremap/iounmap per page in copy_oldmem_page()
>> causes big performance degradation, so mmap() was introduced on
>> /proc/vmcore. However, there is no big difference between read() and
>> mmap() on s390, since it doesn't need ioremap/iounmap in
>> copy_oldmem_page(), so other issues have been revealed, right?
>>
>> [...]
>>
>> >> I counted the mmap and read system calls with "perf stat":
>> >>
>> >>                  mmap   unmap    read  =    sum
>> >> ================================================
>> >> mmap -d0          482     443     165      1090
>> >> mmap -d31       13454   13414     165     27033
>> >> non-mmap -d0       34       3  458917    458954
>> >> non-mmap -d31      34       3   74273     74310
>> >
>> >If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
>> >reasonable. For -d31, we should be able to do better than this
>> >by allocating more cache slots and improving the algorithm.
>> >I originally didn't deem it worth the effort, but seeing almost
>> >30 times more mmaps than with -d0 may change my mind.
>>
>> Are you going to do that as a v3 patch?
>
>No. Tuning the caching algorithm requires a lot of research. I plan to
>do it, but testing it with all scenarios (and tuning the algorithm
>based on the results) will probably take weeks. I don't think it makes
>sense to wait for it.
>
>> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
>> you don't plan to update it.
>
>Since v2 already brings some performance gain, I would appreciate it
>if you could adopt it for v1.5.8.

OK, but unfortunately I got some error logs during my tests, like below:

$ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
Excluding free pages               : [  0.0 %] /
reset_bitmap_of_free_pages: The free list is broken.
reset_bitmap_of_free_pages: The free list is broken.

makedumpfile Failed.
$

All of the errors were the same as the above, at least in my tests.
Using git bisect I identified [PATCH v2 7/8] as the trigger, but the
root cause is still under investigation. The failing check is this one:

4064         if (!readmem(VADDR, curr + OFFSET(list_head.prev),
4065                      &curr_prev, sizeof curr_prev)) { // gets the wrong value here
4066                 ERRMSG("Can't get prev list_head.\n");
4067                 return FALSE;
4068         }
4069         if (previous != curr_prev) {
4070                 ERRMSG("The free list is broken.\n");
4071                 retcd = ANALYSIS_FAILED;
4072                 return FALSE;
4073         }
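For context, reset_bitmap_of_free_pages() walks each free list and
verifies that every node's prev pointer points back to the node it
just came from. Below is a minimal standalone sketch of that pattern;
the readmem() here is only a stand-in that copies from local memory
(in makedumpfile it goes through the cached read()/mmap() path on
/proc/vmcore), and all other names are illustrative:

#include <stdio.h>
#include <string.h>

struct list_head { unsigned long next, prev; };

/* Three fake "old kernel" nodes forming a circular free list. */
static struct list_head nodes[3];

/* Stand-in for makedumpfile's readmem(). */
static int readmem(unsigned long vaddr, void *buf, size_t size)
{
	memcpy(buf, (void *)vaddr, size);
	return 1;
}

int main(void)
{
	unsigned long head, previous, curr;
	int i;

	/* Link the nodes into a circular doubly linked list. */
	for (i = 0; i < 3; i++) {
		nodes[i].next = (unsigned long)&nodes[(i + 1) % 3];
		nodes[i].prev = (unsigned long)&nodes[(i + 2) % 3];
	}

	head = (unsigned long)&nodes[0];
	previous = head;
	curr = nodes[0].next;

	while (curr != head) {
		struct list_head lh;

		if (!readmem(curr, &lh, sizeof lh)) {
			fprintf(stderr, "Can't get prev list_head.\n");
			return 1;
		}
		/* The check that fires above: if a cached read returns
		 * a stale prev value, the list looks broken. */
		if (lh.prev != previous) {
			fprintf(stderr, "The free list is broken.\n");
			return 1;
		}
		previous = curr;
		curr = lh.next;
	}
	printf("free list is consistent\n");
	return 0;
}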
Thanks,
Atsushi Kumagai

>Thank you very much,
>Petr Tesarik
>
>> Thanks,
>> Atsushi Kumagai
>>
>> >> Here are the actual results I got with "perf record":
>> >>
>> >> $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
>> >>
>> >> Output of "perf report" for the mmap case:
>> >>
>> >> /* Most time spent for unmap in the kernel */
>> >> 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
>> >>  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
>> >>  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
>> >>
>> >> /* Still some mmap overhead in makedumpfile readmem() */
>> >> 21.56%  makedumpfile  makedumpfile       [.] readmem
>> >
>> >This number is interesting. Did you compile makedumpfile with
>> >optimizations? If yes, then this number probably includes some
>> >functions which were inlined.
>> >
>> >>  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >>
>> >> Output of "perf report" for the non-mmap case:
>> >>
>> >> /* Time spent for sys_read (which also needs two copy operations
>> >>    on s390 :() */
>> >> 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
>> >> 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
>> >>
>> >> /* readmem() for the read path is cheaper? */
>> >> 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >>  4.53%  makedumpfile  makedumpfile       [.] readmem
>> >
>> >Yes, the much lower overhead of readmem is strange. For a moment I
>> >suspected wrong accounting of the page fault handler, but then I
>> >realized that for /proc/vmcore, all page table entries are created
>> >with the present bit already set, so there are no page faults...
>> >
>> >I haven't had time yet to set up a system for reproduction, but I'll
>> >try to identify what's eating up the CPU time in readmem().
>> >
>> >> [...]
>> >> I hope this analysis helps more than it confuses :-)
>> >>
>> >> As a conclusion, we could think of mapping larger chunks
>> >> also for the fragmented case of -d 31 to reduce the number
>> >> of mmap/munmap calls.
>> >
>> >I agree in general. Memory mapped through /proc/vmcore does not
>> >increase run-time memory requirements, because it only adds a
>> >mapping to the old kernel's memory. The only limiting factor is
>> >the virtual address space. On many architectures, this is no issue
>> >at all, and we could simply map the whole file at the beginning.
>> >On some architectures, the virtual address space is smaller than
>> >the possible physical RAM, so this approach would not work for
>> >them.
>> >
>> >> Another open question was why the mmap case consumes more CPU
>> >> time in readmem() than the read case. Our theory is that the
>> >> first memory access is slower because it is not in the HW
>> >> cache. In the mmap case, userspace issues the first access (the
>> >> copy into the makedumpfile cache), while in the read case the
>> >> kernel issues the first access (memcpy_real/copy_to_user).
>> >> Therefore the cache miss is accounted to userspace for mmap and
>> >> to the kernel for read.
>> >
>> >I have no idea how to measure this on s390. On x86_64 I would add
>> >some asm code to read the TSC before and after the memory access
>> >instruction. I guess there is a similar counter on s390.
>> >Suggestions?
>> >
>> >> And last but not least, perhaps on s390 we could replace the
>> >> bounce buffer used for memcpy_real()/copy_to_user() with some
>> >> more intelligent solution.
>> >
>> >Which would then improve the non-mmap times even more, right?
>> >
>> >Petr T
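For illustration, here is a rough sketch of the "map larger chunks"
idea quoted above: keep one mapping of a large, aligned window and only
remap when a read falls outside it, so thousands of per-page
mmap/munmap pairs collapse into a few per chunk. The names, the 16 MiB
window size, and the assumption that all read offsets stay inside the
file are illustrative, not makedumpfile code:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK_SIZE (16UL << 20)		/* illustrative 16 MiB window */

static void *win = MAP_FAILED;
static off_t win_base = -1;

/* Serve a small read from the current window, remapping only when the
 * requested offset leaves it. */
static int read_chunked(int fd, off_t off, void *buf, size_t len)
{
	off_t base = off & ~((off_t)CHUNK_SIZE - 1);

	if (win == MAP_FAILED || base != win_base) {
		if (win != MAP_FAILED)
			munmap(win, CHUNK_SIZE);
		win = mmap(NULL, CHUNK_SIZE, PROT_READ, MAP_PRIVATE,
			   fd, base);
		if (win == MAP_FAILED)
			return -1;
		win_base = base;
	}
	memcpy(buf, (char *)win + (off - win_base), len);
	return 0;
}

int main(int argc, char **argv)
{
	char page[4096];
	off_t off;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	/* Many page-sized reads, but only one mmap() per 16 MiB chunk.
	 * Assumes the file is at least 2 * CHUNK_SIZE bytes long. */
	for (off = 0; off < (off_t)(2 * CHUNK_SIZE); off += sizeof page)
		if (read_chunked(fd, off, page, sizeof page) < 0)
			break;
	close(fd);
	return 0;
}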
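And for the TSC measurement Petr suggests, a minimal x86_64 sketch
using the compiler's __rdtsc() intrinsic instead of hand-written asm.
A careful measurement would additionally serialize with cpuid or use
rdtscp, and on s390 the STCK (store clock) instruction could perhaps
play a similar role:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

static char buf[4096];

int main(void)
{
	volatile char *p = buf;
	uint64_t t0, t1;

	t0 = __rdtsc();
	(void)*p;		/* first touch: likely a HW cache miss */
	t1 = __rdtsc();
	printf("first access : %llu cycles\n",
	       (unsigned long long)(t1 - t0));

	t0 = __rdtsc();
	(void)*p;		/* second touch: should hit the cache */
	t1 = __rdtsc();
	printf("second access: %llu cycles\n",
	       (unsigned long long)(t1 - t0));

	return 0;
}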