Hello Petr,

With your patches I now used "perf record" and "perf stat" to check
where the CPU time is consumed for -d31 and -d0. For -d31 the read
case is better, and for -d0 the mmap case is better.

$ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f [--non-mmap]

            user    sys   = total
=======================================
mmap       0.156  0.248     0.404
non-mmap   0.090  0.180     0.270

$ time ./makedumpfile -d 0 /proc/vmcore /dev/null -f [--non-mmap]

            user    sys   = total
=======================================
mmap       0.637  0.018     0.655
non-mmap   0.275  1.153     1.428

As already said, we think the reason is that for -d0 we issue only a
small number of mmap/munmap calls because the mmap chunks are larger
than the read chunks. For -d31 memory is fragmented and we issue lots
of small mmap/munmap calls. Because munmap (at least on s390) is a
very expensive operation and we need two calls (mmap/munmap), the
mmap mode is slower than the read mode.

I counted the mmap and read system calls with "perf stat":

                 mmap   munmap    read    = sum
===============================================
mmap -d0          482      443     165     1090
mmap -d31       13454    13414     165    27033
non-mmap -d0       34        3  458917   458954
non-mmap -d31      34        3   74273    74310

Here are the actual results I got with "perf record":

$ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f

Output of "perf report" for the mmap case:

/* Most time spent for unmap in the kernel */
 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
/* Still some mmap overhead in makedumpfile readmem() */
 21.56%  makedumpfile  makedumpfile       [.] readmem
  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic

Output of "perf report" for the non-mmap case:

/* Time spent for sys_read (which also needs two copy operations on s390 :() */
 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
/* readmem() for the read path is cheaper? */
 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
  4.53%  makedumpfile  makedumpfile       [.] readmem

$ time ./makedumpfile -d 0 /proc/vmcore /dev/null -f

Output of "perf report" for the mmap case:

/* Almost no kernel time because we issue very few system calls */
  0.61%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
  0.61%  makedumpfile  [kernel.kallsyms]  [k] sysc_do_svc
/* Almost all time consumed in user space */
 84.64%  makedumpfile  makedumpfile       [.] readmem
  8.82%  makedumpfile  makedumpfile       [.] write_cache

Output of "perf report" for the non-mmap case:

/* Time spent for sys_read (which also needs two copy operations on s390) */
 31.50%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
 29.33%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
/* Very little user space time */
  3.87%  makedumpfile  makedumpfile       [.] write_cache
  3.82%  makedumpfile  makedumpfile       [.] readmem

I hope this analysis helps more than it confuses :-)

As a conclusion, we could think of mapping larger chunks also for the
fragmented case of -d 31 to reduce the number of mmap/munmap calls
(a rough sketch of the idea follows below).

Another open question was why the mmap case consumes more CPU time in
readmem() than the read case. Our theory is that the first memory
access is slower because the data is not yet in the HW cache. For the
mmap case, user space issues the first access (the copy into the
makedumpfile cache), and for the read case the kernel issues the first
access (memcpy_real/copy_to_user). Therefore the cache miss is
accounted to user space for mmap and to the kernel for read.
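To make the "map larger chunks" idea concrete, here is a minimal
sketch (not makedumpfile code; the function name, MAP_SIZE, and the
static window variables are illustrative). The idea is to mmap an
aligned multi-megabyte window of /proc/vmcore and serve small reads
from it, remapping only when a request falls outside the current
window, so the fragmented -d31 pattern no longer costs one
mmap/munmap pair per small chunk:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE  ((uint64_t)4 << 20)   /* 4 MiB window, tunable */

static char    *map_base = MAP_FAILED;  /* current window mapping */
static uint64_t map_offset;             /* file offset of window start */

/*
 * Sketch only: assumes a 64-bit off_t, assumes the request fits in
 * one window (callers would have to split requests that straddle a
 * window boundary), and ignores ELF segment/file-size limits that
 * real code would have to clamp the window against.
 */
static int read_with_mmap_window(int fd, uint64_t offset,
                                 void *buf, size_t len)
{
	/* Remap only when the request leaves the current window. */
	if (map_base == MAP_FAILED ||
	    offset < map_offset ||
	    offset + len > map_offset + MAP_SIZE) {
		if (map_base != MAP_FAILED)
			munmap(map_base, MAP_SIZE);
		/* Align the window start; also page-aligns the offset. */
		map_offset = offset & ~(MAP_SIZE - 1);
		map_base = mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE,
				fd, map_offset);
		if (map_base == MAP_FAILED)
			return -1;
	}
	memcpy(buf, map_base + (offset - map_offset), len);
	return 0;
}

Because the window is aligned, the many nearby small reads of a
fragmented dump tend to hit the already-mapped window, so the
thousands of mmap/munmap pairs seen above for -d31 would collapse to
roughly memory-size/MAP_SIZE remaps.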
And last but not least, perhaps on s390 we could replace the bounce
buffer used for memcpy_real()/copy_to_user() with some more
intelligent solution.

Best Regards,
Michael

On Fri, 6 Mar 2015 15:03:12 +0100
Petr Tesarik <ptesarik at suse.cz> wrote:

> Because all pages must go into the cache, data is unnecessarily
> copied from mmapped regions to cache. Avoid this copying by storing
> the mmapped regions directly in the cache.
>
> First, the cache code needs a cleanup: a clarification of the
> concept, especially the meaning of the pending list (allocated cache
> entries whose content is not yet valid).
>
> Second, the cache must be able to handle differently sized objects
> so that it can store individual pages as well as mmapped regions.
>
> Last, the cache eviction code must be extended to allow either
> reusing the read buffer or unmapping the region.
>
> Changelog:
>   v2: cache cleanup _and_ actual mmap implementation
>   v1: only the cache cleanup
>
> Petr Tesarik (8):
>   cache: get rid of search loop in cache_add()
>   cache: allow to return a page to the pool
>   cache: do not allocate from the pending list
>   cache: add hit/miss statistics to the final report
>   cache: allocate buffers in one big chunk
>   cache: allow arbitrary size of cache entries
>   cache: store mapped regions directly in the cache
>   cleanup: remove unused page_is_fractional
>
>  cache.c        |  81 +++++++++++++++++----------------
>  cache.h        |  16 +++++--
>  elf_info.c     |  16 -------
>  elf_info.h     |   2 -
>  makedumpfile.c | 138 ++++++++++++++++++++++++++++++++++-----------------------
>  5 files changed, 138 insertions(+), 115 deletions(-)
>