[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]

ats-kumagai@xxxxxxxxxxxxx (Atsushi Kumagai) · Mon, 16 Mar 2015 05:14:25 +0000

>On Fri, 13 Mar 2015 04:10:22 +0000
>Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>
>> Hello,
>>
>> (Note: my email address has changed.)
>>
>> In x86_64, calling ioremap/iounmap per page in copy_oldmem_page()
>> causes big performance degradation, so mmap() was introduced on
>> /proc/vmcore. However, there is no big difference between read() and
>> mmap() in s390 since it doesn't need ioremap/iounmap in copy_oldmem_page(),
>> so other issues have been revealed, right?
>>
>> [...]
>>
>> >> I counted the mmap and read system calls with "perf stat":
>> >>
>> >>                      mmap   unmap   read =    sum
>> >>   ===============================================
>> >>   mmap -d0            482     443    165     1090
>> >>   mmap -d31         13454   13414    165    27033
>> >>   non-mmap -d0         34       3 458917   458954
>> >>   non-mmap -d31        34       3  74273    74310
>> >
>> >If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
>> >reasonable. For -d31, we should be able to do better than this
>> >by allocating more cache slots and improving the algorithm.
>> >I originally didn't deem it worth the effort, but seeing almost
>> >30 times more mmaps than with -d0 may change my mind.
>>
>> Are you going to do it as v3 patch?
>
>No. Tuning the caching algorithm requires a lot of research. I plan to
>do it, but testing it with all scenarios (and tuning the algorithm
>based on the results) will probably take weeks. I don't think it makes
>sense to wait for it.
>
>> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
>> you don't plan to update it.
>
>Since v2 already brings some performance gain, I appreciate it if you
>can adopt it for v1.5.8.

OK, but unfortunately my tests hit errors like the following:

  $ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
  Excluding free pages               : [  0.0 %] /
  reset_bitmap_of_free_pages: The free list is broken.
  reset_bitmap_of_free_pages: The free list is broken.

  makedumpfile Failed.
  $

All of the errors I saw in my tests were the same as the above.
git bisect points to [PATCH v2 7/8] as the commit that introduces them,
but the root cause is still under investigation.

  if (!readmem(VADDR, curr+OFFSET(list_head.prev),
               &curr_prev, sizeof curr_prev)) {     // get wrong value here
          ERRMSG("Can't get prev list_head.\n");
          return FALSE;
  }
  if (previous != curr_prev) {
          ERRMSG("The free list is broken.\n");
          retcd = ANALYSIS_FAILED;
          return FALSE;
  }

Thanks
Atsushi Kumagai

>Thank you very much,
>Petr Tesarik
>
>> Thanks
>> Atsushi Kumagai
>>
>> >
>> >> Here are the actual results I got with "perf record":
>> >>
>> >> $ time ./makedumpfile  -d 31 /proc/vmcore  /dev/null -f
>> >>
>> >>   Output of "perf report" for mmap case:
>> >>
>> >>    /* Most time spent for unmap in kernel */
>> >>    29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
>> >>     9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
>> >>     8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
>> >>
>> >>    /* Still some mmap overhead in makedumpfile readmem() */
>> >>    21.56%  makedumpfile  makedumpfile       [.] readmem
>> >
>> >This number is interesting. Did you compile makedumpfile with
>> >optimizations? If yes, then this number probably includes some
>> >functions which were inlined.
>> >
>> >>     8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >>
>> >>   Output of "perf report" for non-mmap case:
>> >>
>> >>    /* Time spent for sys_read (that needs also two copy operations on s390 :() */
>> >>    25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
>> >>    22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
>> >>
>> >>    /* readmem() for read path is cheaper ? */
>> >>    13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>> >>     4.53%  makedumpfile  makedumpfile       [.] readmem
>> >
>> >Yes, much lower overhead of readmem is strange. For a moment I
>> >suspected wrong accounting of the page fault handler, but then I
>> >realized that for /proc/vmcore, all page table entries are created
>> >with the present bit set already, so there are no page faults...
>> >
>> >I haven't had time yet to set up a system for reproduction, but I'll
>> >try to identify what's eating up the CPU time in readmem().
>> >
>> >>[...]
>> >> I hope this analysis helps more than it confuses :-)
>> >>
>> >> As a conclusion, we could think of mapping larger chunks
>> >> also for the fragmented case of -d 31 to reduce the amount
>> >> of mmap/munmap calls.
>> >
>> >I agree in general. Memory mapped through /proc/vmcore does not
>> >increase run-time memory requirements, because it only adds a mapping
>> >to the old kernel's memory. The only limiting factor is the virtual
>> >address space. On many architectures, this is no issue at all, and we
>> >could simply map the whole file at beginning. On some architectures,
>> >the virtual address space is smaller than possible physical RAM, so
>> >this approach would not work for them.
>> >
>> >> Another open question was why the mmap case consumes more CPU
>> >> time in readmem() than the read case. Our theory is that the
>> >> first memory access is slower because it is not in the HW
>> >> cache. For the mmap case userspace issues the first access (copy
>> >> to makedumpfile cache) and for the read case the kernel issues
>> >> the first access (memcpy_real/copy_to_user). Therefore the
>> >> cache miss is accounted to userspace for mmap and to kernel for
>> >> read.
>> >
>> >I have no idea how to measure this on s390. On x86_64 I would add
>> >some asm code to read TSC before and after the memory access
>> >instruction. I guess there is a similar counter on s390. Suggestions?
>> >
>> >> And last but not least, perhaps on s390 we could replace
>> >> the bounce buffer used for memcpy_real()/copy_to_user() by
>> >> some more intelligent solution.
>> >
>> >Which would then improve the non-mmap times even more, right?
>> >
>> >Petr T