On Monday, 3 September 2012 09:04:03, Petr Tesarik wrote:
> On Monday, 3 September 2012 05:42:33, Atsushi Kumagai wrote:
> > Hello Petr,
> >
> > On Tue, 28 Aug 2012 19:49:49 +0200
> > Petr Tesarik <ptesarik at suse.cz> wrote:
> > > Add a simple cache for pages read from the dumpfile.
> > >
> > > This is a big win if we read consecutive data from one page, e.g.
> > > page descriptors, or even page table entries.
> > >
> > > Note that makedumpfile now always reads a complete page. This was
> > > already the case with the kdump-compressed and sadump formats, but
> > > makedumpfile was throwing most of the data away. In the
> > > kdump-compressed case, we may actually save a lot of decompression,
> > > too.
> > >
> > > I tried to keep the cache small to minimize the memory footprint,
> > > but it should be big enough to hold all pages needed for 4-level
> > > paging plus some data. This is needed e.g. for vmalloc areas or
> > > Xen page frame table data, which are not contiguous in physical
> > > memory.
> > >
> > > Signed-off-by: Petr Tesarik <ptesarik at suse.cz>
> >
> > This is interesting to me. I want to know how much performance
> > improves with this patch, so do you have any speed measurements?
>
> Not really. I only measured the hit/miss ratio, and when filtering a
> Xen domU at dump level 0, I got the following on a small system (2G
> RAM):
>
>   cache hit: 1818880   cache miss: 1873
>
> The improvement isn't big in the non-Xen case, because the hits are
> mostly due to virtual-to-physical translations, and most Linux data is
> stored at virtual addresses that can be resolved by adding/subtracting
> a fixed offset.
>
> Of course, you also save only the syscall overhead, because Linux
> keeps the data in the kernel page cache anyway. I'll measure the times
> for you on a reasonably large system (~256G) and send the results
> here.

I couldn't get a medium-sized system for testing, so I performed some
measurements on a 64G system instead.

I ran makedumpfile repeatedly from the kdump environment. The first run
was used to bring the target filesystem's metadata into the cache, and
caches were not dropped between runs, to minimize the effects of the
target filesystem. I ran against /proc/vmcore, i.e. the input file was
always resident, so there was nothing to skew the results. I tried both
a kdump-compressed file with no compression (to take gzip/LZO out of
the picture) and an ELF file. For the Xen case I only did the ELF file,
because the kdump-compressed format is not available there.

First I ran it on bare metal. There was a clear improvement for -d31:

kdump, no cache:
6.32user 55.20system 1:15.60elapsed 81%CPU (0avgtext+0avgdata 4800maxresident)k
2080inputs+5714296outputs (2major+342minor)pagefaults 0swaps

kdump, with cache:
6.02user 24.58system 0:46.51elapsed 65%CPU (0avgtext+0avgdata 4912maxresident)k
1864inputs+5714288outputs (2major+350minor)pagefaults 0swaps

ELF, no cache:
7.58user 74.25system 1:59.52elapsed 68%CPU (0avgtext+0avgdata 4800maxresident)k
728inputs+9288824outputs (1major+342minor)pagefaults 0swaps

ELF, with cache:
7.43user 44.21system 1:17.41elapsed 66%CPU (0avgtext+0avgdata 4896maxresident)k
728inputs+9288792outputs (1major+349minor)pagefaults 0swaps

To sum it up, I see an improvement of approx. 50% in system time. The
increase in memory consumption is a bit more than I would expect (why
do I see ~100k for a cache of 12k?), but acceptable nevertheless. User
time changed only slightly: I would have expected a small increase from
the cache overhead, but it actually went down a bit in both cases. I
don't have an explanation for that, but it's consistent.
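By the way, to make the worst-case discussion below easier to follow,
here is a minimal sketch of the kind of cache this is about: a fixed
array of page-sized slots, looked up by physical address, where the
least recently used slot is reclaimed on a miss. This is only an
illustration; the names and the read_from_dumpfile() stub are made up,
and the actual patch differs in the details.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_SLOTS 8    /* the 8 rotating page buffers mentioned below */

struct cache_entry {
        int valid;                /* slot currently holds a page */
        uint64_t paddr;           /* page-aligned physical address */
        unsigned long last_used;  /* global counter; smallest = LRU */
        unsigned char data[PAGE_SIZE];
};

static struct cache_entry cache[CACHE_SLOTS];
static unsigned long use_counter;
static unsigned long hits, misses;  /* the statistic quoted above */

/* Hypothetical backend: fill buf with the page at paddr.  Real code
 * would pread() from the dump file and decompress if needed. */
static int read_from_dumpfile(uint64_t paddr, void *buf)
{
        memset(buf, 0, PAGE_SIZE);  /* stub, so the sketch compiles */
        return 0;
}

/* Return a pointer to the byte at physical address paddr, reading the
 * containing page into the least recently used slot on a miss. */
static void *cache_lookup(uint64_t paddr)
{
        uint64_t page = paddr & ~(uint64_t)(PAGE_SIZE - 1);
        struct cache_entry *victim = &cache[0];
        int i;

        for (i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].valid && cache[i].paddr == page) {
                        cache[i].last_used = ++use_counter;  /* hit */
                        hits++;
                        return cache[i].data + (paddr - page);
                }
                if (cache[i].last_used < victim->last_used)
                        victim = &cache[i];
        }

        misses++;                   /* evict the LRU slot and refill */
        if (read_from_dumpfile(page, victim->data) < 0)
                return NULL;
        victim->valid = 1;
        victim->paddr = page;
        victim->last_used = ++use_counter;
        return victim->data + (paddr - page);
}

Note that a purely linear scan touches each page exactly once, so every
lookup takes the miss path; that is the worst case discussed next.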
I also tried running makedumpfile with -d1. This results in long
sequential reads, so it's the worst case for a simple LRU-policy cache.
The results are too unstable to make a reliable measurement, but there
seems to be a slight performance hit; it is certainly less than 5% of
total time. I think there are two reasons for it:

1. We copy the file data twice for each page: once from the kernel page
   cache into the process address space, and once from the internal
   cache to the destination.

2. Instead of reusing the same data location, we rotate through 8
   different pages (or even up to twice as many, if the allocated space
   is neither contiguous nor page-aligned). This stresses both the
   CPU's L1 d-cache and the TLB a tiny bit more.

Note that in the /proc/vmcore case, the kernel sequentially maps all
physical memory of the crashed system, so every cache page may be
evicted before we get to use it again. This could explain why I observe
an increase in system time despite making fewer system calls.

There are a number of things I could do to regain the old performance
if anybody is concerned about this slight worst-case regression (one
idea is sketched in the P.S. below). Just let me know.

Second, I ran with the Xen hypervisor. Since dump levels greater than 1
don't work, I ran with '-E -X -d1'. Even though this includes the
inefficient page walk described above, the improvement was immense:

no cache:
95.33user 657.18system 13:08.40elapsed 95%CPU (0avgtext+0avgdata 5440maxresident)k
704inputs+6563856outputs (1major+388minor)pagefaults 0swaps

with cache:
61.14user 110.15system 3:24.24elapsed 83%CPU (0avgtext+0avgdata 5584maxresident)k
2360inputs+6563872outputs (2major+396minor)pagefaults 0swaps

In short, roughly 75% shorter total time.

Petr Tesarik
SUSE Linux
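P.S. For the record, one thing that could regain the old worst-case
performance: a read-once lookup for sequential scans that reuses a
single dedicated buffer, so a linear scan neither evicts the hot LRU
entries nor rotates through eight buffer locations. Again just a rough
sketch building on the one above (cache_lookup_once() is made up for
illustration; it is not part of the posted patch):

/* Read-once path for sequential scans: one dedicated buffer instead
 * of the rotating LRU slots.  Reuses PAGE_SIZE and
 * read_from_dumpfile() from the sketch above. */

static unsigned char seq_buf[PAGE_SIZE];
static uint64_t seq_page = (uint64_t)-1;   /* no page loaded yet */

static void *cache_lookup_once(uint64_t paddr)
{
        uint64_t page = paddr & ~(uint64_t)(PAGE_SIZE - 1);

        /* Sequential readers hit the same page many times in a row,
         * so a single-entry buffer suffices and keeps the copies in
         * one L1-d-cache- and TLB-friendly location. */
        if (page != seq_page) {
                if (read_from_dumpfile(page, seq_buf) < 0)
                        return NULL;
                seq_page = page;
        }
        return seq_buf + (paddr - page);
}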