From: Vivek Goyal <vgoyal@xxxxxxxxxx>
Subject: Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
Date: Thu, 17 Jan 2013 17:13:48 -0500

> On Thu, Jan 10, 2013 at 08:59:34PM +0900, HATAYAMA Daisuke wrote:
>> Currently, kdump reads the 1st kernel's memory, called old memory in
>> the source code, using ioremap one page at a time. This causes a big
>> performance degradation, since a page table modification and a TLB
>> flush happen each time a single page is read.
>>
>> This issue turned up in Cliff's kernel-space filtering work.
>>
>> To avoid calling ioremap, we map the whole of the 1st kernel's memory
>> targeted as vmcore regions in the direct mapping table. This gives a
>> big performance improvement. See the following simple benchmark.
>>
>> Machine spec:
>>
>> | CPU    | Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*) |
>> | Memory | 32 GB                                                            |
>> | Kernel | 3.7 vanilla and with this patch set                              |
>>
>> (*) only 1 cpu is used in the 2nd kernel now.
>>
>> Benchmark:
>>
>> I executed the following command on the 2nd kernel and recorded the
>> real time:
>>
>> $ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
>>
>> [3.7 vanilla]
>>
>> | block size | time      | performance |
>> | [KB]       |           | [MB/sec]    |
>> |------------+-----------+-------------|
>> |          4 | 5m 46.97s |       93.56 |
>> |          8 | 4m 20.68s |      124.52 |
>> |         16 | 3m 37.85s |      149.01 |
>>
>> [3.7 with this patch]
>>
>> | block size | time   | performance |
>> | [KB]       |        | [GB/sec]    |
>> |------------+--------+-------------|
>> |          4 | 17.59s |        1.85 |
>> |          8 | 14.73s |        2.20 |
>> |         16 | 14.26s |        2.28 |
>> |         32 | 13.38s |        2.43 |
>> |         64 | 12.77s |        2.54 |
>> |        128 | 12.41s |        2.62 |
>> |        256 | 12.50s |        2.60 |
>> |        512 | 12.37s |        2.62 |
>> |       1024 | 12.30s |        2.65 |
>> |       2048 | 12.29s |        2.64 |
>> |       4096 | 12.32s |        2.63 |
>
> These are impressive improvements. I missed the discussion on mmap(),
> so why couldn't we provide an mmap() interface for /proc/vmcore?
> If that works, then the application can choose to mmap/munmap bigger
> chunks of the file (instead of ioremap mapping/remapping a page at a
> time).
>
> And if the application controls the size of the mapping, then it can
> vary the mapping size based on the amount of free memory available.
> That way, if somebody reserves a smaller amount of memory, we could
> still dump, but with some time penalty.

mmap() needs a user-space page table in addition to the kernel-space
one, and it looks like remap_pfn_range(), which creates the user-space
page table, doesn't support large pages, only 4KB pages. If we mmap
only small chunks to fit in a small amount of reserved memory, then we
would again face the same issue as with ioremap. I don't know whether
hugetlbfs supports mmap with 1GB pages now.

Another idea to reduce the size of the page table is to extend the
mapping ranges to cover the whole of memory with as many 1GB pages as
possible. For example, suppose M is the size of system memory; then
the total size of the PGD and PUD pages needed to cover M is:

  ( 1 + roundup(M, 512GB) / 512GB ) * PAGE_SIZE
    ~   ~~~~~~~~~~~~~~~~~~~~~~~~~
    ^               ^
    |               |
  PGD page       PUD pages

Ideally, a 2TB system can be covered with only 20KB and a 16TB system
with only 132KB. So I first want to evaluate this logic. Although I
have not actually checked yet, I expect most memory maps on
terabyte-memory machines consist of 1GB-aligned huge chunks.

Thanks.
HATAYAMA, Daisuke