For the past year or so I have been working on further developing my original prototype support for mapping read-only program text using large THP pages. The prototype, described below, is something I continue to work on, but the major issues I have yet to solve involve page cache integration and filesystem support.

At present, the conventional methodology of reading a single base page and using readahead to fill in additional pages isn't useful, as (in my prototype) the entire PMD-sized page needs to be read in before the page can be mapped (and at that point it is unclear whether readahead of additional PMD-sized pages would be of benefit or too costly). Additionally, there are no good interfaces at present to tell filesystem layers that content is desired in chunks larger than a hardcoded limit of 64K, or to read disk blocks in chunks appropriate for PMD-sized pages.

I very briefly discussed some of this work with Kirill in the past, and am currently somewhat blocked on progress with my prototype due to issues with multiorder page size support in the radix tree page cache. I don't feel it is worth the time to debug those issues since the radix tree page cache is dead; it's much more useful to help Matthew Wilcox get multiorder page support for XArray tested and approved upstream.

The following is a backgrounder on the work I have done to date and some performance numbers. Since it's just a prototype, I am unsure whether it would make a good topic for a discussion talk per se, but should I be invited to attend, it could certainly engender a good amount of discussion as a BOF/cross-discipline topic between the MM and FS tracks.

Thanks,
    William Kucharski

========================================

One of the downsides of THP as currently implemented is that it only supports large page mappings for anonymous pages. I embarked upon this prototype on the theory that it would be advantageous to be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a PMD-sized THP page upon first fault; if the allocation is successful, the page will be populated (at present using a call to kernel_read()) and mapped at the PMD level. If memory allocation fails, the page fault routines will drop through to the conventional PAGESIZE-oriented routines for mapping the faulting page.

Since this approach maps a PMD-sized block of the memory map at a time, we should see a slight uptick in time spent in disk I/O but a substantial drop in page faults, as well as a reduction in iTLB misses, since address ranges will be mapped with the larger page. Analysis of a test program that consists of a very large text area (483,138,032 bytes in size) and thrashes D$ and I$ shows this does occur, and there is a slight reduction in program execution time.

The text segment as seen from readelf:

    LOAD    0x0000000000000000  0x0000000000400000  0x0000000000400000
            0x000000001ccc19f0  0x000000001ccc19f0  R E     0x200000

As currently implemented for test purposes, the prototype will only use large pages to map an executable with a particular filename ("testr"), enabling easy comparison of the same executable using 4K and 2M (x64) pages on the same kernel. It is understood that this is just a proof-of-concept implementation, and much more work regarding enabling the feature and overall system usage of it would need to be done before it could be submitted as a kernel patch.
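To make the fault-path idea above concrete, here is a highly simplified sketch (not the prototype code itself) of what a handler hooked in via the existing ->huge_fault member of vm_operations_struct might look like. text_huge_fault() and text_map_pmd() are hypothetical names, and the actual PMD insertion step is deliberately left as a placeholder, since that is exactly the part that needs proper page cache and filesystem support:

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Stand-in for whatever mechanism ends up installing the PMD entry
 * for a file-backed page (a do_set_pmd()-style helper); purely a
 * placeholder so the sketch is self-contained.
 */
static vm_fault_t text_map_pmd(struct vm_fault *vmf, struct page *page)
{
	return VM_FAULT_FALLBACK;
}

/*
 * Hypothetical ->huge_fault() handler for a read-only text VMA:
 * allocate a PMD-sized page, populate it from the backing file with
 * kernel_read(), then map it at the PMD level.  Any failure falls
 * back to the conventional PAGESIZE fault path.
 */
static vm_fault_t text_huge_fault(struct vm_fault *vmf,
				  enum page_entry_size pe_size)
{
	struct file *filp = vmf->vma->vm_file;
	/* Read the PMD-aligned 2M chunk of the file containing the fault. */
	loff_t off = (loff_t)round_down(vmf->pgoff, HPAGE_PMD_NR) << PAGE_SHIFT;
	struct page *page;
	ssize_t nread;
	vm_fault_t ret;

	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;

	/* Try for a PMD-sized compound page; fall back quietly if we can't. */
	page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);
	if (!page)
		return VM_FAULT_FALLBACK;

	/*
	 * The entire large page must be populated before it can be
	 * mapped; page_address() is fine here assuming a 64-bit
	 * (highmem-less) kernel.
	 */
	nread = kernel_read(filp, page_address(page), HPAGE_PMD_SIZE, &off);
	if (nread != (ssize_t)HPAGE_PMD_SIZE) {
		__free_pages(page, HPAGE_PMD_ORDER);
		return VM_FAULT_FALLBACK;
	}

	ret = text_map_pmd(vmf, page);
	if (ret & VM_FAULT_FALLBACK)
		__free_pages(page, HPAGE_PMD_ORDER);
	return ret;
}

Note that for such a mapping to work at all, the file offset and virtual address must remain congruent modulo the PMD size, which is why the 0x200000 segment alignment shown in the readelf output above matters.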
However, I felt it would be worthwhile to send it out as an RFC so I can find out whether there are huge objections from the community to doing this at all, or gain a better understanding of the major concerns that must be assuaged before it would even be considered.

I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the equivalent of "always" and bypass some checks for anonymous pages by simply #ifdefing the code out; obviously I would need to determine the right thing to do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" follow; the 4K pagesize program was named "foo" and the 2M pagesize program "testr" (as noted above). Please note that these numbers do vary from run to run, but the orders of magnitude of the differences between the two versions remain relatively constant:

4K Pages:
=========

 Performance counter stats for './foo' (10 runs):

     307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
 1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
   562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
    20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
         2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
   180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
    40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
       232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
        23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
   <not supported>      L1-icache-loads:u
    74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
   180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
           707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
         5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
     1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

     307.093088771 seconds time elapsed                                               ( +-  0.20% )

2M Pages:
=========

 Performance counter stats for './testr' (10 runs):

     289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
 1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
   562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
    20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
         2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
   180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
    40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
       135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
         6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
   <not supported>      L1-icache-loads:u
    74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
   180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
               835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
         6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
        51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

     289.551551387 seconds time elapsed                                               ( +-  0.20% )
A check of /proc/meminfo with the test program running shows the large mappings:

    ShmemPmdMapped:   471040 kB

The obvious problem with this first swipe at things is that the large pages are not placed into the page cache, so, for example, multiple concurrent executions of the test program allocate and map the large pages each time. A greater architectural issue is the best way to support large pages in the page cache, which is something Matthew Wilcox's multiorder page support in XArray should solve.

Some questions:

* What is the best approach to deal with large pages when PAGESIZE mappings exist? At present, the prototype evicts PAGESIZE pages from the page cache, replacing them with a mapping for the large page; future mappings of a PAGESIZE range should then map using an offset into the PMD-sized physical page used to map the PMD-sized virtual page (a small sketch of this lookup appears at the end of this note).

* Do we need to create per-filesystem routines to handle large pages, or can we delay that? (Ideally we would want to be able to read in the contents of large pages without having to read_iter however many PAGESIZE pages we need.)

I am happy to take whatever approach is best to add large pages to the page cache, but it seems useful and crucial that a way be provided for the system to automatically use THP to map large text pages if so desired, read-only to begin with but eventually read/write to accommodate applications that self-modify code, such as databases and Java.

========================================
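As a purely illustrative aside on the first question above: once the large page is naturally aligned in the file (as the 0x200000 segment alignment guarantees for the text mapping), finding the base page that backs a PAGESIZE range reduces to simple sub-page arithmetic. The helper below is hypothetical and only shows the indexing into the compound page:

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper: given the head page of an already-present
 * PMD-sized page and the pgoff of a PAGESIZE fault that lands within
 * it, return the base page covering that offset.  Assumes the large
 * page is aligned to HPAGE_PMD_NR in the file, so the low bits of
 * pgoff index directly into it.
 */
static struct page *text_subpage(struct page *head, pgoff_t pgoff)
{
	VM_BUG_ON_PAGE(!PageHead(head), head);

	return head + (pgoff & (HPAGE_PMD_NR - 1));
}

The hard part, of course, is not the arithmetic but deciding how the page cache tracks the large page and the PAGESIZE entries it replaces.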