For the past year or so I have been working on further developing my original prototype support for mapping read-only program text using large THP pages. The prototype, described below, is something I continue to work on, but the major issues I have yet to solve involve page cache integration and filesystem support.

At present, the conventional methodology of reading a single base page and using readahead to fill in additional pages isn't useful, as (in my prototype) the entire PMD-sized page needs to be read in before the page can be mapped (and at that point it is unclear whether readahead of additional PMD-sized pages would be of benefit or too costly). Additionally, there are no good interfaces at present to tell filesystem layers that content is desired in chunks larger than a hardcoded limit of 64K, or to read disk blocks in chunks appropriate for PMD-sized pages.

I very briefly discussed some of this work with Kirill in the past, and am currently somewhat blocked on progress with my prototype due to issues with multiorder page size support in the radix tree page cache. I don't feel it is worth the time to debug those issues since the radix tree page cache is dead; it's much more useful to help Matthew Wilcox get multiorder page support for XArray tested and approved upstream.

The following is a backgrounder on the work I have done to date and some performance numbers. Since it's just a prototype, I am unsure whether it would make a good topic for a discussion talk per se, but should I be invited to attend, it could certainly engender a good amount of discussion as a BOF/cross-discipline topic between the MM and FS tracks.

Thanks,
    William Kucharski

========================================

One of the downsides of THP as currently implemented is that it only supports large page mappings for anonymous pages. I embarked upon this prototype on the theory that it would be advantageous to be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a PMD-sized THP page upon first fault; if the allocation is successful, the page will be populated (at present using a call to kernel_read()) and mapped at the PMD level. If memory allocation fails, the page fault routines will drop through to the conventional PAGESIZE-oriented routines for mapping the faulting page.

Since this approach maps a PMD-sized block of the memory map at a time, we should see a slight uptick in time spent in disk I/O but a substantial drop in page faults, as well as a reduction in iTLB misses, since address ranges will be mapped with the larger page. Analysis of a test program that consists of a very large text area (483,138,032 bytes in size) and thrashes D$ and I$ shows this does occur, and there is a slight reduction in program execution time.

The text segment as seen from readelf:

    LOAD    0x0000000000000000  0x0000000000400000  0x0000000000400000
            0x000000001ccc19f0  0x000000001ccc19f0  R E     0x200000

As currently implemented for test purposes, the prototype will only use large pages to map an executable with a particular filename ("testr"), enabling easy comparison of the same executable using 4K and 2M (x64) pages on the same kernel. It is understood that this is just a proof-of-concept implementation, and much more work regarding enabling the feature and overall system usage of it would need to be done before it could be submitted as a kernel patch.
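To make the fault-path idea above concrete, here is a highly simplified sketch (not the prototype code itself) of what a handler hooked in via the existing ->huge_fault member of vm_operations_struct might look like. text_huge_fault() and text_map_pmd() are hypothetical names, and the actual PMD insertion step is deliberately left as a placeholder, since that is exactly the part that needs proper page cache and filesystem support:

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Stand-in for whatever mechanism ends up installing the PMD entry
 * for a file-backed page (a do_set_pmd()-style helper); purely a
 * placeholder so the sketch is self-contained.
 */
static vm_fault_t text_map_pmd(struct vm_fault *vmf, struct page *page)
{
	return VM_FAULT_FALLBACK;
}

/*
 * Hypothetical ->huge_fault() handler for a read-only text VMA:
 * allocate a PMD-sized page, populate it from the backing file with
 * kernel_read(), then map it at the PMD level.  Any failure falls
 * back to the conventional PAGESIZE fault path.
 */
static vm_fault_t text_huge_fault(struct vm_fault *vmf,
				  enum page_entry_size pe_size)
{
	struct file *filp = vmf->vma->vm_file;
	/* Read the PMD-aligned 2M chunk of the file containing the fault. */
	loff_t off = (loff_t)round_down(vmf->pgoff, HPAGE_PMD_NR) << PAGE_SHIFT;
	struct page *page;
	ssize_t nread;
	vm_fault_t ret;

	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;

	/* Try for a PMD-sized compound page; fall back quietly if we can't. */
	page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);
	if (!page)
		return VM_FAULT_FALLBACK;

	/*
	 * The entire large page must be populated before it can be
	 * mapped; page_address() is fine here assuming a 64-bit
	 * (highmem-less) kernel.
	 */
	nread = kernel_read(filp, page_address(page), HPAGE_PMD_SIZE, &off);
	if (nread != (ssize_t)HPAGE_PMD_SIZE) {
		__free_pages(page, HPAGE_PMD_ORDER);
		return VM_FAULT_FALLBACK;
	}

	ret = text_map_pmd(vmf, page);
	if (ret & VM_FAULT_FALLBACK)
		__free_pages(page, HPAGE_PMD_ORDER);
	return ret;
}

Note that for such a mapping to work at all, the file offset and virtual address must remain congruent modulo the PMD size, which is why the 0x200000 segment alignment shown in the readelf output above matters.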
However, I felt it would be worthwhile to send it out as an RFC so I can find out whether there are huge objections from the community to doing this at all, or gain a better understanding of the major concerns that must be assuaged before it would even be considered.

I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the equivalent of "always" and bypass some checks for anonymous pages by simply #ifdefing the code out; obviously I would need to determine the right thing to do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" follow; the 4K pagesize program was named "foo" and the 2M pagesize program "testr" (as noted above). Please note that these numbers do vary from run to run, but the orders of magnitude of the differences between the two versions remain relatively constant:

4K Pages:
=========

 Performance counter stats for './foo' (10 runs):

     307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
 1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
   562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
    20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
         2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
   180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
    40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
       232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
        23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
   <not supported>      L1-icache-loads:u
    74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
   180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
           707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
         5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
     1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

     307.093088771 seconds time elapsed                                               ( +-  0.20% )

2M Pages:
=========

 Performance counter stats for './testr' (10 runs):

     289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
 1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
   562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
    20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
         2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
   180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
    40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
       135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
         6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
   <not supported>      L1-icache-loads:u
    74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
   180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
               835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
         6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
        51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

     289.551551387 seconds time elapsed                                               ( +-  0.20% )
A check of /proc/meminfo with the test program running shows the large mappings:

    ShmemPmdMapped:   471040 kB

The obvious problem with this first swipe at things is that the large pages are not placed into the page cache, so, for example, multiple concurrent executions of the test program allocate and map the large pages each time. A greater architectural issue is the best way to support large pages in the page cache, which is something Matthew Wilcox's multiorder page support in XArray should solve.

Some questions:

* What is the best approach to deal with large pages when PAGESIZE mappings exist? At present, the prototype evicts PAGESIZE pages from the page cache, replacing them with a mapping for the large page; future mappings of a PAGESIZE range should then map using an offset into the PMD-sized physical page used to map the PMD-sized virtual page (a small sketch of this lookup appears at the end of this note).

* Do we need to create per-filesystem routines to handle large pages, or can we delay that? (Ideally we would want to be able to read in the contents of large pages without having to read_iter however many PAGESIZE pages we need.)

I am happy to take whatever approach is best to add large pages to the page cache, but it seems useful and crucial that a way be provided for the system to automatically use THP to map large text pages if so desired, read-only to begin with but eventually read/write to accommodate applications that self-modify code, such as databases and Java.

========================================
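As a purely illustrative aside on the first question above: once the large page is naturally aligned in the file (as the 0x200000 segment alignment guarantees for the text mapping), finding the base page that backs a PAGESIZE range reduces to simple sub-page arithmetic. The helper below is hypothetical and only shows the indexing into the compound page:

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper: given the head page of an already-present
 * PMD-sized page and the pgoff of a PAGESIZE fault that lands within
 * it, return the base page covering that offset.  Assumes the large
 * page is aligned to HPAGE_PMD_NR in the file, so the low bits of
 * pgoff index directly into it.
 */
static struct page *text_subpage(struct page *head, pgoff_t pgoff)
{
	VM_BUG_ON_PAGE(!PageHead(head), head);

	return head + (pgoff & (HPAGE_PMD_NR - 1));
}

The hard part, of course, is not the arithmetic but deciding how the page cache tracks the large page and the PAGESIZE entries it replaces.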