[LSF/MM TOPIC][LSF/MM ATTEND] Multiple Page Caches, Memory Tiering, Better LRU evictions,


 



I’d like to attend and propose one or all of the following topics at this year’s summit.

 

Multiple Page Caches (Software Enhancements)

--------------------------

Support for multiple page caches can provide many benefits to the kernel.

Different memory types can be placed into different page caches: one page cache for native DDR system memory, another for slower NV-DIMMs, and so on.

General memory can also be partitioned into several page caches of different sizes. Individual caches could be dedicated to high-priority processes, or used with containers to better isolate memory by dedicating a page cache to a cgroup.

Each VMA or process could carry a page cache identifier, or page alloc/free callbacks, allowing individual VMAs or processes to specify which page cache they want to use.

Some VMAs might want anonymous memory backed by vast amounts of slower, server-class memory like NV-DIMMs.

Some processes or individual VMAs might want their own private page cache.

Each page cache can have its own eviction policy and low-water marks.

Individual page caches could also have their own swap device.
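
To make this concrete, here is a rough userspace sketch, not kernel code; every name in it (page_cache_desc, vma_like, the policy enum) is made up for illustration. It models a per-page-cache descriptor carrying its own eviction policy, watermarks, and swap device, plus a per-VMA selector that picks which cache backs a mapping:

/*
 * Rough userspace model only; page_cache_desc, vma_like and the
 * policy names are all hypothetical, not existing kernel types.
 */
#include <stdio.h>

enum pc_policy { PC_POLICY_LRU, PC_POLICY_FIFO };

struct page_cache_desc {
        int             id;              /* page cache identifier      */
        enum pc_policy  policy;          /* per-cache eviction policy  */
        size_t          low_watermark;   /* start reclaim below this   */
        size_t          high_watermark;  /* stop reclaim above this    */
        const char     *swap_device;     /* per-cache swap target      */
};

/* Stand-in for a VMA that carries a page cache identifier. */
struct vma_like {
        unsigned long            start, end;
        struct page_cache_desc  *cache;  /* cache backing this VMA     */
};

/* Pick the cache a fault on this VMA should allocate from. */
static struct page_cache_desc *select_cache(struct vma_like *vma,
                                            struct page_cache_desc *fallback)
{
        return vma->cache ? vma->cache : fallback;
}

int main(void)
{
        struct page_cache_desc ddr    = { 0, PC_POLICY_LRU,  1024,  4096,
                                          "/dev/swap-fast" };
        struct page_cache_desc nvdimm = { 1, PC_POLICY_FIFO, 8192, 32768,
                                          "/dev/swap-slow" };
        struct vma_like vma = { 0x1000, 0x9000, &nvdimm };

        printf("VMA %lx-%lx uses cache %d (swap %s)\n",
               vma.start, vma.end,
               select_cache(&vma, &ddr)->id,
               select_cache(&vma, &ddr)->swap_device);
        return 0;
}

In a real implementation these knobs would of course hang off the existing address space and reclaim machinery; the only point here is that policy, watermarks, and swap become per-cache rather than global.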

 

Memory Tiering (Software Enhancements)

--------------------

With multiple page caches, pages evicted from one page cache could be copied and remapped into another page cache instead of being unmapped and written to swap.

If a system has 16GB of high-speed DDR memory and 64GB of slower memory, one could create one page cache backed by the high-speed DDR memory and another backed by the slower 64GB of memory, and evict/copy/remap from the DDR page cache into the slow-memory page cache. Evictions from the slow-memory page cache would then be unmapped and written to swap.
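
A toy model of that demotion path, assuming just two tiers and made-up names (struct tier, evict_one), might look like this:

/*
 * Toy model of the demotion path; struct tier and evict_one are
 * made up for illustration and sizes are in 4KB pages.
 */
#include <stdio.h>

struct tier {
        const char  *name;
        size_t       capacity;   /* pages this tier can hold          */
        size_t       used;       /* pages currently resident          */
        struct tier *next;       /* slower tier, or NULL (swap next)  */
};

static void evict_one(struct tier *t)
{
        if (t->used == 0)
                return;
        t->used--;
        if (t->next && t->next->used < t->next->capacity) {
                /* copy + remap into the slower tier; page stays mapped */
                t->next->used++;
                printf("demoted one page: %s -> %s\n", t->name, t->next->name);
        } else {
                /* no slower tier with room: unmap and write to swap */
                printf("swapped one page out of %s\n", t->name);
        }
}

int main(void)
{
        struct tier slow = { "slow-64G", 64UL << 18, (64UL << 18) - 1, NULL };
        struct tier fast = { "ddr-16G",  16UL << 18,  16UL << 18, &slow };

        evict_one(&fast);  /* room in the slow tier: copy/remap (demote) */
        evict_one(&fast);  /* slow tier now full: unmap and swap         */
        evict_one(&slow);  /* last tier: unmap and swap                  */
        return 0;
}

Evictions walk down the tier chain and only fall back to swap when no slower tier has room, which is exactly the DDR to slow memory to swap flow described above.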

 

Better LRU evictions (Software and Hardware Enhancements)

-------------------------

Add a page fault counter to the page struct to help colorize page demand.

We could suggest to Intel/AMD and other architecture leaders that TLB entries carry a translation counter (8-10 bits is sufficient) instead of just an “accessed” bit. Scanning and clearing accessed bits is obviously inefficient; if TLB entries instead had a translation counter, scanning and recording how much activity each entry has seen would be significantly more useful and would let us better identify LRU pages for eviction.
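
As an illustration only (no current hardware exposes a per-entry translation counter, and fault_count is not a real struct page field), the per-page fault counter above and the proposed TLB counter could be folded into one heat value that reclaim uses to pick a victim, instead of guessing from a single accessed bit:

/*
 * Illustration only: no current hardware exposes a per-TLB-entry
 * translation counter, and fault_count is not a real struct page
 * field.  The point is just how both signals could feed one score.
 */
#include <stdint.h>
#include <stdio.h>

struct page_meta {
        uint32_t fault_count;  /* bumped by the fault path (software)      */
        uint8_t  tlb_count;    /* 8-10 bit translation counter (hardware)  */
};

/* Combine both signals; the weights are arbitrary for the example. */
static uint32_t page_heat(const struct page_meta *p)
{
        return p->fault_count * 4 + p->tlb_count;
}

/* Evict the coldest page instead of guessing from one accessed bit. */
static int pick_eviction_victim(const struct page_meta *pages, int n)
{
        int coldest = 0;

        for (int i = 1; i < n; i++)
                if (page_heat(&pages[i]) < page_heat(&pages[coldest]))
                        coldest = i;
        return coldest;
}

int main(void)
{
        struct page_meta pages[] = {
                { .fault_count = 3, .tlb_count = 200 },
                { .fault_count = 0, .tlb_count =   2 },  /* the cold one */
                { .fault_count = 9, .tlb_count =  90 },
        };

        printf("evict page %d\n", pick_eviction_victim(pages, 3));
        return 0;
}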

 

TLB Shootdown (Hardware Enhancements)

--------------------------

We should stomp our feet and demand that TLB shootdowns be hardware-assisted in future architectures. Current TLB shootdown on x86 is horribly inefficient and obviously doesn’t scale. The QPI/UPI local bus protocol should provide a TLB range invalidation broadcast so that a single CPU can concurrently notify other CPUs/cores (with a selection mask) that a shared TLB entry has changed. Sending an IPI to each core is wasteful, especially with core counts increasing and the frequency of TLB unmapping/remapping also likely to increase soon with new server-class memory extension technology.
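
The scaling argument is simple arithmetic: today the initiating CPU sends one IPI per remote core, while a hypothetical bus-level broadcast carrying a selection mask would be one message regardless of core count. A toy model:

/* Toy arithmetic only; no real IPI or bus interfaces are modeled. */
#include <stdio.h>

/* Messages the initiating CPU sends today: one IPI per remote core. */
static unsigned ipi_shootdown_msgs(unsigned online_cpus)
{
        return online_cpus - 1;
}

/* Hypothetical broadcast: one bus message, the CPU mask rides along. */
static unsigned broadcast_shootdown_msgs(void)
{
        return 1;
}

int main(void)
{
        for (unsigned cpus = 16; cpus <= 256; cpus *= 2)
                printf("%3u CPUs: %3u IPIs vs %u broadcast message\n",
                       cpus, ipi_shootdown_msgs(cpus),
                       broadcast_shootdown_msgs());
        return 0;
}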

 

Page Tables, Interrupt Descriptor Table, Global Descriptor table, etc (Software and Hardware Enhancements)

-----------------------------------------------------------------------------------

As small amounts of ultra-high-speed memory become available on servers (for example, On-Package Memory from Intel), it would be good to use this memory first for structures that should always have the lowest latency, such as the interrupt descriptor table, and possibly for some or all of the page tables to allow faster TLB fills and evictions, since their frequency and latency directly affect overall load/store performance. It would also be worth putting some of the most frequently accessed kernel data, such as the current PID, into this ultra-high-speed memory.
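
As a rough illustration (userspace model, made-up names, placeholder sizes), placement could be as simple as a small bump allocator over the reserved on-package region that latency-critical structures are carved out of first, with a fallback to ordinary DDR:

/*
 * Userspace model with made-up names; the "fast region" stands in
 * for a reserved on-package carve-out and the table sizes are only
 * placeholders.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FAST_REGION_SIZE (64 * 1024)

static uint8_t fast_region[FAST_REGION_SIZE];
static size_t  fast_used;

static void *alloc_prefer_fast(size_t size)
{
        if (fast_used + size <= FAST_REGION_SIZE) {
                void *p = fast_region + fast_used;

                fast_used += size;
                return p;            /* lowest-latency placement */
        }
        return malloc(size);         /* fall back to normal DDR  */
}

int main(void)
{
        void *idt = alloc_prefer_fast(256 * 16);  /* IDT-sized table     */
        void *gdt = alloc_prefer_fast(64 * 8);    /* GDT-sized table     */
        void *pt  = alloc_prefer_fast(4096);      /* one page-table page */

        printf("idt=%p gdt=%p pt=%p (fast bytes used: %zu)\n",
               idt, gdt, pt, fast_used);
        return 0;
}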

 

Over the last few years I’ve implemented all of these in a private kernel, with the exception of the hardware enhancements mentioned above. With support for multiple page caches, multiple swap devices, individual page coloring, and better LRU evictions, I’ve realized up to 30% overall performance improvements when testing large, memory-exhausting applications like MongoDB with MMAPv1. I’ve also implemented transparent memory tiering using an Intel 3DXP DIMM simulator as a second tier of slower memory. I’d love to discuss everything I’ve done in this space and see if there is interest in moving some of it into the mainline kernel, or if I could offer help with similar efforts that might already be active.

 

Thanks,

 

Adrian Michaud

 

 

