RE: [LSF/MM TOPIC][LSF/MM ATTEND] Multiple Page Caches, Memory Tiering, Better LRU evictions,

>-----Original Message-----
>From: owner-linux-mm@xxxxxxxxx [mailto:owner-linux-mm@xxxxxxxxx] On Behalf Of Kirill A. Shutemov
>Sent: Friday, January 13, 2017 6:57 PM
>To: Michaud, Adrian <Adrian.Michaud@xxxxxxx>
>Cc: lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx
>Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] Multiple Page Caches, Memory Tiering, Better LRU evictions,
>
>On Fri, Jan 13, 2017 at 09:49:14PM +0000, Michaud, Adrian wrote:
>> I'd like to attend and propose one or all of the following topics at this year's summit.
>> 
>> Multiple Page Caches (Software Enhancements)
>> --------------------------
>> Support for multiple page caches can provide many benefits to the kernel.
>> Different memory types can be put into different page caches. One page 
>> cache for native DDR system memory, another page cache for slower 
>> NV-DIMMs, etc.
>> General memory can be partitioned into several page caches of 
>> different sizes and could also be dedicated to high priority processes 
>> or used with containers to better isolate memory by dedicating a page 
>> cache to a cgroup process.
>> Each VMA, or process, could have a page cache identifier, or page 
>> alloc/free callbacks that allow individual VMAs or processes to 
>> specify which page cache they want to use.
>> Some VMAs might want anonymous memory backed by vast amounts of slower 
>> server class memory like NV-DIMMS.
>> Some processes or individual VMAs might want their own private page 
>> cache.
>> Each page cache can have its own eviction policy and low-water marks.
>> Individual page caches could also have their own swap device.
>
>Sounds like you're re-inventing NUMA.
>What am I missing?


Think of separate, fully isolated page caches. Each page cache can have a dedicated swap device if desired, and each could have a different eviction policy (FIFO, LRU, custom, etc.). You could have one page cache for the kernel and one or more for individual processes or process groups if you want to fully isolate memory resources. You could dynamically create as many page caches as you like and dedicate or share them among applications. If you have a noisy neighbor, you could give it an appropriately sized dedicated page cache, and it becomes fully bound to the size and eviction policy of that cache.
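Something roughly like this, as a sketch only (struct isolated_page_cache and all of its fields are made up for illustration; nothing here is existing kernel code):

/* Hypothetical descriptor for one isolated page cache. */
struct isolated_page_cache {
	struct list_head	active_list;	/* per-cache LRU lists */
	struct list_head	inactive_list;
	unsigned long		nr_pages;	/* current size */
	unsigned long		low_wmark;	/* per-cache low-water mark */
	unsigned long		high_wmark;
	const struct ipc_evict_ops *evict_ops;	/* FIFO, LRU, custom, ... */
	struct swap_info_struct	*swap_dev;	/* optional dedicated swap device */
	struct isolated_page_cache *next_tier;	/* used by the tiering proposal below */
};

/* Each VMA or process would carry a pointer (or identifier) selecting
 * which of these caches its allocations and evictions go through. */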


>
>> Memory Tiering (Software Enhancements)
>> --------------------
>> Using multiple page caches, evictions from one page cache could be 
>> moved and remapped to another page cache instead of unmapped and 
>> written to swap.
>> If a system has 16GB of high speed DDR memory, and 64GB of slower 
>> memory, one could create a page cache with high speed DDR memory, 
>> another page cache with slower 64GB memory, and evict/copy/remap from 
>> the DDR page cache to the slow memory page cache. Evictions from the 
>> slow memory page cache would then get unmapped and written to swap.
>
>I guess it's something that can be done as part of NUMA balancing.

With support for multiple isolated page caches, you could simply tier them. If a page cache has a ->next tier, then evicted pages are allocated/copied/PTE-remapped into that next tier instead of being unmapped and swapped out. Block I/O evictions only occur when the page cache doesn't have a ->next tier.
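In rough pseudo-kernel C (ipc_migrate_page() and ipc_swap_out_page() are made-up helpers building on the sketch above, not existing functions):

/* Hypothetical eviction path for a tiered page cache. */
static int ipc_evict_page(struct isolated_page_cache *cache,
			  struct page *page)
{
	/* If there is a slower tier below us, allocate a page there,
	 * copy the data, and remap the PTEs to the new page instead of
	 * unmapping and writing to swap. */
	if (cache->next_tier)
		return ipc_migrate_page(cache, cache->next_tier, page);

	/* Last tier: fall back to the normal unmap + swap-out path. */
	return ipc_swap_out_page(cache, page);
}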

>
>> Better LRU evictions (Software and Hardware Enhancements)
>> -------------------------
>> Add a page fault counter to the page struct to help colorize page demand.
>> We could suggest to Intel/AMD and other architecture leaders that TLB 
>> entries also have a translation counter (8-10 bits is sufficient) 
>> instead of just an "accessed" bit.  Scanning/clearing access bits is 
>> obviously inefficient; however, if TLBs had a translation counter 
>> instead of a single accessed bit then scanning and recording the 
>> amount of activity each TLB has would be significantly better and 
>> allow us to better calculate LRU pages for evictions.
>
>Except that would make memory accesses slower.
>
>Even access bit handling is a noticeable performance hit: the processor has to write into the page table entry on first access to the page.
>What you're proposing is making the first 2^8-2^10 accesses to each page slower.
>
>Sounds like a no-go for me.

Good point, but the translation counter would only need to be written back to the page table when the TLB entry gets evicted, not on every translation. We would also want this to be an optional TLB feature with an enable bit, and only use it when it makes sense to. It would also be a great memory profiling tool.
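As a sketch of how the kernel side might consume such counters (everything here is hypothetical: the hardware feature, the page_activity bookkeeping, and the field names are made up for illustration):

/* Hypothetical bookkeeping: the 8-10 bit translation counter that the
 * TLB writes back on entry eviction, harvested by a periodic scan,
 * plus the per-page fault counter proposed above. */
struct page_activity {
	struct page	*page;
	unsigned int	translations;	/* translations since last scan */
	unsigned int	faults;		/* page faults since last scan */
};

/* Rank eviction candidates: fewer translations since the last scan
 * means the page is colder and a better candidate for eviction. */
static bool page_is_colder(const struct page_activity *a,
			   const struct page_activity *b)
{
	if (a->translations != b->translations)
		return a->translations < b->translations;
	return a->faults < b->faults;	/* tie-break on fault count */
}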

>
>> TLB Shootdown (Hardware Enhancements)
>> --------------------------
>> We should stomp our feet and demand that TLB shootdowns should be 
>> hardware assisted in future architectures. Current TLB shootdown on 
>> x86 is horribly inefficient and obviously doesn't scale. The QPI/UPI 
>> local bus protocol should provide TLB range invalidation broadcast so 
>> that a single CPU can concurrently notify other CPU/cores (with a 
>> selection
>> mask) that a shared TLB entry has changed. Sending an IPI to each core 
>> is horribly inefficient; especially with the core counts increasing 
>> and the frequency of TLB unmapping/remapping also possibly increasing 
>> shortly with new server class memory extension technology.
>
>IIUC, the best you can get from hardware is an IPI behind the scenes.
>I doubt it's worth the effort.
>

Yes, that's why this discussion topic is about hardware enhancements for TLB shootdown. The existing cache coherence protocol over QPI/UPI could be extended to carry TLB invalidation(s), along with new TLBINV instructions that take a CPU/core mask and optionally a VA range. Consider how cool a new MOV CR3,EAX,EBX instruction would be, where EAX is the page directory pointer and EBX is a CPU/core mask selecting which cores to broadcast to. The instruction would block until all selected cores have completed the CR3 update, which also invalidates their TLBs. Compare this to sequentially sending an IPI to each core, waiting for each core to context switch and execute a TLB invalidate, then return from the interrupt and signal the originator.
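To make the comparison concrete, here is a rough before/after sketch. The "tlbinvb" mnemonic, its operands, and the helper names are entirely hypothetical; on_each_cpu_mask() is the only real kernel interface used, and the per-core flush body is elided.

#include <linux/smp.h>

struct flush_info { unsigned long start, end; };

/* Today's model (simplified): one IPI per target core; each core takes
 * the interrupt, flushes its own TLB for the range, and acknowledges. */
static void flush_one_cpu(void *info)
{
	/* per-core TLB flush for ((struct flush_info *)info)->start..end
	 * would go here; details elided */
}

static void shootdown_via_ipi(const struct cpumask *cpus,
			      unsigned long start, unsigned long end)
{
	struct flush_info f = { .start = start, .end = end };

	on_each_cpu_mask(cpus, flush_one_cpu, &f, true);	/* wait for acks */
}

/* Proposed model: a single instruction broadcasts the invalidation over
 * QPI/UPI to the cores selected by a mask and blocks until they have all
 * completed it -- no IPIs, no per-core interrupt handling.  "tlbinvb"
 * does not exist on any current CPU; it only illustrates the idea. */
static void shootdown_via_broadcast(u64 core_mask,
				    unsigned long start, unsigned long end)
{
	asm volatile("tlbinvb %0, %1, %2"
		     : : "r"(core_mask), "r"(start), "r"(end) : "memory");
}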

Adrian Michaud

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


