LSF-MM Volatile Ranges Discussion Plans ======================================= Just wanted to send this out to hopefully prime the discussion at lsf-mm tomorrow (should the schedule hold). Much of it is background material we won't have time to cover. First of all, this is my (John's) perspective here, Minchan may disagree with me on specifics here, but I think it covers the desired behavior fairly well, and I've tried to call out the places where we currently don't yet agree. Volatile Ranges: ---------------- Idea is from Android's ashmem feature (originally by Robert Love), which allows for unpinned ranges. I've been told other OSes support similar functionality (VM_FLAGS_PURGABLE and MEM_RESET/MEM_RESET_UNDO). Been slow going last 6-mo on my part, due to lots of adorable SIGBABY interruptions & other work. Concept in general: ------------------- Applications marks memory as volatile, allowing kernel to purge that memory if and when its needed. Applications can mark memory as non-volatile, and kernel will return a value to notify them if memory was purged while it was volatile. Use cases: ---------- Allows for eviction of userspace cache by the kernel, which is nice as applications don't have to tinker with optimizing cache sizes, as the kernel which has the global view will optimize it for them. Marking obscured bitmaps of rendered image data volatile. Ie: Keep compressed jpeg around, but mark volatile off-screen rendered bitmaps. Marking non-visible web-browser tabs as volatile. Lazy freeing of heap in malloc/free implementations. Parallel ways of thinking about it: ----------------------------------- Also similar to MADV_DONTNEED, but eviction is needs based, not instantaneous. Also applications can cancel eviction if it hasn't happened (by setting non-volatile). So sort of delayed and cancel-able MADV_DONTNEED. Can consider it like swapping some pages to /dev/null ? Rik's MADV_FREE was vary similar, but with implicit NON_VOLATILE marking on page-dirtying. Two basic usage-modes: ---------------------- 1) Application explicitly unmarks memory as volatile whenever it uses it, never touching memory marked volatile. If memory is purged, applications is notified when it marks the area as non-volatile. 2) Applications may access memory marked volatile, but should it access memory that was purged, it will receive SIGBUS On SIGBUS, application has to mark needed range as non-volatile, regenerate or re-fetch the data, and then can continue. This is a little more optimistic, but applications need to be able to handle getting a SIGBUS and fixing things up. This second optimistic method is desired by Mozilla folks. Important Goals: ---------------- Applications using this likely to mark and unmark ranges frequently (ideally only marking the data they immediately need as nonvolatile). This makes it necessary for these operations to be cheap, since applications won't volunteer their currently unused memory to the kernel if it adds dramatic overhead. Although this concerned is lessened with the optimistic/SIGBUS usage-mode. Overall, we try to push costs from the mark/unmark paths to the page eviction side. Two basic types of volatile memory: ----------------------------------- 1) File based memory 2) Anonymous memory Volatile ranges on file memory: ------------------------------- This allows for using volatile ranges on shared memory between processes. Very similar to ashmem's unpinned pages. One example: Two processes can create a large circular buffer, where any unused memory in that buffer is volatile. Producer marks memory as non-volatile, writes to it. The consumer would read the data, then mark it volatile. An important distinction here is that the volatility is shared, in the same way the file's data is shared. Its a property of the file's pages, not a property of the process that marked the range as volatile. Thus one application can mark file data as volatile, and the pages could be purged from all applications mapping that data. And a different application could mark it as non-volatile, and that would keep it from being purged from all applications. For this reason, the volatility is likely best to be stored on address_space (or otherwise connected to the address_space/inode). Another important semantic: Volatility is cleared when all fd's to a file are closed. There's no really good way for volatility to persist when no one is using a file. It could cause confusion if an application died leaving some file data volatile, and then had that data disappear as it was starting up again. No volatility across reboots! [TBD]: For the most-part, volatile ranges really only makes sense to me on tmpfs files. Mostly due to semantics of purging data on files is similar to hole punching, and I suspect having the resulting hole punched pushed out to disk would cause additional io and load. Partial range purging could have strange effects on resulting file. [TBD]: Minchan disagrees and thinks fadvise(DONTNEED) has problems, as it causes immediate writeout when there's plenty of free memory (possibly unnecessary). Although we may defer so long that the hole is never punched, which may be problematic. Volatile ranges on anonymous/process memory: -------------------------------------------- For anonymous memory, its mostly un-shared between processes (except copy-on-write pages). The only way to address anonymous memory is really relative to the process address space (its anonymous: there's no named handle to it). Same semantics as described above. Mark region of process memory volatile, or non-volatile. Volatility is a per-proecess (well mm_struct) state. Kernel will only purge a memory page, if *all* the processes that map that page in consider the page volatile. Important semantics: Preserve volatility over a fork, but clear child volatility on exec. So if a process marks a range as volatile then forks. Both the child and parent should see the same range as volatile. On memory pressure, kernel could purge those pages, since all of the processes that map that page consider it volatile. If the child writes to the pages, the COW links are broken, but both ranges ares still volatile, and can be purged until they are marked non-volatile or cleared. Then like mappings and the rest of memory, volatile ranges are cleared on exec. Implementation history: ----------------------- File-focused (John): Interval tree connected to address_space w/ global LRU of unpurged volatile ranges. Used shrinker to trigger purging off the lru. Numa folks complained that shrinker is numa-unaware and would cause purging on nodes not under pressure. File-focused (John): Checking volatility at page eviction time. Caused problems on swap-free systems, since tmpfs pages are anonymous and aren't aged/shrunk off lrus. In order to handle that we moved the pages to a volatile lru list, but that causes volatile/non-volatile operations to be very expensive O(n) for number of pages in the range. Anon-focused (Minchan): Store volatility in VMA. Worked well for anonymous ranges, but was problematic to extend to file ranges as we need volatility state to be connected with the file, not the process. Iterating across and splitting VMAs was somewhat costly. Anon-focused (Minchan): Store anonymous volatility in interval tree off of the mm_struct. Use global LRU of volatile ranges to use when purging ranges via a shrinker. Also hooks into normal eviction to make sure evicted pages are purged instead of swapped out. Very fast, due to quick manipulations to a single interval tree. File pages in ranges are ignored. Both (John): Same as above, but mostly extended so interval tree of ranges can be hung off of the mm_struct OR an address_space. Currently functionality is partitioned so volatile ranges on files and on anonymous memory are created via separate syscalls (fvrange(fd, start, len, ...) vs mvrange(start_addr, len,...)). Roughly merges the original first approach with the previous one. Both (John): Currently working on above, further extending mvrange() so it can also be used to set volatility on MAP_SHARED file mappings in an address space. Has the problem that handling both file and anonymous memory types in a single call requires iterating over vmas, which makes the operation more expensive. [TBD]: Cost impact of mvrange() supporting mapped file pages vs dev confusion of it not supporting file pages Current interfaces: ------------------- Two current interfaces: fvrange(fd, start_off, length, mode, flags, &purged) mvrange(start_addr, length, mode, flags, &purged) fd/start/length: Hopefully obvious :) mode: VOLATILE: Sets range as volatile. Returns number of bytes marked volatile. NON_VOLATILE: Marks range as non-volatile. Returns number of bytes marked non-volatile, sets purged value to 1 if any memory in the bytes marked non-volatile were purged. flags: VRANGE_FULL: On eviction, the entire range specified will be purged VRANGE_PARTIAL: On eviction, we may purge only part of the specified range. In earlier discussions, it was deemed that if any page in a volatile range was purged, we might as well purge the entire range, since if we mark any portion of that range as non-volatile, the application would have to regenerate the entire range. Thus we might as well reduce memory pressure by puring the entire range. However, with the SIGBUS semantics, applications may be able to continue accessing pages in a volatile range where one unused page is purged, so we may want to avoid purging the entire range to allow for optimistic continued use. Additionally partial purging is helpful so that we don't over-react when we have slight memory pressure. An example, if we have a 64M vrange, and the kernel only needs 8M, its much cheaper to free 8M now and then later when the range is marked non-volatile, re-allocate only 8M (fault + allocation + zero-clearing) instead of the entire 64M. [TBD]: May consider merging flags w/ mode: ie: VOLATILE_FULL, VOLATILE_PARTIAL, NON_VOLATILE [TBD]: Might be able to simplify and go with VRANGE_PARTIAL all the time? purged: Flag that returns 1 if any pages in the range marked NON_VOLATILE were purged. Is set to zero otherwise. Can be null if mode==VOLATILE. [TBD]: Might consider value passed to it will be |'ed with 1?. [TBD]: Might consider purged to be more of a status bitflag, allowing vrange(VOLATILE) calls to get some meaningful data like if memory pressure is currently going on. Return value: Number of bytes marked VOLATILE or NON_VOLATILE. This is necessary as if we are to deal with setting ranges that cross anonymous and file backed pages, we have to split the operations up into multiple operations against the respective mm_struct or addess_space, and there's a possibility that we could run out of memory mid-way through an operation. If we do run out of memory mid way, we simply return the number of bytes successfully marked, and we can return an error on the next invocation if we hit the ENOMEM right away. [TBD]: If mvrange() doesn't affect mapped file pages, then the return value can be simpler. Current TODOs: -------------- Add proper SIGBUS signaling when accessing purged file ranges. Working on handling mvrange() ranges that cross anonymous and mapped file regions. Handle errors mid-way through operations. Cleanups and better function names. [TBD] Contentious interface issues: ----------------------------------- Does handling mvrange() calls that cross anonymous & file pages increase costs too much for ebizzy workload Minchan likes? Have to take mmap_sem and traverse vmas. Could mvrange() on file pages not be shared in the same way as in fvrange() Sane interface vs Speed? Minchan's idea of mvrange(VOLATILE_FILE|VOLATILE_ANON|VOLATILE_BOTH): Avoid traversing vmas on VOLATILE_ANON flag, regardless of if range covers mapped file pages Not sure we can throw sane errors without checking vmas? Do we really need a new syscall interface? Can we maybe go back to using madvise? Should mvrange be prioritized over fvrange, if mvrange can create volatile ranges on files. Some folks still don't like SIGBUS on accessing a purged volatile page, instead want standard zero-fill fault. Need some way to know page was dropped (zero is a valid data value) After marking non-volatile, it can be zero-fill fault. [TBD] Contentious implementation issues: ---------------------------------------- Still using shrinker for purging, got early complaints from NUMA folks Can make sure we check first page in each range and purge only ranges where some page is in the zone being shrinked? Still use shrinker, but also use normal page shrinking path, but check for volatility. (swapless still needs shrinker) Probably don't want to actually hang vrange interval tree (vrange_root) off of address_space and struct_mm. In earlier attempts I used a hashtable to avoid this http://thread.gmane.org/gmane.linux.kernel/1278541/focus=1278542 I assume this is still a concern? Older non-contentious points: ----------------------------- Coalescing of ranges: Don't do it unless the ranges overlaps Range granular vs page granular purging: Resolved with _FULL/_PARTIAL flags Other ideas/use-cases proposed: ------------------------------- PTurner: Marking deep user-stack-frames as volatile to return that memory? Dmitry Vyukov: 20-80TB allocation, marked volatile right away. Never marking non-volatile. Wants zero-fill and doesn't want SIGBUG https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges Misc: ---- Previous discussion: https://lwn.net/Articles/518130/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>