On 25.11.21 03:45, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance but with low cost. We have a bunch of such
> applications running on our production and examples include caching and
> data storage services. These applications keep their hot data on the
> THPs for better performance and release the cold data through
> MADV_DONTNEED to keep the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This complicates the memory management of these sophisticated
> applications which then needs to look into low level kernel handling of
> THPs to better gauge their headroom for expansion.
>
> More specifically these applications monitor their cgroup usage to decide
> if they can expand the memory footprint or release some (unneeded/cold)
> buffer. They uses madvise(MADV_DONTNEED) to release the memory which
> basically puts the THP into defer list. These deferred THPs are still
> charged to the cgroup which leads to bloated usage read by the application
> and making wrong decisions. In addition these applications are very
> latency sensitive and would prefer to not face memory reclaim due to
> non-deterministic nature of reclaim.
>
> Internally we added a cgroup interface to trigger the split of deferred
> THPs for that cgroup but this is hacky and exposing kernel internals to
> users. This patch solves this problem in a more general way for the users
> by splitting the THPS synchronously on MADV_DONTNEED. This patch does
> the same for munmap() too.
>

I'll have to defer diving into the code. Just a comment:

It might be good to add that there are still cases where splitting the
compound page can fail -- for example, if the page is still
pinned/referenced. So if you have a THP and only pin/reference, e.g.,
the first 4k of it (e.g., O_DIRECT, io_uring fixed buffers), a
MADV_DONTNEED/munmap of, e.g., the last 4k of it will not split the THP
synchronously.
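
To make the scenario concrete, here is a minimal userspace sketch of the
"MADV_DONTNEED on the last 4k of a THP" case. It is only an illustration
under assumptions: drop_thp_tail() is a hypothetical helper name, a 2 MiB
PMD size is assumed, and MADV_HUGEPAGE is only a hint that the kernel may
ignore. It does not take the pin itself (O_DIRECT, io_uring fixed
buffers); with such a pin on the first 4k, the madvise() call would
presumably still succeed and the tail would still read back as zeroes,
only the compound page could not be split and freed at that point.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL << 20) /* assumption: 2 MiB PMD-sized THP */

/*
 * Hypothetical helper: populate one THP-aligned range, then release only
 * its last base page with MADV_DONTNEED. Returns 0 if the released page
 * reads back as zero while the first page kept its data, -1 otherwise.
 */
int drop_thp_tail(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 2 * THP_SIZE; /* over-allocate to find an aligned window */
    unsigned char *raw, *thp, *tail;
    int ret = -1;

    if (page <= 0)
        return -1;

    raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;

    /* Round up to the next THP_SIZE boundary inside the mapping. */
    thp = (unsigned char *)(((uintptr_t)raw + THP_SIZE - 1) &
                            ~(uintptr_t)(THP_SIZE - 1));

    /* Hint only; ignore failure (e.g. THP not configured). */
    (void)madvise(thp, THP_SIZE, MADV_HUGEPAGE);

    /* Fault in the whole range so a THP can back it. */
    memset(thp, 0xaa, THP_SIZE);

    /* Release just the last base page of the range. */
    tail = thp + THP_SIZE - (size_t)page;
    if (madvise(tail, (size_t)page, MADV_DONTNEED) == 0 &&
        tail[0] == 0 && thp[0] == 0xaa)
        ret = 0;

    munmap(raw, len);
    return ret;
}
```

Note that userspace sees the same result whether or not the split
succeeded; the difference is only in when the memory is actually freed
and uncharged.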

In addition to explicit user action on a compound page, I remember
there might be other kernel-internal temporary references that could
theoretically block splitting, but maybe most of them are, at least for
now, limited to !compound pages.

--
Thanks,

David / dhildenb