Hello,

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> essentially fragmented them. I guess khugepaged went through and
> started trying to reassemble the huge pages and as a result there
> have been apps that ended up consuming more memory than they would
> have otherwise since they were using fragments of THP pages after
> doing an MADV_DONTNEED on sections of the page.

With relatively recent kernels MADV_DONTNEED doesn't necessarily free
anything when it's applied to a THP subpage: it only splits the
pagetables and queues the THP for deferred splitting. If there's
memory pressure, a shrinker is invoked, the queue is scanned and the
THPs are physically split, but to be reassembled/collapsed after a
physical split at least one young pte is required.

If this is particularly problematic for page hinting, this behavior
(where the MADV_DONTNEED can be undone by khugepaged if some subpage
is being frequently accessed) can be turned off by setting
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to 0.
Then the THP will only be collapsed if all 512 subpages are mapped
(i.e. they've all been re-allocated by the guest).

Regardless of the max_ptes_none default, keeping the smaller guest
buddy orders as the last target for page hinting should be good for
performance.

> Yeah, no problem. The only thing I don't like about MADV_FREE is
> that you have to have memory pressure before the pages really start
> getting scrubbed, which is both a benefit and a drawback. Basically
> it defers the freeing until you are under actual memory pressure so
> when you hit that case things start feeling much slower, that and it
> limits your allocations since the kernel doesn't recognize the pages
> as free until it would have to start trying to push memory to swap.

The guest allocation behavior should not be influenced by MADV_FREE
vs MADV_DONTNEED; the guest can't see the difference anyway, so why
should it limit the allocations?

The benefit of MADV_FREE should be that when the same guest frees and
reallocates a huge amount of RAM (i.e. a guest app allocating and
freeing lots of RAM in a loop, not so uncommon), there will be no KVM
page fault during guest re-allocations. So in the absence of memory
pressure in the host it should be a major win. Overall it sounds like
a good tradeoff compared to MADV_DONTNEED, which forcefully invokes
MMU notifiers and forces host allocations and KVM page faults in
order to reallocate the same RAM in the same guest.

When there's memory pressure it's up to the host Linux VM to notice
there's plenty of MADV_FREE material to free at zero I/O cost before
starting swapping.

Thanks,
Andrea
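
---

For illustration, a minimal userspace sketch of the THP-subpage case
discussed above. It assumes a 2MiB THP size with 4KiB base pages; the
alignment handling is simplified and whether the range actually gets
THP-backed depends on the system configuration:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL << 20)	/* assumed THP size: 2MiB */

int main(void)
{
	/*
	 * Over-allocate so we can carve out a 2MiB-aligned region,
	 * which gives the fault path a chance to use a THP.
	 */
	size_t len = 2 * THP_SIZE;
	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	char *thp = (char *)(((unsigned long)map + THP_SIZE - 1) &
			     ~(THP_SIZE - 1));

	madvise(thp, THP_SIZE, MADV_HUGEPAGE);
	memset(thp, 1, THP_SIZE);	/* populate, ideally as one THP */

	/*
	 * Drop one 4KiB subpage. On recent kernels this only splits
	 * the pagetables and queues the THP for deferred splitting;
	 * nothing is freed until the shrinker runs under memory
	 * pressure, and khugepaged may later re-collapse the range
	 * (unless .../khugepaged/max_ptes_none is set to 0 as above).
	 */
	if (madvise(thp, sysconf(_SC_PAGESIZE), MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	return 0;
}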
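And a sketch of the free-and-reallocate pattern where MADV_FREE
should win: as long as the host isn't under memory pressure the pages
stay mapped, so re-touching them just sets the hardware dirty bit
(on x86) without faulting, i.e. no MMU notifier invalidates and no
KVM page faults for a guest. Swapping MADV_FREE for MADV_DONTNEED
below would instead force a fault plus a fresh host allocation on
every iteration. Region size and iteration count are arbitrary:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (64UL << 20)	/* arbitrary 64MiB working set */

int main(void)
{
	char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (int i = 0; i < 100; i++) {
		memset(buf, i, REGION);	/* "allocate": touch it all */

		/*
		 * "free": with MADV_FREE the pages are merely marked
		 * lazily freeable; the host VM reclaims them at zero
		 * I/O cost only if it actually needs the memory.
		 */
		if (madvise(buf, REGION, MADV_FREE))
			perror("madvise(MADV_FREE)");
	}
	return 0;
}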