On Wed, Nov 24, 2021 at 08:09:42PM +0100, David Hildenbrand wrote:

> That would be giving up on compound pages (hugetlbfs, THP, ...) on any
> current Linux system that does not use ZONE_MOVABLE -- which is not
> something I am not willing to buy into, just like our customers ;)

So we have ZONE_MOVABLE but users won't use it?

Then why is the solution to push the same kinds of restrictions as
ZONE_MOVABLE on to ZONE_NORMAL?

> See my other mail, the upstream version of my reproducer essentially
> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
> at least to me that's an interesting insight :)

Hmm.

To your reproducer, it would be nice if we could cgroup control the #
of page blocks a cgroup has pinned. Focusing on # pages pinned is
clearly the wrong metric. I suggested the whole compound earlier, but
your point about the entire page block being ruined makes sense too.

It means pinned pages will have to be migrated to already ruined page
blocks the cgroup owns, which is a more controlled version of the
FOLL_LONGTERM migration you have been thinking about.

This would effectively limit the fragmentation a hostile process group
can create. If we further treated unmovable cgroup charged kernel
allocations as 'pinned' and routed them to the pinned page blocks, it
starts to look really interesting. Kill the cgroup, get all your THPs
back? Fragmentation cannot extend past the cgroup?

ie there are lots of batch workloads that could be interesting there -
wrap the batch in a cgroup, run it, then kill everything, and since the
cgroup gives some lifetime clustering to the allocator you get a lot
less fragmentation when the batch is finished, so the next batch gets
more THPs, etc.

There is also sort of an interesting optimization opportunity - many
FOLL_LONGTERM users would be happy to spend more time pinning to get
nice contiguous memory ranges. Might help convince people that the
extra pin time for migrations is worthwhile.
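
To make the metric difference concrete, here is a rough userspace
sketch, not kernel code. The 512 pages per pageblock value (2 MiB
blocks with 4 KiB pages) and every name in it are assumptions for
illustration only. It counts how many distinct pageblocks a set of
pinned PFNs touches:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 512 base pages per pageblock (2 MiB with 4 KiB pages). */
#define PAGES_PER_PAGEBLOCK 512UL

/* Three-way compare for qsort() that avoids overflow on subtraction. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: count the distinct pageblocks touched by a set
 * of pinned PFNs. The array is modified in place: each PFN is turned
 * into its pageblock index and the result is sorted.
 */
size_t count_pinned_pageblocks(uint64_t *pfns, size_t n)
{
	size_t i, blocks = 0;

	for (i = 0; i < n; i++)
		pfns[i] /= PAGES_PER_PAGEBLOCK;

	qsort(pfns, n, sizeof(*pfns), cmp_u64);

	/* After sorting, each change of value starts a new pageblock. */
	for (i = 0; i < n; i++)
		if (i == 0 || pfns[i] != pfns[i - 1])
			blocks++;

	return blocks;
}
```

Under these assumptions, four pinned pages at PFNs 0..3 cost one
pageblock, while four pinned pages at PFNs 0, 512, 1024, 1536 cost
four, even though the page count is identical in both cases.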

> > Something like io_ring is registering a bulk amount of memory and then
> > doing some potentially long operations against it.
>
> The individual operations it performs are comparable to O_DIRECT I think

Yes, and O_DIRECT can take 10s of seconds in troubled cases with IO
timeouts and things.

Plus io_uring is worse, as the buffer is potentially shared by many
in-flight ops and you'd have to block new ops on the buffer and flush
all running ops before any mapping change can happen, all while
holding up a mmu notifier.

Not only is it bad for mm subsystem operations, but it would
significantly harm io_uring performance if a migration hits.

So, I really don't like abusing mmu notifiers for stuff like this. I
didn't like it in virtio either :)

Jason