Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

Jason Gunthorpe <jgg@xxxxxxxx> · Wed, 24 Nov 2021 19:11:33 -0400

On Wed, Nov 24, 2021 at 08:09:42PM +0100, David Hildenbrand wrote:

> That would be giving up on compound pages (hugetlbfs, THP, ...) on any
> current Linux system that does not use ZONE_MOVABLE -- which is not
> something I am willing to buy into, just like our customers ;)

So we have ZONE_MOVABLE but users won't use it?

Then why is the solution to push the same kinds of restrictions as
ZONE_MOVABLE on to ZONE_NORMAL?

> See my other mail, the upstream version of my reproducer essentially
> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
> at least to me that's an interesting insight :)

Hmm. To your reproducer it would be nice if we could cgroup control
the # of page blocks a cgroup has pinned. Focusing on # pages pinned
is clearly the wrong metric, I suggested the whole compound earlier,
but your point about the entire page block being ruined makes sense
too.
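
To make the metric difference concrete, here is a rough userspace
sketch (not kernel code). It assumes 4 KiB pages and 2 MiB pageblocks,
so 512 pages per block. PAGES_PER_BLOCK, MAX_BLOCKS and
count_pinned_blocks() are made-up names for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages and 2 MiB pageblocks, i.e. 512 pages per block. */
#define PAGES_PER_BLOCK 512UL
#define MAX_BLOCKS 1024UL /* toy model only covers a small PFN range */

/* Count the distinct pageblocks that contain at least one pinned page. */
static size_t count_pinned_blocks(const unsigned long *pfns, size_t n)
{
	bool seen[MAX_BLOCKS] = { false };
	size_t blocks = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long block = pfns[i] / PAGES_PER_BLOCK;

		assert(block < MAX_BLOCKS);
		if (!seen[block]) {
			seen[block] = true;
			blocks++;
		}
	}
	return blocks;
}
```

Four pins at PFNs 0..3 cost one pageblock, while four pins at PFNs 0,
512, 1024 and 1536 cost four. The page count is the same in both cases,
which is why it says little about the damage done.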

It means pinned pages will have to be migrated to already ruined page
blocks the cgroup owns, which is a more controlled version of the
FOLL_LONGTERM migration you have been thinking about.
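
A toy model of that placement policy might look like the following.
This is only a sketch of the idea: struct pin_cgroup, its fields and
pin_one_page() are hypothetical and do not exist in the kernel, and it
again assumes 512 pages per pageblock:

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 512U  /* assumption: 2 MiB pageblocks, 4 KiB pages */
#define MAX_OWNED_BLOCKS 64U  /* toy capacity of the table below */

/* Hypothetical per-cgroup state; not an existing kernel structure. */
struct pin_cgroup {
	unsigned int block_limit;            /* max pageblocks the cgroup may ruin */
	unsigned int nr_blocks;              /* pageblocks ruined so far */
	unsigned int used[MAX_OWNED_BLOCKS]; /* pinned pages per ruined block */
};

/*
 * Pick the destination pageblock for one page that is about to be pinned.
 * Returns the index of the cgroup-owned block the page should be migrated
 * into, or -1 if the pin would exceed the cgroup's pageblock limit.
 */
static int pin_one_page(struct pin_cgroup *cg)
{
	/* First choice: a block this cgroup has already ruined. */
	for (unsigned int i = 0; i < cg->nr_blocks; i++) {
		if (cg->used[i] < PAGES_PER_BLOCK) {
			cg->used[i]++;
			return (int)i;
		}
	}

	/* Otherwise ruin a fresh block, if the limit allows it. */
	if (cg->nr_blocks >= cg->block_limit || cg->nr_blocks >= MAX_OWNED_BLOCKS)
		return -1;

	cg->used[cg->nr_blocks] = 1;
	return (int)cg->nr_blocks++;
}
```

With block_limit = 2 the first 512 pins land in block 0, the next 512
in block 1, and the pin after that fails instead of ruining a third
block.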

This would effectively limit the fragmentation a hostile process group
can create. If we further treated unmovable cgroup charged kernel
allocations as 'pinned' and routed them to the pinned page blocks it
starts to look really interesting. Kill the cgroup, get all your THPs
back? Fragmentation cannot extend past the cgroup?

i.e. there are lots of batch workloads that could be interesting there -
wrap the batch in a cgroup, run it, then kill everything and since the
cgroup gives some lifetime clustering to the allocator you get a lot
less fragmentation when the batch is finished, so the next batch gets
more THPs, etc.

There is also sort of an interesting optimization opportunity - many
FOLL_LONGTERM users would be happy to spend more time pinning to get
nice contiguous memory ranges. Might help convince people that the
extra pin time for migrations is worthwhile.

> > Something like io_uring is registering a bulk amount of memory and then
> > doing some potentially long operations against it.
> 
> The individual operations it performs are comparable to O_DIRECT I think

Yes, and O_DIRECT can take 10s of seconds in troubled cases with IO
timeouts and things.

Plus io_uring is worse as the buffer is potentially shared by many
in-flight ops and you'd have to block new ops on the buffer and flush
all running ops before any mapping change can happen, all while holding
up a mmu notifier.

Not only is it bad for mm subsystem operations, but it would also
significantly harm io_uring performance if a migration hits.

So, I really don't like abusing mmu notifiers for stuff like this. I
didn't like it in virtio either :)

Jason