Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

David Hildenbrand <david@xxxxxxxxxx> · Mon, 13 Jan 2025 16:27:34 +0100

On 10.01.25 23:00, Shakeel Butt wrote:
On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
On 10.01.25 21:28, Jeff Layton wrote:
On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
On 07.01.25 19:07, Shakeel Butt wrote:
On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
On 06.01.25 19:17, Shakeel Butt wrote:
On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@xxxxxxxxxx> wrote:
In any case, having movable pages be turned unmovable due to persistent
writeback is something that must be fixed, not worked around. Likely a
good topic for LSF/MM.

Yes, this seems a good cross fs-mm topic.

So the issue discussed here is that movable pages used for fuse
page-cache cause problems when memory needs to be compacted. The
problem is either that

     - the page is skipped, leaving the physical memory block unmovable

     - the compaction is blocked for an unbounded time

While the new AS_WRITEBACK_INDETERMINATE could potentially make things
worse, the same thing happens on readahead, since the new page can be
locked for an indeterminate amount of time, which can also block
compaction, right?

Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
unmovable pages ever*. Not triggered by an untrusted source, not triggered
by a trusted source.

It's a violation of core-mm principles.

The "must not be unmovable pages ever" is a very strong statement and we
are violating it today and will keep violating it in future. Any
page/folio that is locked, under writeback, has a reference taken, or has
been isolated from its LRU is unmovable (most of the time for a small
period of time).

^ this: "small period of time" is what I meant.

Most of these things are known to not be problematic: retrying a couple
of times makes it work, that's why migration keeps retrying.
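
The retry argument above can be illustrated with a small user-space model. This is only a sketch: the names model_folio, model_try_migrate() and model_migrate_with_retry() are hypothetical and are not kernel APIs, and the retry limit is a caller-chosen parameter rather than the value the kernel uses. It shows why a short-term busy folio is eventually migrated while a long-term unmovable one defeats any bounded number of retries.

```c
#include <stdbool.h>

/* Hypothetical model of a folio, not a kernel structure. */
struct model_folio {
    /* Attempts that will still fail; a negative value means "never frees up". */
    int busy_attempts_left;
};

/* One migration attempt: fails while the folio is still busy. */
static bool model_try_migrate(struct model_folio *f)
{
    if (f->busy_attempts_left < 0)
        return false;             /* long-term unmovable: never succeeds */
    if (f->busy_attempts_left > 0) {
        f->busy_attempts_left--;  /* transient condition is closer to clearing */
        return false;
    }
    return true;                  /* lock/writeback/reference is gone */
}

/*
 * Retry up to max_retries times. Returns the number of attempts used on
 * success, or -1 if the caller has to give up.
 */
static int model_migrate_with_retry(struct model_folio *f, int max_retries)
{
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        if (model_try_migrate(f))
            return attempt;
    }
    return -1;
}
```

In this model a folio that is busy for two attempts migrates on the third, whereas a folio that never becomes free makes every caller give up regardless of the retry budget, which is the short-term versus long-term distinction being discussed.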

Again, as an example, we allow short-term O_DIRECT but disallow
long-term page pinning. I think there were concerns at some point if
O_DIRECT might also be problematic (I/O might take a while), but so far
it was not a problem in practice that would make CMA allocations easily
fail.

vmsplice() is a known problem, because it behaves like O_DIRECT but
actually triggers long-term pinning; IIRC David Howells has this on his
todo list to fix. [I recall that seccomp disallows vmsplice by default
right now]

These operations are being done all over the place in kernel.
Miklos gave an example of readahead.

I assume you mean "unmovable for a short time", correct, or can you
point me at that specific example; I think I missed that.

Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@xxxxxxxxxxxxxx/

The per-CPU LRU caches are another
case where folios can get stuck for a long period of time.

Which is why memory offlining disables the lru cache. See
lru_cache_disable(). Other users that care about that drain the LRU on
all cpus.

Reclaim and
compaction can isolate so many folios that they need
too_many_isolated() checks. So, "must not be unmovable pages ever" is
impractical.

"must only be short-term unmovable", better?

Yes, and you have clarified the actual amount further below.

Still a little ambiguous.

How short is "short-term"? Are we talking milliseconds or minutes?

Usually a couple of seconds, max. For memory offlining, slightly longer
times are acceptable; other things (in particular compaction or CMA
allocations) will give up much faster.

Imposing a hard timeout on writeback requests to unprivileged FUSE
servers might give us a better guarantee of forward-progress, but it
would probably have to be on the order of at least a minute or so to be
workable.

Yes, and that might already be a bit too much, especially if stuck
waiting for folio writeback ... so ideally we could find a way to migrate
these folios while they are under writeback when the backend is not your
ordinary disk driver that responds rather quickly.

Right now we do it via these temp pages, and I can see how that's
undesirable.

For NFS etc. we probably never ran into this, because it's all used in
fairly well managed environments and, well, I assume NFS easily predates CMA
and ZONE_MOVABLE :)

The point is that, yes we should aim to improve things but in iterations
and "must not be unmovable pages ever" is not something we can achieve
in one step.

I agree with the "improve things in iterations", but as
AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
are making things worse.

AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
causing confusion. It is a simple flag to avoid deadlock in the reclaim
code path and does not say anything about movability.

And as this discussion has been going on for too long, to summarize my
point: there exist conditions where pages are short-term unmovable, and
possibly some to be fixed that turn pages long-term unmovable (e.g.,
vmsplice); that does not mean that we can freely add new conditions that
turn movable pages unmovable long-term or even forever.

Again, this might be a good LSF/MM topic. If I had the capacity, I
would suggest a topic around which things are known to cause pages to be
short-term or long-term unmovable/unsplittable, and which can be
handled and which cannot. Maybe I'll find the time to propose that as a topic.

This does sound like great LSF/MM fodder! I predict that this session
will run long! ;)

Heh, fully agreed! :)

I would like a more targeted topic, and for that I want us to at least
agree on where we are disagreeing. Let me write down two statements and
please tell me where you disagree:

I think we're mostly in agreement!

1. For a normally running FUSE server (without tmp pages), the lifetime of
the writeback state of fuse folios falls under the "short-term unmovable"
bucket, as it does not differ in any way from how any other filesystem
handles writeback folios.

That's the expectation, yes. As long as the FUSE server is able to make
progress, the expectation is that it's just like NFS etc. If it isn't
able to make progress (e.g., it crashed), the expectation is that
everything will get cleaned up either way.

I wonder if there could be a valid scenario where the FUSE server is no
longer able to make progress (ignoring network outages), or where progress
becomes so slow that it turns into a problem. In
contrast to in-kernel FSs, one can do some fancy stuff with fuse where
writing a page could possibly consume a lot of memory in user-space.
Likely, in this case we might just blame it on the admin that agreed to
running this (trusted) fuse server.

2. For a buggy or untrusted FUSE server (without tmp pages), the
lifetime of writeback state of fuse folios can be arbitrarily long and
we need some mechanism to limit it.

Yes.

Especially in 1), we really want to wait for writeback to finish, just
like for any other filesystem. For 2), we want a way to keep writeback
from getting stuck for a long time, so that we are able to make progress
and migrate these pages.

--
Cheers,

David / dhildenb