Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Fri, 10 Jan 2025 14:00:38 -0800

On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> On 10.01.25 21:28, Jeff Layton wrote:
> > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > good topic for LSF/MM.
> > > > > > > 
> > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > 
> > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > problem is either that
> > > > > > > 
> > > > > > >     - the page is skipped, leaving the physical memory block unmovable
> > > > > > > 
> > > > > > >     - the compaction is blocked for an unbounded time
> > > > > > > 
> > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > compaction, right?
> > > > > 
> > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > by a trusted source.
> > > > > 
> > > > > It's a violation of core-mm principles.
> > > > 
> > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > are violating it today and will keep violating it in the future. Any
> > > > page/folio that is under lock or writeback, has a reference taken, or has
> > > > been isolated from its LRU is unmovable (most of the time for a small
> > > > period of time).
> > > 
> > > ^ this: "small period of time" is what I meant.
> > > 
> > > Most of these things are known to not be problematic: retrying a couple
> > > of times makes it work, that's why migration keeps retrying.
> > > 
> > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > long-term page pinning. I think there were concerns at some point if
> > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > it was not a problem in practice that would make CMA allocations easily
> > > fail.
> > > 
> > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > right now]
> > > 
> > > > These operations are being done all over the place in the kernel.
> > > > Miklos gave an example of readahead.
> > > 
> > > I assume you mean "unmovable for a short time", correct, or can you
> > > point me at that specific example; I think I missed that.

Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@xxxxxxxxxxxxxx/

> > > 
> > > > The per-CPU LRU caches are another
> > > > case where folios can get stuck for a long period of time.
> > > 
> > > Which is why memory offlining disables the lru cache. See
> > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > all cpus.
> > > 
> > > > Reclaim and
> > > > compaction can isolate so many folios that they need
> > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > impractical.
> > > 
> > > "must only be short-term unmovable", better?

Yes, and you have clarified the actual amount further below.

> > > 
> > 
> > Still a little ambiguous.
> > 
> > How short is "short-term"? Are we talking milliseconds or minutes?
> 
> Usually a couple of seconds, max. For memory offlining, slightly longer
> times are acceptable; other things (in particular compaction or CMA
> allocations) will give up much faster.
> 
> > 
> > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > servers might give us a better guarantee of forward-progress, but it
> > would probably have to be on the order of at least a minute or so to be
> > workable.
> 
> Yes, and that might already be a bit too much, especially if stuck on
> waiting for folio writeback ... so ideally we could find a way to migrate
> these folios that are under writeback when the backend is not your ordinary
> disk driver that responds rather quickly.
> 
> Right now we do it via these temp pages, and I can see how that's
> undesirable.
> 
> For NFS etc. we probably never ran into this, because it's all used in
> fairly well managed environments and, well, I assume NFS easily predates CMA
> and ZONE_MOVABLE :)
> 
> > > >
> > > > The point is that, yes we should aim to improve things but in iterations
> > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > in one step.
> > > 
> > > I agree with the "improve things in iterations", but as
> > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > are making things worse.

AS_WRITEBACK_INDETERMINATE is really a bad name that we picked, as it is
still causing confusion. It is a simple flag to avoid a deadlock in the
reclaim code path and does not say anything about movability.

> > > 
> > > And as this discussion has been going on for too long, to summarize my
> > > point: there exist conditions where pages are short-term unmovable, and
> > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > vmsplice); that does not mean that we can freely add new conditions that
> > > turn movable pages unmovable long-term or even forever.
> > > 
> > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > would suggest a topic around which things are known to cause pages to be
> > > short-term or long-term unmovable/unsplittable, and which can be
> > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > 
> > 
> > 
> > This does sound like great LSF/MM fodder! I predict that this session
> > will run long! ;)
> 
> Heh, fully agreed! :)

I would like a more targeted topic, and for that I want us to at least
agree on where we are disagreeing. Let me write down two statements;
please tell me where you disagree:

1. For a normally running FUSE server (without tmp pages), the lifetime of
the writeback state of fuse folios falls under the "short-term unmovable"
bucket, as it does not differ in any way from how any other filesystem
handles writeback folios.

2. For a buggy or untrusted FUSE server (without tmp pages), the
lifetime of the writeback state of fuse folios can be arbitrarily long,
and we need some mechanism to limit it.