Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 13 Jan 2025 16:44:26 -0500

On Mon, 2025-01-13 at 16:27 +0100, David Hildenbrand wrote:
> On 10.01.25 23:00, Shakeel Butt wrote:
> > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> > > On 10.01.25 21:28, Jeff Layton wrote:
> > > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > > > good topic for LSF/MM.
> > > > > > > > > 
> > > > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > > > 
> > > > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > > > problem is either that
> > > > > > > > > 
> > > > > > > > >      - the page is skipped, leaving the physical memory block unmovable
> > > > > > > > > 
> > > > > > > > >      - the compaction is blocked for an unbounded time
> > > > > > > > > 
> > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > > > compaction, right?
> > > > > > > 
> > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > > > by a trusted source.
> > > > > > > 
> > > > > > > It's a violation of core-mm principles.
> > > > > > 
> > > > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > > > are violating it today and will keep violating it in future. Any
> > > > > > page/folio that is locked, under writeback, has a reference taken, or
> > > > > > has been isolated from its LRU is unmovable (most of the time for small
> > > > > > period of time).
> > > > > 
> > > > > ^ this: "small period of time" is what I meant.
> > > > > 
> > > > > Most of these things are known to not be problematic: retrying a couple
> > > > > of times makes it work, that's why migration keeps retrying.
> > > > > 
> > > > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > > > long-term page pinning. I think there were concerns at some point if
> > > > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > > > it was not a problem in practice that would make CMA allocations easily
> > > > > fail.
> > > > > 
> > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > > > right now]
> > > > > 
> > > > > > These operations are being done all over the place in kernel.
> > > > > > Miklos gave an example of readahead.
> > > > > 
> > > > > I assume you mean "unmovable for a short time", correct, or can you
> > > > > point me at that specific example; I think I missed that.
> > 
> > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@xxxxxxxxxxxxxx/
> > 
> > > > > 
> > > > > > The per-CPU LRU caches are another
> > > > > > case where folios can get stuck for a long period of time.
> > > > > 
> > > > > Which is why memory offlining disables the lru cache. See
> > > > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > > > all cpus.
> > > > > 
> > > > > > Reclaim and
> > > > > > compaction can isolate a lot of folios that they need to have
> > > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > > > impractical.
> > > > > 
> > > > > "must only be short-term unmovable", better?
> > 
> > Yes, and you have clarified the actual amount further below.
> > 
> > > > > 
> > > > 
> > > > Still a little ambiguous.
> > > > 
> > > > How short is "short-term"? Are we talking milliseconds or minutes?
> > > 
> > > Usually a couple of seconds, max. For memory offlining, slightly longer
> > > times are acceptable; other things (in particular compaction or CMA
> > > allocations) will give up much faster.
> > > 
> > > > 
> > > > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > > > servers might give us a better guarantee of forward-progress, but it
> > > > would probably have to be on the order of at least a minute or so to be
> > > > workable.
> > > 
> > > Yes, and that might already be a bit too much, especially if stuck on
> > > waiting for folio writeback ... so ideally we could find a way to migrate
> > > these folios that are under writeback and it's not your ordinary disk driver
> > > that responds rather quickly.
> > > 
> > > Right now we do it via these temp pages, and I can see how that's
> > > undesirable.
> > > 
> > > For NFS etc. we probably never ran into this, because it's all used in
> > > fairly well managed environments and, well, I assume NFS easily outdates CMA
> > > and ZONE_MOVABLE :)
> > > 
> > > > > > > 
> > > > > > The point is that, yes we should aim to improve things but in iterations
> > > > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > > > in one step.
> > > > > 
> > > > > I agree with the "improve things in iterations", but as
> > > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > > > are making things worse.
> > 
> > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
> > causing confusion. It is a simple flag to avoid deadlock in the reclaim
> > code path and does not say anything about movability.
> > 
> > > > > 
> > > > > And as this discussion has been going on for too long, to summarize my
> > > > > point: there exist conditions where pages are short-term unmovable, and
> > > > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > > > vmsplice); that does not mean that we can freely add new conditions that
> > > > > turn movable pages unmovable long-term or even forever.
> > > > > 
> > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > > > would suggest a topic around which things are known to cause pages to be
> > > > > short-term or long-term unmovable/unsplittable, and which can be
> > > > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > > > 
> > > > 
> > > > 
> > > > This does sound like great LSF/MM fodder! I predict that this session
> > > > will run long! ;)
> > > 
> > > Heh, fully agreed! :)
> > 
> > I would like a more targeted topic and for that I want us to at least
> > agree where we are disagreeing. Let me write down two statements and
> > please tell me where you disagree:
> 
> I think we're mostly in agreement!
> 
> > 
> > 1. For a normal running FUSE server (without tmp pages), the lifetime of
> > writeback state of fuse folios falls under "short-term unmovable" bucket
> > as it does not differ in any way from any other filesystems handling
> > writeback folios.
> 
> That's the expectation, yes. As long as the FUSE server is able to make 
> progress, the expectation is that it's just like NFS etc. If it isn't 
> able to make progress (i.e., crash), the expectation is that everything 
> will get cleaned up either way.
> 
> I wonder if there could be a valid scenario where the FUSE server is no
> longer able to make progress (ignoring network outages), or the progress 
> might start being extremely slow such that it becomes a problem. In 
> contrast to in-kernel FSs, one can do some fancy stuff with fuse where 
> writing a page could possibly consume a lot of memory in user-space. 
> Likely, in this case we might just blame it on the admin that agreed to 
> running this (trusted) fuse server.
> 
> > 
> > 2. For a buggy or untrusted FUSE server (without tmp pages), the
> > lifetime of writeback state of fuse folios can be arbitrarily long and
> > we need some mechanism to limit it.
> 
> Yes.
> 
> 
> Especially in 1), we really want to wait for writeback to finish, just 
> like for any other filesystem. For 2), we want a way so writeback will 
> not get stuck for a long time, but are able to make progress and migrate 
> these pages.
> 

What if we were to allow the kernel to kill off an unprivileged FUSE
server that was "misbehaving" [1], clean any dirty pagecache pages that
it has, and set writeback errors on the corresponding FUSE inodes [2]?
We'd still need a rather long timeout (on the order of at least a
minute or so, by default).

Would that be enough to assuage concerns about unprivileged servers
pinning pages indefinitely? Buggy servers are still a problem, but
there's not much we can do about that.

There are a lot of details we'd have to sort out, so I'm also
interested in whether anyone (Miklos? Bernd?) would find this basic
approach objectionable.

[1]: for some definition of misbehavior. Probably a writeback
timeout of some sort but maybe there would be other criteria too.

[2]: or maybe just make them eligible to be cleaned without talking to
the server, should the VM wish it.
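
To make the idea above a bit more concrete, here is a rough userspace
model in C of the timeout policy. It is only a sketch under assumptions:
every name in it (struct wb_request, struct fuse_server_model,
server_misbehaving(), enforce_wb_timeout(), WB_TIMEOUT_MS_DEFAULT) is
made up for illustration and does not correspond to existing kernel or
FUSE code, and the only "misbehavior" criterion modeled is a writeback
request outstanding for longer than the timeout.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default, matching "at least a minute or so". */
#define WB_TIMEOUT_MS_DEFAULT (60u * 1000u)

struct wb_request {             /* hypothetical: one folio under writeback */
	uint64_t start_ms;      /* when the request was sent to the server */
	bool     completed;     /* server replied, or kernel gave up on it */
};

struct fuse_server_model {      /* hypothetical per-connection state */
	bool     privileged;    /* trusted servers are not timed out here */
	uint32_t wb_timeout_ms;
	bool     aborted;       /* server was killed off */
	int      wb_err;        /* error to report on the affected inodes */
};

/* True if an unprivileged server has a request older than the timeout. */
static bool server_misbehaving(const struct fuse_server_model *srv,
			       const struct wb_request *reqs, size_t n,
			       uint64_t now_ms)
{
	if (srv->privileged)
		return false;

	for (size_t i = 0; i < n; i++) {
		assert(now_ms >= reqs[i].start_ms);
		if (!reqs[i].completed &&
		    now_ms - reqs[i].start_ms > srv->wb_timeout_ms)
			return true;
	}
	return false;
}

/*
 * If the server is misbehaving: mark it aborted, record a writeback
 * error, and end writeback on everything still outstanding so the
 * pages become movable again. Returns the number of requests ended.
 */
static size_t enforce_wb_timeout(struct fuse_server_model *srv,
				 struct wb_request *reqs, size_t n,
				 uint64_t now_ms)
{
	size_t ended = 0;

	if (!server_misbehaving(srv, reqs, n, now_ms))
		return 0;

	srv->aborted = true;
	srv->wb_err = -EIO;
	for (size_t i = 0; i < n; i++) {
		if (!reqs[i].completed) {
			reqs[i].completed = true;
			ended++;
		}
	}
	return ended;
}
```

In this model a privileged server is never timed out, which mirrors the
"blame it on the admin" position for trusted servers quoted above;
whether that is the right split is one of the details to sort out.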
-- 
Jeff Layton <jlayton@xxxxxxxxxx>