On 20.12.24 19:01, Shakeel Butt wrote:
On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
I'm wondering if there would be a way to just "cancel" the writeback and
mark the folio dirty again. That way it could be migrated, but not
reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
thing.
That is what I basically meant with short timeouts. Obviously it is not
that simple to cancel the request and to retry - it would add in quite
some complexity, if all the issues that arise can be solved at all.
At least it would keep that out of core-mm.
AS_WRITEBACK_INDETERMINATE really has a weird smell to it ... we should try to
improve such scenarios, not acknowledge and integrate them, then work around
them using timeouts that must be manually configured, and can likely not be
enabled by default because it could hurt reasonable use cases :(
Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
parts. First is reclaim and second is compaction/migration. For reclaim,
it is a must-have as explained by Jingbo in [1], i.e., due to a potential
self-deadlock by the fuse server. If I understand you correctly, the main
concern you have is its usage in the second case.
Yes, so I can see fuse:
(1) Breaking memory reclaim (memory cannot get freed up)
(2) Breaking page migration (memory cannot be migrated)
Due to (1) we might experience bigger memory pressure in the system I
guess. A handful of these pages don't really hurt, but I have no idea how
bad having many of these pages can be. But yes, inherently we cannot
throw away the data as long as it is dirty without causing harm. (maybe
we could move it to some other cache, like swap/zswap; but that smells
like a big and complicated project)
Due to (2) we turn pages that are supposed to be movable into unmovable
ones, possibly for a long time. Even a *single* such page will mean that CMA
allocations / memory unplug can start failing.
We have similar situations with page pinning. With things like O_DIRECT,
our assumption/experience so far is that it will only take a couple of
seconds max, and retry loops are sufficient to handle it. That's why
only long-term pinning ("indeterminate", e.g., vfio) migrates these pages
out of ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
The biggest concern I have is that timeouts, while likely reasonable in
many scenarios, might not be desirable even for some sane workloads, and
the default in all systems will be "no timeout", letting the clueless
admin of each and every system out there that might support fuse to make
a decision.
I might have misunderstood something, in which case I am very sorry, but
we also don't want CMA allocations to start failing simply because a
network connection is down for a couple of minutes such that a fuse
daemon cannot make progress.
The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
to avoid an untrusted fuse server causing pain to unrelated jobs on the
machine (fuse folks please correct me if I am wrong here). Now we are
discussing how to better handle that scenario.
I just wanted to point out that irrespective of that discussion, the
reclaim will have to handle the potential recursive deadlock and thus will
be using AS_WRITEBACK_INDETERMINATE or something similar.
Yes, I see no way to throw away dirty data without causing harm.
Migration was kept working for now, although in a hacky fashion I admit.
I do enjoy that "writeback" on the folio actually matches the reality now.
I guess an alternative to "aborting writeback" would be to make fuse
allow for migrating folios that are under writeback. I would assume that
with fuse we have very good control over who is currently
reading/writing that folio, and we could swap it out? Again, just an
idea ...
--
Cheers,
David / dhildenb