Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

David Hildenbrand <david@xxxxxxxxxx> · Mon, 30 Dec 2024 11:16:17 +0100

BTW, I just looked at NFS out of interest, in particular
nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
canceling writeback. IIUC, there are default timeouts for UDP and TCP,
whereby the TCP default one seems to be around 60s (* retrans?), and the
privileged user that mounts it can set higher ones. I guess one could run
into similar writeback issues?

Hi,

sorry for the late reply.

Yes, I think so.

So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?

I feel like INDETERMINATE in the name is the main cause of confusion.

We are adding logic that says "unconditionally, never wait on writeback 
for these folios, not even any sync migration". That's the main problem 
I have.

Your explanation below is helpful. Because ...

So, let me explain why it is required (but later I will tell you how it
can be avoided). The FUSE thread which is actively handling writeback of
a given folio can cause memory allocation either through syscall or page
fault. That memory allocation can trigger global reclaim synchronously
and in cgroup-v1, that FUSE thread can wait on the writeback on the same
folio whose writeback it is supposed to end, causing a deadlock. So,
AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.

The in-kernel fs avoid this situation through the use of GFP_NOFS
allocations. The userspace fs can also use a similar approach which is
prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
told that it is hard to use as it is a per-thread flag and has to be set
for all the threads handling writeback, which can be error prone if the
threadpool is dynamic. Second, it is very coarse such that all the
allocations from those threads (e.g. page faults) become NOFS, which
makes userspace very unreliable on a highly utilized machine as NOFS can
not reclaim potentially a lot of memory and can not trigger oom-kill.
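
For reference, a minimal sketch of the per-thread opt-in described above. It assumes Linux 5.6 or later and a caller holding CAP_SYS_RESOURCE; the helper name mark_thread_as_io_flusher() is hypothetical and only for illustration. Each writeback-handling thread would have to call it on itself, which is the difficulty with dynamic threadpools mentioned above.

```c
#include <errno.h>
#include <sys/prctl.h>

/* PR_SET_IO_FLUSHER was added in Linux 5.6; older headers may lack it. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif

/*
 * Hypothetical helper: mark the *calling thread* as an IO flusher.
 * The flag is per-thread, so every thread that handles writeback
 * must call this itself.
 *
 * Returns 0 on success or a negative errno value on failure
 * (e.g. -EPERM without CAP_SYS_RESOURCE, -EINVAL on older kernels).
 */
int mark_thread_as_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

A worker would typically call this once at thread start and refuse to handle writeback if it returns an error.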

... now I understand that we want to prevent a deadlock in one specific 
scenario only?

What sounds plausible for me is:

a) Make this only affect the actual deadlock path: sync migration
   during compaction. Communicate it either using some "context"
   information or with a new MIGRATE_SYNC_COMPACTION.
b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
   that very deadlock problem.
c) Leave all other sync migration users alone for now.
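
The a) to c) proposal can be sketched as a small user-space model of the decision taken when a folio under writeback is encountered during migration. This is not kernel code: the enum and function names are hypothetical, MODEL_MIGRATE_SYNC_COMPACTION stands in for the suggested new mode, and the might_deadlock parameter stands in for the renamed mapping flag. It assumes that today only full sync migration waits on writeback while the lighter modes skip the folio.

```c
#include <stdbool.h>

/*
 * User-space model of the migration modes. The last entry models the
 * MIGRATE_SYNC_COMPACTION mode suggested in (a); it is an assumption,
 * not an existing kernel mode.
 */
enum migrate_mode_model {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
	MODEL_MIGRATE_SYNC_COMPACTION,
};

enum wb_action {
	WB_SKIP_FOLIO,	/* give up on this folio (think -EBUSY) */
	WB_WAIT,	/* wait for writeback to complete */
};

/*
 * might_deadlock models a mapping flag such as the proposed
 * AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM from (b).
 */
enum wb_action writeback_action(enum migrate_mode_model mode,
				bool might_deadlock)
{
	switch (mode) {
	case MODEL_MIGRATE_SYNC:
		/* (c): other sync migration users keep waiting. */
		return WB_WAIT;
	case MODEL_MIGRATE_SYNC_COMPACTION:
		/* (a): only the deadlock-prone path honors the flag. */
		return might_deadlock ? WB_SKIP_FOLIO : WB_WAIT;
	default:
		/* Async and light sync modes do not wait on writeback. */
		return WB_SKIP_FOLIO;
	}
}
```

In this model, the flag changes behavior only for sync migration during compaction; memory offlining or alloc_contig_range() callers using plain sync migration would still wait.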

Would that prevent the deadlock? Even *better* would be to be able to
ask the fs if starting writeback on a specific folio could deadlock.
Because in most cases, as I understand, we'll not actually run into the
deadlock and would just want to wait for writeback to complete
(esp. compaction).

(I still think having folios under writeback for a long time might be a 
problem, but that's indeed something to sort out separately in the 
future, because I suspect NFS has similar issues. We'd want to "wait 
with timeout" and e.g., cancel writeback during memory 
offlining/alloc_cma ...)

Not sure if I grasped all details about NFS and writeback and when it
would redirty+end writeback, and if there is some other handling in there.

[...]

Please note that such filesystems are mostly used in environments like
data centers or hyperscalers and usually have more advanced mechanisms to
handle and avoid situations like long delays. For such environments,
network unavailability is a larger issue than some cma allocation
failure. My point is: let's not assume the disastrous situation is normal
and overcomplicate the solution.

Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
for movable allocations.

Mechanisms that possibly turn these folios unmovable for a
long/indeterminate time must either fail or migrate these folios out of
these regions, otherwise we start violating the very semantics why
ZONE_MOVABLE/MIGRATE_CMA were added in the first place.

Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
when allocating a migration destination), but these are not cases that can
be triggered by (unprivileged) user space easily.

That's why FOLL_LONGTERM pinning does exactly that: even if user space would
promise that this is really only "short-term", we will treat it as "possibly
forever", because it's under user-space control.

Instead of having more subsystems violate these semantics because
"performance" ... I would hope we would do better. Maybe it's an issue for
NFS as well ("at least" only for privileged user space)? In which case,
again, I would hope we would do better.

Anyhow, I'm hoping there will be more feedback from other MM folks, but
likely right now a lot of people are out (just like I should ;) ).

If I end up being the only one with these concerns, then likely people can
feel free to ignore them. ;)

I agree we should do better but IMHO it should be an iterative process.
I think your concerns are valid, so let's push the discussion towards
resolving those concerns. I think the concerns can be resolved by better
handling of lifetime of folios under writeback. The amount of such
folios is already handled through existing dirty throttling mechanism.

We should start with a baseline, i.e. the distribution of lifetime of
folios under writeback for traditional storage devices (spinning disk and
SSDs), as we don't want an unrealistic goal for ourselves. I think this
data will drive the appropriate timeout values (if we decide a
timeout-based approach is the right one).

At the moment we have a timeout-based approach to limit the lifetime of
folios under writeback. Any other ideas?

See above, maybe we could limit the deadlock avoidance to the actual 
deadlock path and sort out the "infinite writeback in some corner cases" 
problem separately.

--
Cheers,

David / dhildenb