On 23.12.24 23:14, Shakeel Butt wrote:
On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
[...]
Yes, so I can see fuse
(1) Breaking memory reclaim (memory cannot get freed up)
(2) Breaking page migration (memory cannot be migrated)
Due to (1) we might experience bigger memory pressure in the system I guess.
A handful of these pages don't really hurt, but I have no idea how bad having
many of these pages can be. But yes, inherently we cannot throw away the
data as long as it is dirty without causing harm. (maybe we could move it to
some other cache, like swap/zswap; but that smells like a big and
complicated project)
Due to (2) we turn pages that are supposed to be movable into unmovable ones,
possibly for a long time. Even a *single* such page will mean that CMA
allocations / memory unplug can start failing.
We have similar situations with page pinning. With things like O_DIRECT, our
assumption/experience so far is that it will only take a couple of seconds
max, and retry loops are sufficient to handle it. That's why only long-term
pinning ("indeterminate", e.g., vfio) migrates these pages out of
ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
The biggest concern I have is that timeouts, while likely reasonable in many
scenarios, might not be desirable even for some sane workloads, and the
default in all systems will be "no timeout", letting the clueless admin of
each and every system out there that might support fuse to make a decision.
I might have misunderstood something, in which case I am very sorry, but we
also don't want CMA allocations to start failing simply because a network
connection is down for a couple of minutes such that a fuse daemon cannot
make progress.
I think you have valid concerns but these are not new and not unique to
fuse. Any filesystem with a potential arbitrary stall can have similar
issues. The arbitrary stall can be caused by network issues or some
faulty local storage.
What concerns me more is that this can be triggered by even
unprivileged user space, and that there is no default protection as far
as I understood, because timeouts cannot be set universally to sane
defaults.
Again, please correct me if I got that wrong.
BTW, I just looked at NFS out of interest, in particular
nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
canceling writeback. IIUC, there are default timeouts for UDP and TCP,
whereby the TCP default one seems to be around 60s (* retrans?), and the
privileged user that mounts it can set higher ones. I guess one could
run into similar writeback issues?
So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
Not sure if I grasped all details about NFS and writeback and when it
would redirty+end writeback, and if there is some other handling in there.
Regarding the reclaim, I wouldn't say fuse or similar filesystems are
breaking memory reclaim as the kernel has mechanisms to throttle the
threads dirtying the file memory to reduce the chance of situations
where most of memory becomes unreclaimable due to being dirty.
Yes, likely even cgroups can easily limit the amount.
Please note that such filesystems are mostly used in environments like
data centers or hyperscalers and usually have more advanced mechanisms to
handle and avoid situations like long delays. For such environments,
network unavailability is a larger issue than some cma allocation
failure. My point is: let's not assume the disastrous situation is normal
and overcomplicate the solution.
Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be
used for movable allocations.
Mechanisms that possibly turn these folios unmovable for a
long/indeterminate time must either fail or migrate these folios out of
these regions, otherwise we start violating the very semantics for which
ZONE_MOVABLE/MIGRATE_CMA were added in the first place.
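To make the "fail or migrate" rule concrete, here is a toy user-space model
of the decision. It is only a sketch and not kernel code: Region, Action,
decide() and the can_migrate flag are hypothetical names made up for
illustration, and "indeterminate" simply stands for "may stay unmovable for
a long/indeterminate time".

```python
from enum import Enum, auto


class Region(Enum):
    NORMAL = auto()    # ordinary memory, no movability guarantee
    MOVABLE = auto()   # stands in for ZONE_MOVABLE
    CMA = auto()       # stands in for MIGRATE_CMA


class Action(Enum):
    PROCEED = auto()        # leave the folio where it is
    MIGRATE_FIRST = auto()  # move it out of the region, then proceed
    FAIL = auto()           # refuse the operation


def decide(region, indeterminate, can_migrate):
    """Toy policy: movable-only regions never host indeterminately
    unmovable folios (hypothetical helper, for illustration only)."""
    if not indeterminate:
        # Short-term unmovability: retry loops are assumed to cope.
        return Action.PROCEED
    if region is Region.NORMAL:
        # No movability promise was made for this memory.
        return Action.PROCEED
    # MOVABLE or CMA: either get the folio out first, or fail.
    return Action.MIGRATE_FIRST if can_migrate else Action.FAIL
```

In this model an indeterminate operation on a folio in a movable-only region
never returns PROCEED, no matter what user space promises about duration.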
Yes, there are corner cases where we cannot guarantee movability (e.g.,
OOM when allocating a migration destination), but these are not cases
that can be triggered by (unprivileged) user space easily.
That's why FOLL_LONGTERM pinning does exactly that: even if user space
would promise that this is really only "short-term", we will treat it as
"possibly forever", because it's under user-space control.
Instead of having more subsystems violate these semantics because
"performance" ... I would hope we would do better. Maybe it's an issue
for NFS as well ("at least" only for privileged user space)? In which
case, again, I would hope we would do better.
Anyhow, I'm hoping there will be more feedback from other MM folks, but
likely right now a lot of people are out (just like I should ;) ).
If I end up being the only one with these concerns, then likely people
can feel free to ignore them. ;)
--
Cheers,
David / dhildenb