On Fri, Dec 27, 2024 at 12:32 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 12/27/24 21:08, Joanne Koong wrote:
> > On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >>
> >> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
> >>> On 23.12.24 23:14, Shakeel Butt wrote:
> >>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> >>>> [...]
> >>>>>
> >>>>> Yes, so I can see fuse
> >>>>>
> >>>>> (1) Breaking memory reclaim (memory cannot get freed up)
> >>>>>
> >>>>> (2) Breaking page migration (memory cannot be migrated)
> >>>>>
> >>>>> Due to (1) we might experience bigger memory pressure in the system I guess.
> >>>>> A handful of these pages don't really hurt, I have no idea how bad having
> >>>>> many of these pages can be. But yes, inherently we cannot throw away the
> >>>>> data as long as it is dirty without causing harm. (maybe we could move it to
> >>>>> some other cache, like swap/zswap; but that smells like a big and
> >>>>> complicated project)
> >>>>>
> >>>>> Due to (2) we turn pages that are supposed to be movable possibly for a long
> >>>>> time unmovable. Even a *single* such page will mean that CMA allocations /
> >>>>> memory unplug can start failing.
> >>>>>
> >>>>> We have similar situations with page pinning. With things like O_DIRECT, our
> >>>>> assumption/experience so far is that it will only take a couple of seconds
> >>>>> max, and retry loops are sufficient to handle it. That's why only long-term
> >>>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of
> >>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> >>>>>
> >>>>>
> >>>>> The biggest concern I have is that timeouts, while likely reasonable in many
> >>>>> scenarios, might not be desirable even for some sane workloads, and the
> >>>>> default in all systems will be "no timeout", letting the clueless admin of
> >>>>> each and every system out there that might support fuse to make a decision.
> >>>>>
> >>>>> I might have misunderstood something, in which case I am very sorry, but we
> >>>>> also don't want CMA allocations to start failing simply because a network
> >>>>> connection is down for a couple of minutes such that a fuse daemon cannot
> >>>>> make progress.
> >>>>>
> >>>>
> >>>> I think you have valid concerns but these are not new and not unique to
> >>>> fuse. Any filesystem with a potential arbitrary stall can have similar
> >>>> issues. The arbitrary stall can be caused due to network issues or some
> >>>> faulty local storage.
> >>>
> >>> What concerns me more is that this can be triggered by even unprivileged
> >>> user space, and that there is no default protection as far as I understood,
> >>> because timeouts cannot be set universally to sane defaults.
> >>>
> >>> Again, please correct me if I got that wrong.
> >>>
> >>
> >> Let's route this question to FUSE folks. More specifically: can an
> >> unprivileged process create a mount point backed by itself, create a
> >> lot of dirty (bound by cgroup) and writeback pages on it and leave the
> >> writeback pages in that state forever?
> >>
> >>>
> >>> BTW, I just looked at NFS out of interest, in particular
> >>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> >>> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> >>> whereby the TCP default one seems to be around 60s (* retrans?), and the
> >>> privileged user that mounts it can set higher ones. I guess one could run
> >>> into similar writeback issues?
> >>
> >> Yes, I think so.
> >>
> >>>
> >>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> >>
> >> I feel like INDETERMINATE in the name is the main cause of confusion.
> >> So, let me explain why it is required (but later I will tell you how it
> >> can be avoided). The FUSE thread which is actively handling writeback of
> >> a given folio can cause memory allocation either through syscall or page
> >> fault. That memory allocation can trigger global reclaim synchronously
> >> and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> >> folio whose writeback it is supposed to end, causing a deadlock. So,
> >> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> >>
> >> The in-kernel fs avoid this situation through the use of GFP_NOFS
> >> allocations. The userspace fs can also use a similar approach which is
> >> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> >> told that it is hard to use as it is a per-thread flag and has to be set
> >> for all the threads handling writeback which can be error prone if the
> >> threadpool is dynamic. Second it is very coarse such that all the
> >> allocations from those threads (e.g. page faults) become NOFS which
> >> makes userspace very unreliable on a highly utilized machine as NOFS can
> >> not reclaim potentially a lot of memory and can not trigger oom-kill.
> >>
> >>> Not
> >>> sure if I grasped all details about NFS and writeback and when it would
> >>> redirty+end writeback, and if there is some other handling in there.
> >>>
> >> [...]
> >>>>
> >>>> Please note that such filesystems are mostly used in environments like
> >>>> data center or hyperscaler and usually have more advanced mechanisms to
> >>>> handle and avoid situations like long delays. For such environment
> >>>> network unavailability is a larger issue than some cma allocation
> >>>> failure. My point is: let's not assume the disastrous situation is normal
> >>>> and overcomplicate the solution.
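As an aside on the prctl(PR_SET_IO_FLUSHER, 1) approach described above, here is a minimal sketch of what setting the flag looks like from a userspace thread. The helper name mark_thread_io_flusher() is hypothetical and not taken from libfuse or any other library. The sketch assumes Linux 5.6 or later; the call needs CAP_SYS_RESOURCE, so an unprivileged caller should expect it to fail.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57	/* value from linux/prctl.h, for older headers */
#endif

/*
 * Hypothetical helper: mark the *calling thread* as an IO flusher.
 * The flag is per-thread, so every thread that handles writeback would
 * have to call this itself, which is the difficulty with dynamic
 * thread pools mentioned above.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
int mark_thread_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) == -1) {
		int err = errno;

		fprintf(stderr, "PR_SET_IO_FLUSHER failed: %s\n", strerror(err));
		return -err;
	}
	return 0;
}
```

A worker thread would call mark_thread_io_flusher() once at startup, before it begins servicing writeback requests, and decide for itself whether a failure is fatal.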
> >>>
> >>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> >>> for movable allocations.
> >>>
> >>> Mechanisms that possibly turn these folios unmovable for a
> >>> long/indeterminate time must either fail or migrate these folios out of
> >>> these regions, otherwise we start violating the very semantics why
> >>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >>>
> >>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> >>> when allocating a migration destination), but these are not cases that can
> >>> be triggered by (unprivileged) user space easily.
> >>>
> >>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> >>> promise that this is really only "short-term", we will treat it as "possibly
> >>> forever", because it's under user-space control.
> >>>
> >>>
> >>> Instead of having more subsystems violate these semantics because
> >>> "performance" ... I would hope we would do better. Maybe it's an issue for
> >>> NFS as well ("at least" only for privileged user space)? In which case,
> >>> again, I would hope we would do better.
> >>>
> >>>
> >>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> >>> likely right now a lot of people are out (just like I should ;) ).
> >>>
> >>> If I end up being the only one with these concerns, then likely people can
> >>> feel free to ignore them. ;)
> >>
> >> I agree we should do better but IMHO it should be an iterative process.
> >> I think your concerns are valid, so let's push the discussion towards
> >> resolving those concerns. I think the concerns can be resolved by better
> >> handling of lifetime of folios under writeback. The amount of such
> >> folios is already handled through existing dirty throttling mechanism.
> >>
> >> We should start with a baseline i.e.
> >> distribution of lifetime of folios
> >> under writeback for traditional storage devices (spinning disk and SSDs)
> >> as we don't want an unrealistic goal for ourselves. I think this data will
> >> drive the appropriate timeout values (if we decide timeout based
> >> approach is the right one).
> >>
> >> At the moment we have timeout based approach to limit the lifetime of
> >> folios under writeback. Any other ideas?
> >
> > I don't see any other approach that would handle splice, other than
> > modifying the splice code to prevent the underlying buf->page from
> > being migrated while it's being copied out, which seems non-viable to
> > consider. The other alternatives I see are to either a) do the extra
> > temp page copying for splice and "abort" the writeback if migration is
> > triggered or b) gate this to only apply to servers running as
> > privileged. I assume the majority of use cases do use splice, in which
> > case a) would be pointless and would make the internal logic more
> > complicated (eg we would still need the rb tree and would now need to
> > check writeback against the folio writeback state or the rb tree,
> > etc). I'm not sure how useful this would be either if this is just
> > gated to privileged servers.
>
>
> I'm not so sure about that majority of unprivileged servers.
> Try this patch and then run an unprivileged process.
>
> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> index ee0b3b1d0470..adebfbc03d4c 100644
> --- a/lib/fuse_lowlevel.c
> +++ b/lib/fuse_lowlevel.c
> @@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>                         res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize);
>                         if (res == -1) {
>                                 llp->can_grow = 0;
> +                               fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n");
>                                 res = grow_pipe_to_max(llp->pipe[0]);
>                                 if (res > 0)
>                                         llp->size = res;
> @@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>         } else {
>                 /* Don't overwrite buf->mem, as that would cause a leak */
> +               fuse_log(FUSE_LOG_WARNING, "Using splice\n");
>                 buf->fd = tmpbuf.fd;
>                 buf->flags = tmpbuf.flags;
>         }
> @@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>  fallback:
>  #endif
> +       fuse_log(FUSE_LOG_WARNING, "Splice fallback\n");
>         if (!buf->mem) {
>                 buf->mem = buf_alloc(se->bufsize, internal);
>                 if (!buf->mem) {
>
>
> And then run this again after
> sudo sysctl -w fs.pipe-max-size=1052672
>
> (Please don't change '/proc/sys/fs/fuse/max_pages_limit'
> from default).
>
> And now we would need to know how many users either limit
> max-pages + header to fit default pipe-max-size (1MB) or
> increase max_pages_limit. Given there is no warning in
> libfuse about the fallback from splice to buf copy, I doubt
> many people know about that - who would change system
> defaults without the knowledge?
>

My concern is that this would break backward compatibility for the
rare subset of users who use their own custom library instead of
libfuse, expect splice to work as-is, and might not have this built-in
fallback to buffer copies.

Thanks,
Joanne

>
> And then, I still doubt that copy-to-tmp-page-and-splice
> is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer.
> Especially as the tmp page copy is single threaded, I think.
> But needs to be benchmarked.
>
>
> Thanks,
> Bernd
>
>
>
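
For anyone who wants to check the pipe-size limit without patching libfuse, the F_SETPIPE_SZ behavior that the patch above logs can be probed in isolation. This is only a rough sketch: probe_pipe_capacity() is a hypothetical helper, not a libfuse function, and 1052672 (1 MiB + 4096) is simply the sysctl value quoted above. An unprivileged caller asking for more than fs.pipe-max-size is expected to get EPERM, which is the case where libfuse logs "cannot grow pipe".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031	/* Linux-specific fcntl commands, for older headers */
#define F_GETPIPE_SZ 1032
#endif

/*
 * Hypothetical probe: create a throwaway pipe and ask the kernel for
 * `wanted` bytes of capacity.
 *
 * Returns the capacity the kernel actually set (it may round up), or a
 * negative errno value, e.g. -EPERM when an unprivileged caller asks
 * for more than fs.pipe-max-size.
 */
int probe_pipe_capacity(int wanted)
{
	int fds[2];
	int res;

	if (pipe(fds) == -1)
		return -errno;

	res = fcntl(fds[0], F_SETPIPE_SZ, wanted);
	if (res == -1)
		res = -errno;	/* save errno before close() can clobber it */

	close(fds[0]);
	close(fds[1]);
	return res;
}
```

Calling probe_pipe_capacity(1052672) as an unprivileged user, before and after raising fs.pipe-max-size, should show whether a pipe of that size can be obtained on a given system.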