Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree

Joanne Koong <joannelkoong@xxxxxxxxx> · Mon, 28 Oct 2024 14:58:14 -0700

On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > >
> > >
> > > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > > >>
> > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > > >>>
> > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > > >>>
> > > >>>> I feel like this is too much restrictive and I am still not sure why
> > > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > > >>>> than blocking on folios served from the network.
> > > >>>
> > > >>> Might be.  But historically fuse had this behavior and I'd be very
> > > >>> reluctant to change that unconditionally.
> > > >>>
> > > >>> With a systemwide maximal timeout for fuse requests it might make
> > > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > > >>>
> > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > > >>> seems rather risky.
> > > >>
> > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > > >> That seems in line with the sync(2) documentation Jingbo referenced
> > > >> earlier where it states "The writing, although scheduled, is not
> > > >> necessarily complete upon return from sync()."
> > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > > >>
> > > >
> > > > So I think the answer to this is "no" for Linux. What the Linux man
> > > > page for sync(2) says:
> > > >
> > > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > > schedules the writes, but may return before the actual writing is
> > > > done.  However Linux waits for I/O completions, and thus sync() or
> > > > syncfs() provide the same guarantees as fsync() called on every file
> > > > in the system or filesystem respectively." [1]
> > >
> > > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > > completed when sync(2) returns since the temp page mechanism.  When
> > > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > > (in the address_space), while the real writeback work (initiated from
> > > temp page) may be still in progress.
> > >
> >
> > That's a great point. It seems like we can just skip waiting on
> > writeback to finish for fuse folios in sync(2) altogether then. I'll
> > look into what's the best way to do this.
> >
> > > I think this is also what Miklos means in:
> > > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@xxxxxxxxxxxxxx/
> > >
> > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> > > in sync(2) codepath similar to what we have done for the direct reclaim
> > > in patch 1.
> > >
> > >
> > > >
> > > > Regardless of the compaction / page migration issue then, this
> > > > blocking sync(2) is a dealbreaker.
> > >
> > > I really should have figureg out the compaction / page migration
> > > mechanism and the potential impact to FUSE when we dropping the temp
> > > page.  Just too busy to take some time on this though.....
> >
> > Same here, I need to look some more into the compaction / page
> > migration paths. I'm planning to do this early next week and will
> > report back with what I find.
> >
>
> These are my notes so far:
>
> * We hit the folio_wait_writeback() path when callers call
> migrate_pages() with mode MIGRATE_SYNC
>    ... -> migrate_pages() -> migrate_pages_sync() ->
> migrate_pages_batch() -> migrate_folio_unmap() ->
> folio_wait_writeback()
>
> * These are the places where we call migrate_pages():
> 1) demote_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 2) __damon_pa_migrate_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 3) migrate_misplaced_folio()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 4) do_move_pages_to_node()
> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> this path is only invoked by the move_pages() syscall. It's fine to
> wait on writeback for the move_pages() syscall since the user would
> have to deliberately invoke this on the fuse server for this to apply
> to the server's fuse folios
>
> 5)  migrate_to_node()
> Can ignore this for the same reason as in 4. This path is only invoked
> by the migrate_pages() syscall.
>
> 6) do_mbind()
> Can ignore this for the same reason as 4 and 5. This path is only
> invoked by the mbind() syscall.
>
> 7) soft_offline_in_use_page()
> Can skip soft offlining fuse folios (eg folios with the
> AS_NO_WRITEBACK_WAIT mapping flag set).
> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> pages in a well-defined state (see ret value of get_hwpoison_page()).
> My understanding of soft offlining pages is that it's a mitigation
> strategy for handling pages that are experiencing errors but are not
> yet completely unusable, and its main purpose is to prevent future
> issues. It seems fine to skip this for fuse folios.
>
> 8) do_migrate_range()
> 9) compact_zone()
> 10) migrate_longterm_unpinnable_folios()
> 11) __alloc_contig_migrate_range()
>
> 8 to 11 needs more investigation / thinking about. I don't see a good
> way around these tbh. I think we have to operate under the assumption
> that the fuse server running is malicious or benevolently but
> incorrectly written and could possibly never complete writeback. So we
> definitely can't wait on these but it also doesn't seem like we can
> skip waiting on these, especially for the case where the server uses
> spliced pages, nor does it seem like we can just fail these with
> -EBUSY or something.
>

I'm still not seeing a good way around this.

What about this then? We add a new fuse sysctl called something like
"/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
admin sets this, then it opts into optimizing writeback to be as fast
as possible (eg skipping the page copies) and if the server doesn't
fulfill the writeback by the set timeout value, then the connection is
aborted.

Alternatively, we could also repurpose
/proc/sys/fs/fuse/max_request_timeout from the request timeout
patchset [1] but I like the additional flexibility and explicitness
having the "writeback_optimization_timeout" sysctl gives.

Any thoughts on this?

[1] https://lore.kernel.org/linux-fsdevel/20241011191320.91592-4-joannelkoong@xxxxxxxxx/

Thanks,
Joanne

> Will continue looking more into this early next week.
>
> Thanks,
> Joanne
> >
> > Thanks,
> > Joanne
> > >
> > >
> > > --
> > > Thanks,
> > > Jingbo