Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree

Joanne Koong <joannelkoong@xxxxxxxxx> · Wed, 30 Oct 2024 09:04:59 -0700

On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 10/28/24 22:58, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >>
> >>> Same here, I need to look some more into the compaction / page
> >>> migration paths. I'm planning to do this early next week and will
> >>> report back with what I find.
> >>>
> >>
> >> These are my notes so far:
> >>
> >> * We hit the folio_wait_writeback() path when callers call
> >> migrate_pages() with mode MIGRATE_SYNC
> >>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >> migrate_pages_batch() -> migrate_folio_unmap() ->
> >> folio_wait_writeback()
> >>
> >> * These are the places where we call migrate_pages():
> >> 1) demote_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 2) __damon_pa_migrate_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 3) migrate_misplaced_folio()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 4) do_move_pages_to_node()
> >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >> this path is only invoked by the move_pages() syscall. It's fine to
> >> wait on writeback for the move_pages() syscall since the user would
> >> have to deliberately invoke this on the fuse server for this to apply
> >> to the server's fuse folios
> >>
> >> 5)  migrate_to_node()
> >> Can ignore this for the same reason as in 4. This path is only invoked
> >> by the migrate_pages() syscall.
> >>
> >> 6) do_mbind()
> >> Can ignore this for the same reason as 4 and 5. This path is only
> >> invoked by the mbind() syscall.
> >>
> >> 7) soft_offline_in_use_page()
> >> Can skip soft offlining fuse folios (eg folios with the
> >> AS_NO_WRITEBACK_WAIT mapping flag set).
> >> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >> My understanding of soft offlining pages is that it's a mitigation
> >> strategy for handling pages that are experiencing errors but are not
> >> yet completely unusable, and its main purpose is to prevent future
> >> issues. It seems fine to skip this for fuse folios.
> >>
> >> 8) do_migrate_range()
> >> 9) compact_zone()
> >> 10) migrate_longterm_unpinnable_folios()
> >> 11) __alloc_contig_migrate_range()
> >>
> >> 8 to 11 needs more investigation / thinking about. I don't see a good
> >> way around these tbh. I think we have to operate under the assumption
> >> that the fuse server running is malicious or benevolently but
> >> incorrectly written and could possibly never complete writeback. So we
> >> definitely can't wait on these but it also doesn't seem like we can
> >> skip waiting on these, especially for the case where the server uses
> >> spliced pages, nor does it seem like we can just fail these with
> >> -EBUSY or something.
>
> I see some code paths with -EAGAIN in migration. Could you explain why
> we can't just fail migration for fuse write-back pages?
>

My understanding (and please correct me here Shakeel if I'm wrong) is
that this could block system optimizations, especially since if an
unprivileged malicious fuse server never replies to the writeback
request, then this completely stalls progress. In the best case
scenario, -EAGAIN could be used because the server might just be slow
in serving the writeback, but I think we need to also account for
servers that never complete the writeback. For
__alloc_contig_migrate_range() for example, my understanding is that
this is used to migrate pages so that there are more physically
contiguous ranges of memory freed up. If fuse writeback blocks that,
then that hurts system health overall.

> >>
> >
> > I'm still not seeing a good way around this.
> >
> > What about this then? We add a new fuse sysctl called something like
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> > admin sets this, then it opts into optimizing writeback to be as fast
> > as possible (eg skipping the page copies) and if the server doesn't
> > fulfill the writeback by the set timeout value, then the connection is
> > aborted.
> >
> > Alternatively, we could also repurpose
> > /proc/sys/fs/fuse/max_request_timeout from the request timeout
> > patchset [1] but I like the additional flexibility and explicitness
> > having the "writeback_optimization_timeout" sysctl gives.
> >
> > Any thoughts on this?
>
>
> I'm a bit worried that we might lock up the system until time out is
> reached - not ideal. Especially as timeouts are in minutes now. But
> even a slightly stuttering video system not be great. I think we
> should give users/admin the choice then, if they prefer slow page
> copies or fast, but possibly shortly unresponsive system.
>
I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
would be in seconds, where the sys admin would probably set something
more reasonable like 5 seconds or so.
If this syctl value is set, then servers who want writebacks to be
fast can opt into it at mount time (and by doing so agree that they
will service writeback requests by the timeout or their connection
will be aborted).

Thanks,
Joanne
>
> Thank,
> Bernd