On 10/30/24 17:04, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert > <bernd.schubert@xxxxxxxxxxx> wrote: >> >> On 10/28/24 22:58, Joanne Koong wrote: >>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: >>>> >>>>> Same here, I need to look some more into the compaction / page >>>>> migration paths. I'm planning to do this early next week and will >>>>> report back with what I find. >>>>> >>>> >>>> These are my notes so far: >>>> >>>> * We hit the folio_wait_writeback() path when callers call >>>> migrate_pages() with mode MIGRATE_SYNC >>>> ... -> migrate_pages() -> migrate_pages_sync() -> >>>> migrate_pages_batch() -> migrate_folio_unmap() -> >>>> folio_wait_writeback() >>>> >>>> * These are the places where we call migrate_pages(): >>>> 1) demote_folio_list() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 2) __damon_pa_migrate_folio_list() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 3) migrate_misplaced_folio() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 4) do_move_pages_to_node() >>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but >>>> this path is only invoked by the move_pages() syscall. It's fine to >>>> wait on writeback for the move_pages() syscall since the user would >>>> have to deliberately invoke this on the fuse server for this to apply >>>> to the server's fuse folios >>>> >>>> 5) migrate_to_node() >>>> Can ignore this for the same reason as in 4. This path is only invoked >>>> by the migrate_pages() syscall. >>>> >>>> 6) do_mbind() >>>> Can ignore this for the same reason as 4 and 5. This path is only >>>> invoked by the mbind() syscall. >>>> >>>> 7) soft_offline_in_use_page() >>>> Can skip soft offlining fuse folios (eg folios with the >>>> AS_NO_WRITEBACK_WAIT mapping flag set). >>>> The path for this is soft_offline_page() -> soft_offline_in_use_page() >>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use >>>> pages in a well-defined state (see ret value of get_hwpoison_page()). >>>> My understanding of soft offlining pages is that it's a mitigation >>>> strategy for handling pages that are experiencing errors but are not >>>> yet completely unusable, and its main purpose is to prevent future >>>> issues. It seems fine to skip this for fuse folios. >>>> >>>> 8) do_migrate_range() >>>> 9) compact_zone() >>>> 10) migrate_longterm_unpinnable_folios() >>>> 11) __alloc_contig_migrate_range() >>>> >>>> 8 to 11 needs more investigation / thinking about. I don't see a good >>>> way around these tbh. I think we have to operate under the assumption >>>> that the fuse server running is malicious or benevolently but >>>> incorrectly written and could possibly never complete writeback. So we >>>> definitely can't wait on these but it also doesn't seem like we can >>>> skip waiting on these, especially for the case where the server uses >>>> spliced pages, nor does it seem like we can just fail these with >>>> -EBUSY or something. >> >> I see some code paths with -EAGAIN in migration. Could you explain why >> we can't just fail migration for fuse write-back pages? >> Hi Joanne, thanks a lot for your quick reply (especially as my reviews come in very late). > > My understanding (and please correct me here Shakeel if I'm wrong) is > that this could block system optimizations, especially since if an > unprivileged malicious fuse server never replies to the writeback > request, then this completely stalls progress. In the best case > scenario, -EAGAIN could be used because the server might just be slow > in serving the writeback, but I think we need to also account for > servers that never complete the writeback. For > __alloc_contig_migrate_range() for example, my understanding is that > this is used to migrate pages so that there are more physically > contiguous ranges of memory freed up. If fuse writeback blocks that, > then that hurts system health overall. Hmm, I wonder what is worse - tmp page copies or missing compaction. Especially if we expect a low range of in-writeback pages/folios. One could argue that an evil user might spawn many fuse server processes to work around the default low fuse write-back limits, but does that make any difference with tmp pages? And these cannot be compacted either? And with timeouts that would be so far totally uncritical, I think. You also mentioned > especially for the case where the server uses spliced pages could you provide more details for that? > >>>> >>> >>> I'm still not seeing a good way around this. >>> >>> What about this then? We add a new fuse sysctl called something like >>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys >>> admin sets this, then it opts into optimizing writeback to be as fast >>> as possible (eg skipping the page copies) and if the server doesn't >>> fulfill the writeback by the set timeout value, then the connection is >>> aborted. >>> >>> Alternatively, we could also repurpose >>> /proc/sys/fs/fuse/max_request_timeout from the request timeout >>> patchset [1] but I like the additional flexibility and explicitness >>> having the "writeback_optimization_timeout" sysctl gives. >>> >>> Any thoughts on this? >> >> >> I'm a bit worried that we might lock up the system until time out is >> reached - not ideal. Especially as timeouts are in minutes now. But >> even a slightly stuttering video system not be great. I think we >> should give users/admin the choice then, if they prefer slow page >> copies or fast, but possibly shortly unresponsive system. >> > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout > would be in seconds, where the sys admin would probably set something > more reasonable like 5 seconds or so. > If this syctl value is set, then servers who want writebacks to be > fast can opt into it at mount time (and by doing so agree that they > will service writeback requests by the timeout or their connection > will be aborted). I think your current patch set has it in minutes? (Should be easy enough to change that.) Though I'm more worried about the impact of _frequent_ timeout scanning through the different fuse lists on performance, than about missing compaction for folios that are currently in write-back. Thanks, Bernd