On Fri, Nov 22, 2024 at 3:24 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> The purpose of this patchset is to help make writeback-cache write
> performance in FUSE filesystems as fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this
> in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise if
>    reclaim waits on writeback on the dirty page to complete (more details can be
>    found in this thread [1]):
>    * single-threaded FUSE server is in the middle of handling a request
>      that needs a memory allocation
>    * memory allocation triggers direct reclaim
>    * direct reclaim waits on a folio under writeback
>    * the FUSE server can't write back the folio since it's stuck in
>      direct reclaim
> b) in order to unblock internal (eg sync, page compaction) waits on writeback
>    without needing the server to complete writing back to disk, which may take
>    an indeterminate amount of time.
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> With the temp page removed, writeback state is now only cleared on the dirty
> page after the server has written it back to disk. This may take an
> indeterminate amount of time. There is also the possibility of
> malicious or well-intentioned but buggy servers where writeback may, in the
> worst case, never complete. This means that any
> folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> be carefully audited.
>
> In particular, these are the cases that need to be accounted for:
> * potentially deadlocking in reclaim, as mentioned above
> * potentially stalling sync(2)
> * potentially stalling page migration / compaction
>
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> a filesystem may set on its inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will set
> this flag on its mappings. This patchset adds checks to the critical parts of
> reclaim, sync, and page migration logic where writeback may be waited on.
>
> Please note the following:
> * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
>   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
>   be written to disk by the time sync(2) returns (eg writeback is cleared on
>   the dirty page but the server may not have written out the temp page to disk
>   yet). If the caller wishes to ensure the data has actually been synced to
>   disk, they should use fsync(2)/fdatasync(2) instead.
> * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
>   waited on when in writeback. There are some cases where the wait is
>   desirable. For example, for the sync_file_range() syscall, it is fine to
>   wait on the writeback since the caller passes in an fd for the operation.
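
For anyone skimming the thread, the flag semantics described above can be
modeled roughly as follows. This is a userspace sketch only, not the kernel
code from the patches: every name prefixed with sketch_ or SKETCH_ is
hypothetical and exists purely for illustration, and the real checks live in
the reclaim, sync, and migration paths rather than in one helper.

```c
/* Userspace model of the AS_WRITEBACK_INDETERMINATE idea; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mapping flag bit added by the patchset. */
enum sketch_mapping_flags {
	SKETCH_AS_WRITEBACK_INDETERMINATE = 1UL << 0,
};

/* Hypothetical stand-in for struct address_space. */
struct sketch_mapping {
	unsigned long flags;
};

/* The contexts the cover letter names as places where writeback is waited on. */
enum sketch_wait_context {
	SKETCH_CTX_RECLAIM,
	SKETCH_CTX_SYNC,
	SKETCH_CTX_MIGRATION,
	SKETCH_CTX_SYNC_FILE_RANGE,
};

/* A filesystem (FUSE, in the cover letter) opts in per mapping. */
static inline void sketch_set_writeback_indeterminate(struct sketch_mapping *m)
{
	assert(m != NULL);
	m->flags |= SKETCH_AS_WRITEBACK_INDETERMINATE;
}

static inline bool sketch_writeback_indeterminate(const struct sketch_mapping *m)
{
	assert(m != NULL);
	return (m->flags & SKETCH_AS_WRITEBACK_INDETERMINATE) != 0;
}

/*
 * Returns true if the caller may block on writeback for this mapping.
 * Mappings without the flag are always waited on. Flagged mappings are
 * skipped in reclaim, sync(2), and migration, but an explicit
 * per-file request such as sync_file_range() still waits.
 */
static inline bool sketch_may_wait_on_writeback(const struct sketch_mapping *m,
						enum sketch_wait_context ctx)
{
	if (!sketch_writeback_indeterminate(m))
		return true;

	switch (ctx) {
	case SKETCH_CTX_RECLAIM:
	case SKETCH_CTX_SYNC:
	case SKETCH_CTX_MIGRATION:
		return false;
	case SKETCH_CTX_SYNC_FILE_RANGE:
		return true;
	}
	return true;
}
```

The point of keeping the flag per mapping rather than per filesystem type is
that each wait site decides for itself whether blocking is acceptable, which
matches the note that the flag does not mean "never wait".
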
>
> [1]
> https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@xxxxxxxxxxxxxxxxx/
>
> Changelog
> ---------
> v5:
> https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@xxxxxxxxx/
> Changes from v5 -> v6:
> * Add Shakeel and Jingbo's reviewed-bys
> * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
>
> v4:
> https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@xxxxxxxxx/
> Changes from v4 -> v5:
> * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> * Drop memory hotplug patch (David and Shakeel)
> * Remove some more unnecessary writeback waits in fuse code (Jingbo)
> * Make commit message for reclaim patch more concise - drop part about
>   deadlock and just focus on how it may stall waits
>
> v3:
> https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@xxxxxxxxx/
> Changes from v3 -> v4:
> * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
>   readahead
>
> v2:
> https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@xxxxxxxxx/
> Changes from v2 -> v3:
> * Account for sync and page migration cases as well (Miklos)
> * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
>   is enabled
>
> v1:
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/T/#t
> Changes from v1 -> v2:
> * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
>
> Joanne Koong (5):
>   mm: add AS_WRITEBACK_INDETERMINATE mapping flag
>   mm: skip reclaiming folios in legacy memcg writeback indeterminate
>     contexts
>   fs/writeback: in wait_sb_inodes(), skip wait for
>     AS_WRITEBACK_INDETERMINATE mappings
>   mm/migrate: skip migrating folios under writeback with
>     AS_WRITEBACK_INDETERMINATE mappings
>   fuse: remove tmp folio for writebacks and internal rb tree
>
>  fs/fs-writeback.c       |   3 +
>  fs/fuse/file.c          | 360 ++++------------------------------------
>  fs/fuse/fuse_i.h        |   3 -
>  include/linux/pagemap.h |  11 ++
>  mm/migrate.c            |   5 +-
>  mm/vmscan.c             |  10 +-
>  6 files changed, 53 insertions(+), 339 deletions(-)
>

Miklos, may I get your thoughts on this patchset?

Thanks,
Joanne

> --
> 2.43.5
>