On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> The purpose of this patchset is to help make writeback-cache write
> performance in FUSE filesystems as fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to
> the temp page, and the temp page gets handed to the server to write back.
> This is done so that writeback may be immediately cleared on the dirty page,
> and this in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise if
>    reclaim waits on writeback on the dirty page to complete (more details
>    can be found in this thread [1]):
>    * single-threaded FUSE server is in the middle of handling a request
>      that needs a memory allocation
>    * memory allocation triggers direct reclaim
>    * direct reclaim waits on a folio under writeback
>    * the FUSE server can't write back the folio since it's stuck in
>      direct reclaim
> b) in order to unblock internal (eg sync, page compaction) waits on
>    writeback without needing the server to complete writing back to disk,
>    which may take an indeterminate amount of time.
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> With the temp page removed, writeback state is now only cleared on the dirty
> page after the server has written it back to disk. This may take an
> indeterminate amount of time. There is also the possibility of malicious or
> well-intentioned but buggy servers where writeback may, in the worst case,
> never complete. This means that any folio_wait_writeback() on a dirty page
> belonging to a FUSE filesystem needs to be carefully audited.
>
> In particular, these are the cases that need to be accounted for:
> * potentially deadlocking in reclaim, as mentioned above
> * potentially stalling sync(2)
> * potentially stalling page migration / compaction
>
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> a filesystem may set on its inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will
> set this flag on its mappings. This patchset adds checks to the critical
> parts of reclaim, sync, and page migration logic where writeback may be
> waited on.
>
> Please note the following:
> * For sync(2), waiting on writeback will be skipped for FUSE, but this has
>   no effect on existing behavior. Dirty FUSE pages are already not
>   guaranteed to be written to disk by the time sync(2) returns (eg writeback
>   is cleared on the dirty page but the server may not have written out the
>   temp page to disk yet). If the caller wishes to ensure the data has
>   actually been synced to disk, they should use fsync(2)/fdatasync(2)
>   instead.
> * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never
>   be waited on when in writeback. There are some cases where the wait is
>   desirable. For example, for the sync_file_range() syscall, it is fine to
>   wait on the writeback since the caller passes in a fd for the operation.

Looks good, thanks.

Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>

I think this should go via the mm tree.

Thanks,
Miklos
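
For readers following the thread, the flag semantics described in the quoted
cover letter can be sketched as a small userspace C model. This is only an
illustration under stated assumptions, not the kernel code from the patchset:
struct mapping, enum wait_context, mapping_writeback_indeterminate() and
should_wait_on_writeback() are hypothetical names invented here, and only the
flag name AS_WRITEBACK_INDETERMINATE comes from the cover letter.

```c
#include <stdbool.h>

/* Userspace model only. The bit position is arbitrary and hypothetical. */
#define AS_WRITEBACK_INDETERMINATE (1UL << 0)

/* Hypothetical stand-in for a mapping that carries per-inode flags. */
struct mapping {
    unsigned long flags;
};

/* Hypothetical labels for the call sites named in the cover letter. */
enum wait_context {
    CTX_RECLAIM,          /* direct reclaim: waiting could deadlock */
    CTX_SYNC,             /* sync(2): waiting could stall */
    CTX_MIGRATION,        /* page migration / compaction: could stall */
    CTX_SYNC_FILE_RANGE   /* caller passed an fd, so waiting is fine */
};

/* True if the filesystem marked writeback as possibly never completing. */
bool mapping_writeback_indeterminate(const struct mapping *m)
{
    return (m->flags & AS_WRITEBACK_INDETERMINATE) != 0;
}

/*
 * Decide whether a caller in the given context should wait on writeback.
 * Mappings without the flag are always waited on. Flagged mappings are
 * skipped in reclaim, sync and migration, but still waited on where the
 * cover letter says the wait is desirable.
 */
bool should_wait_on_writeback(const struct mapping *m, enum wait_context ctx)
{
    if (!mapping_writeback_indeterminate(m))
        return true;

    switch (ctx) {
    case CTX_RECLAIM:
    case CTX_SYNC:
    case CTX_MIGRATION:
        return false;
    case CTX_SYNC_FILE_RANGE:
        return true;
    }
    return true;
}
```

The point of the model is that the flag does not mean "never wait": it only
changes the decision at the call sites where an unbounded wait would deadlock
or stall unrelated work.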