On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Joanne,
>
> On 11/8/24 7:56 AM, Joanne Koong wrote:
> > Currently, we allocate and copy data to a temporary folio when
> > handling writeback in order to mitigate the following deadlock
> > scenario that may arise if reclaim waits on writeback to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> >
> > To work around this, we allocate a temporary folio and copy over the
> > original folio to the temporary folio so that writeback can be
> > immediately cleared on the original folio. This additionally requires
> > us to maintain an internal rb tree to keep track of writeback state
> > on the temporary folios.
> >
> > A recent change prevents reclaim logic from waiting on writeback for
> > folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> > This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> > will prevent FUSE folios from running into the reclaim deadlock
> > described above) and removes the temporary folio + extra copying and
> > the internal rb tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> > bs =       1k          4k          1M
> > Before     351 MiB/s   1818 MiB/s  1851 MiB/s
> > After      341 MiB/s   2246 MiB/s  2685 MiB/s
> > % diff     -3%         23%         45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
>
> I think there are some places that check or wait on writeback which
> could be reconsidered -- whether they are still needed or not -- now
> that we are dropping the temp page design.
>
> As they are inherited from the original implementation, they are at
> least harmless. I think they could be retained in this patch and
> cleaned up later if really needed.
>

Thank you for the thorough inspection!

>
> > @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >  	 * have writeback that extends beyond the lifetime of the folio. So
> >  	 * make sure we read a properly synced folio.
> >  	 */
> > -	fuse_wait_on_folio_writeback(inode, folio);
> > +	folio_wait_writeback(folio);
>
> I doubt that wait-on-writeback is needed here, as the page cache now
> won't be freed until the writeback IO completes.
>
> The routines that attempt to free page cache, e.g.
> invalidate_inode_pages2() (generally called by distributed filesystems
> when the file content has been modified remotely) or
> truncate_inode_pages() (called from truncate(2) or inode eviction),
> will wait for writeback completion (if any) before freeing the page.
>
> Thus I don't think there's any possibility that .read_folio() or
> .readahead() will be called when the writeback has not completed.
>

Great point. I'll remove this line and the comment above it too.
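
For the archives, my mental model of the ordering Jingbo describes, as
a simplified sketch (evict_one_folio() is a made-up name standing in
for the real per-folio logic in mm/truncate.c, with error handling
elided):

/* Rough sketch, not the actual mm code. */
static void evict_one_folio(struct folio *folio)
{
	folio_lock(folio);
	/*
	 * Any folio still in the page cache has its writeback waited
	 * on before it can be dropped from the mapping. Since
	 * PG_writeback is now only cleared when the server completes
	 * the write (no temp folio), a later .read_folio() or
	 * .readahead() on a fresh folio can't observe a page whose
	 * writeback is still pending.
	 */
	folio_wait_writeback(folio);
	filemap_remove_folio(folio);
	folio_unlock(folio);
}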
>
> > @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >  	int err;
> >
> >  	for (i = 0; i < ap->num_folios; i++)
> > -		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> > +		folio_wait_writeback(ap->folios[i]);
>
> Ditto.
>

Why did we need this fuse_wait_on_folio_writeback() even when we had
the temp pages? If I'm understanding it correctly,
fuse_send_write_pages() is only called for the writethrough case (by
fuse_perform_write()), so these folios would never even be under
writeback, no?

> >
> >  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> > -					  struct folio *tmp_folio, uint32_t folio_index)
> > +					  uint32_t folio_index)
> >  {
> >  	struct inode *inode = folio->mapping->host;
> >  	struct fuse_args_pages *ap = &wpa->ia.ap;
> >
> > -	folio_copy(tmp_folio, folio);
> > -
> > -	ap->folios[folio_index] = tmp_folio;
> > +	folio_get(folio);
>
> I still think this folio_get() here is harmless but redundant.
>
> Ditto: the page cache won't be freed before writeback completes.
>
> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> also don't take a refcount when constructing the writeback IO.
>

Point taken. I'll remove this then, since other .writepages() also
don't obtain a refcount.

>
> > @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >  	if (IS_ERR(folio))
> >  		goto error;
> >
> > -	fuse_wait_on_page_writeback(mapping->host, folio->index);
> > +	folio_wait_writeback(folio);
>
> I also doubt that wait_on_writeback() is needed here, as now there
> won't be duplicate writeback IOs for the same page.

What prevents a duplicate writeback for the same page? This is the
path I'm looking at:

ksys_write()
  vfs_write()
    new_sync_write()
      op->write_iter()
        fuse_file_write_iter()
          fuse_cache_write_iter()
            generic_file_write_iter()
              __generic_file_write_iter()
                generic_perform_write()
                  op->write_begin()
                    fuse_write_begin()

but I'm not seeing anything that prevents a duplicate write from
happening.

> >
> > @@ -2545,13 +2264,11 @@ static int fuse_launder_folio(struct folio *folio)
> >  {
> >  	int err = 0;
> >  	if (folio_clear_dirty_for_io(folio)) {
> > -		struct inode *inode = folio->mapping->host;
> > -
> >  		/* Serialize with pending writeback for the same page */
> > -		fuse_wait_on_page_writeback(inode, folio->index);
> > +		folio_wait_writeback(folio);
>
> I think folio_wait_writeback() is unneeded after dropping the temp
> page copying. This was introduced in commit
> 3993382bb3198cc5e263c3519418e716bd57b056 ("fuse: launder page should
> wait for page writeback"), since .launder_page() could be called when
> the previous writeback of the same page had not completed yet. Since
> we now won't clear PG_writeback until the writeback completes,
> .launder_page() won't be called on the same page while the
> corresponding writeback IO is still inflight.
>

Nice catch, I'll remove this in v4.

Thanks,
Joanne

> --
> Thanks,
> Jingbo
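
P.S. For v4, I'd expect the fill helper to shrink to roughly the
following (untested sketch -- the descs bookkeeping is assumed from
the current code, and the per-bdi writeback stats accounting is
elided):

static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa,
					  struct folio *folio,
					  uint32_t folio_index)
{
	struct fuse_args_pages *ap = &wpa->ia.ap;

	/*
	 * No folio_copy() into a temp folio and no folio_get(): the
	 * page cache keeps the folio alive until PG_writeback is
	 * cleared on writeback completion.
	 */
	ap->folios[folio_index] = folio;
	ap->descs[folio_index].offset = 0;
	ap->descs[folio_index].length = PAGE_SIZE;
	/* ... writeback stats accounting unchanged ... */
}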