On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > > > On 11/15/24 2:19 AM, Joanne Koong wrote: > > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > >> > >> > >> > >> On 11/14/24 8:39 AM, Joanne Koong wrote: > >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > >>>> > >>>> Hi Joanne, > >>>> > >>>> On 11/8/24 7:56 AM, Joanne Koong wrote: > >>>>> Currently, we allocate and copy data to a temporary folio when > >>>>> handling writeback in order to mitigate the following deadlock scenario > >>>>> that may arise if reclaim waits on writeback to complete: > >>>>> * single-threaded FUSE server is in the middle of handling a request > >>>>> that needs a memory allocation > >>>>> * memory allocation triggers direct reclaim > >>>>> * direct reclaim waits on a folio under writeback > >>>>> * the FUSE server can't write back the folio since it's stuck in > >>>>> direct reclaim > >>>>> > >>>>> To work around this, we allocate a temporary folio and copy over the > >>>>> original folio to the temporary folio so that writeback can be > >>>>> immediately cleared on the original folio. This additionally requires us > >>>>> to maintain an internal rb tree to keep track of writeback state on the > >>>>> temporary folios. > >>>>> > >>>>> A recent change prevents reclaim logic from waiting on writeback for > >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it. > >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which > >>>>> will prevent FUSE folios from running into the reclaim deadlock described > >>>>> above) and removes the temporary folio + extra copying and the internal > >>>>> rb tree. > >>>>> > >>>>> fio benchmarks -- > >>>>> (using averages observed from 10 runs, throwing away outliers) > >>>>> > >>>>> Setup: > >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount > >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount > >>>>> > >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G > >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount > >>>>> > >>>>> bs = 1k 4k 1M > >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > >>>>> % diff -3% 23% 45% > >>>>> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > >>>> > >>>> I think there are some places checking or waiting for writeback could be > >>>> reconsidered if they are still needed or not after we dropping the temp > >>>> page design. > >>>> > >>>> As they are inherited from the original implementation, at least they > >>>> are harmless. I think they could be remained in this patch, and could > >>>> be cleaned up later if really needed. > >>>> > >>> > >>> Thank you for the thorough inspection! > >>> > >>>> > >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio) > >>>>> * have writeback that extends beyond the lifetime of the folio. So > >>>>> * make sure we read a properly synced folio. > >>>>> */ > >>>>> - fuse_wait_on_folio_writeback(inode, folio); > >>>>> + folio_wait_writeback(folio); > >>>> > >>>> I doubt if wait-on-writeback is needed here, as now page cache won't be > >>>> freed until the writeback IO completes. > >>>> > >>>> The routine attempts to free page cache, e.g. invalidate_inode_pages2() > >>>> (generally called by distributed filesystems when the file content has > >>>> been modified from remote) or truncate_inode_pages() (called from > >>>> truncate(2) or inode eviction) will wait for writeback completion (if > >>>> any) before freeing page. > >>>> > >>>> Thus I don't think there's any possibility that .read_folio() or > >>>> .readahead() will be called when the writeback has not completed. > >>>> > >>> > >>> Great point. I'll remove this line and the comment above it too. > >>> > >>>> > >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia, > >>>>> int err; > >>>>> > >>>>> for (i = 0; i < ap->num_folios; i++) > >>>>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]); > >>>>> + folio_wait_writeback(ap->folios[i]); > >>>> > >>>> Ditto. > >> > >> Actually this is a typo and I originally meant that waiting for > >> writeback in fuse_send_readpages() is unneeded as page cache won't be > >> freed until the writeback IO completes. > >> > >>> - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last)); > >>> + filemap_fdatawait_range(inode->i_mapping, first, last); > >> > > > > Gotcha and agreed. I'll remove this wait from readahead(). > > > >> > >> In fact the above waiting for writeback in fuse_send_write_pages() is > >> needed. The reason is as follows: > >> > >>>> > >>> > >>> Why did we need this fuse_wait_on_folio_writeback() even when we had > >>> the temp pages? If I'm understanding it correctly, > >>> fuse_send_write_pages() is only called for the writethrough case (by > >>> fuse_perform_write()), so these folios would never even be under > >>> writeback, no? > >> > >> I think mmap write could make the page dirty and the writeback could be > >> triggered then. > >> > > > > Ohhh I see, thanks for the explanation. > > > >>> > >>>> > >>>> > >>>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, > >>>>> - struct folio *tmp_folio, uint32_t folio_index) > >>>>> + uint32_t folio_index) > >>>>> { > >>>>> struct inode *inode = folio->mapping->host; > >>>>> struct fuse_args_pages *ap = &wpa->ia.ap; > >>>>> > >>>>> - folio_copy(tmp_folio, folio); > >>>>> - > >>>>> - ap->folios[folio_index] = tmp_folio; > >>>>> + folio_get(folio); > >>>> > >>>> I still think this folio_get() here is harmless but redundant. > >>>> > >>>> Ditto page cache won't be freed before writeback completes. > >>>> > >>>> Besides, other .writepages() implementaions e.g. iomap_writepages() also > >>>> doen't get the refcount when constructing the writeback IO. > >>>> > >>> > >>> Point taken. I'll remove this then, since other .writepages() also > >>> don't obtain a refcount. > >>> > >>>> > >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping, > >>>>> if (IS_ERR(folio)) > >>>>> goto error; > >>>>> > >>>>> - fuse_wait_on_page_writeback(mapping->host, folio->index); > >>>>> + folio_wait_writeback(folio); > >>>> > >>>> I also doubt if wait_on_writeback() is needed here, as now there won't > >>>> be duplicate writeback IOs for the same page. > >>> > >>> What prevents there from being a duplicate writeback write for the > >>> same page? This is the path I'm looking at: > >>> > >>> ksys_write() > >>> vfs_write() > >>> new_sync_write() > >>> op->write_iter() > >>> fuse_file_write_iter() > >>> fuse_cache_write_iter() > >>> generic_file_write_iter() > >>> __generic_file_write_iter() > >>> generic_perform_write() > >>> op->write_begin() > >>> fuse_write_begin() > >>> > >>> but I'm not seeing where there's anything that prevents a duplicate > >>> write from happening. > >> > >> I mean there won't be duplicate *writeback* rather than *write* for the > >> same page. You could write the page cache and make it dirty at the time > >> when the writeback for the same page is still on going, as long as we > >> can ensure that even when the page is dirtied again, there won't be a > >> duplicate writeback IO for the same page when the previous writeback IO > >> has not completed yet. > >> > > > > I think we still need this folio_wait_writeback() since we're calling > > fuse_do_readfolio() and removing the folio_wait_writeback() from > > fuse_do_readfolio(). else we could read back stale data if the > > writeback hasn't gone through yet. > > I think we could probably move the folio_wait_writeback() here in > > fuse_write_begin() to be right before the fuse_do_readfolio() call and > > skip waiting on writeback if we hit the "success" gotos. > > IIUC if cache hit when fuse_write_begin() is called, the page is marked > with PG_update which indicates that (.readpage()) the IO request reading > data from disk to page is already done. Thus fuse_write_begin() will do > nothing and return directly. > > > if (folio_test_uptodate(folio) || len >= folio_size(folio)) > > goto success; > > Otherwise fuse_do_readfolio() is called to read data when cache miss, > i.e. the page doesn't exist in the page cache before or it gets > invalidated previously. Since we can ensure that no page can be > invalidated until the writeback IO completes, we can ensure that there's > no in-flight writeback IO when fuse_do_readfolio() is called. > > Other folks can also correct me if I get wrong, since I also attempts to > figure out all these details about the page cache management recently. > You're right, this writeback wait is unneeded. I'll remove this for v5. Thanks, Joanne > > -- > Thanks, > Jingbo