On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Joanne,
>
> On 11/8/24 7:56 AM, Joanne Koong wrote:
> > Currently, we allocate and copy data to a temporary folio when
> > handling writeback in order to mitigate the following deadlock
> > scenario that may arise if reclaim waits on writeback to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> >
> > To work around this, we allocate a temporary folio and copy over the
> > original folio to the temporary folio so that writeback can be
> > immediately cleared on the original folio. This additionally requires
> > us to maintain an internal rb tree to keep track of writeback state
> > on the temporary folios.
> >
> > A recent change prevents reclaim logic from waiting on writeback for
> > folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> > This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> > will prevent FUSE folios from running into the reclaim deadlock
> > described above) and removes the temporary folio + extra copying and
> > the internal rb tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> > bs =       1k          4k          1M
> > Before     351 MiB/s   1818 MiB/s  1851 MiB/s
> > After      341 MiB/s   2246 MiB/s  2685 MiB/s
> > % diff     -3%         23%         45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
>
> I think there are some places that check or wait on writeback which
> could be reconsidered -- whether they are still needed or not -- now
> that we are dropping the temp page design.
>
> As they are inherited from the original implementation, they are at
> least harmless. I think they could be retained in this patch and
> cleaned up later if really needed.
>

Thank you for the thorough inspection!

>
> > @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >  	 * have writeback that extends beyond the lifetime of the folio. So
> >  	 * make sure we read a properly synced folio.
> >  	 */
> > -	fuse_wait_on_folio_writeback(inode, folio);
> > +	folio_wait_writeback(folio);
>
> I doubt that wait-on-writeback is needed here, as the page cache now
> won't be freed until the writeback IO completes.
>
> The routines that attempt to free page cache, e.g.
> invalidate_inode_pages2() (generally called by distributed filesystems
> when the file content has been modified remotely) or
> truncate_inode_pages() (called from truncate(2) or inode eviction),
> will wait for writeback completion (if any) before freeing the page.
>
> Thus I don't think there's any possibility that .read_folio() or
> .readahead() will be called when the writeback has not completed.
>

Great point. I'll remove this line and the comment above it too.
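
For the archives, my mental model of the ordering Jingbo describes, as
a simplified sketch (evict_one_folio() is a made-up name standing in
for the real per-folio logic in mm/truncate.c, with error handling
elided):

/* Rough sketch, not the actual mm code. */
static void evict_one_folio(struct folio *folio)
{
	folio_lock(folio);
	/*
	 * Any folio still in the page cache has its writeback waited
	 * on before it can be dropped from the mapping. Since
	 * PG_writeback is now only cleared when the server completes
	 * the write (no temp folio), a later .read_folio() or
	 * .readahead() on a fresh folio can't observe a page whose
	 * writeback is still pending.
	 */
	folio_wait_writeback(folio);
	filemap_remove_folio(folio);
	folio_unlock(folio);
}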
>
> > @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >  	int err;
> >
> >  	for (i = 0; i < ap->num_folios; i++)
> > -		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> > +		folio_wait_writeback(ap->folios[i]);
>
> Ditto.
>

Why did we need this fuse_wait_on_folio_writeback() even when we had
the temp pages? If I'm understanding it correctly,
fuse_send_write_pages() is only called for the writethrough case (by
fuse_perform_write()), so these folios would never even be under
writeback, no?

> >
> >  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> > -					  struct folio *tmp_folio, uint32_t folio_index)
> > +					  uint32_t folio_index)
> >  {
> >  	struct inode *inode = folio->mapping->host;
> >  	struct fuse_args_pages *ap = &wpa->ia.ap;
> >
> > -	folio_copy(tmp_folio, folio);
> > -
> > -	ap->folios[folio_index] = tmp_folio;
> > +	folio_get(folio);
>
> I still think this folio_get() here is harmless but redundant.
>
> Ditto: the page cache won't be freed before writeback completes.
>
> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> also don't take a refcount when constructing the writeback IO.
>

Point taken. I'll remove this then, since other .writepages() also
don't obtain a refcount.

>
> > @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >  	if (IS_ERR(folio))
> >  		goto error;
> >
> > -	fuse_wait_on_page_writeback(mapping->host, folio->index);
> > +	folio_wait_writeback(folio);
>
> I also doubt that wait_on_writeback() is needed here, as now there
> won't be duplicate writeback IOs for the same page.

What prevents a duplicate writeback for the same page? This is the
path I'm looking at:

ksys_write()
  vfs_write()
    new_sync_write()
      op->write_iter()
        fuse_file_write_iter()
          fuse_cache_write_iter()
            generic_file_write_iter()
              __generic_file_write_iter()
                generic_perform_write()
                  op->write_begin()
                    fuse_write_begin()

but I'm not seeing anything that prevents a duplicate write from
happening.

> >
> > @@ -2545,13 +2264,11 @@ static int fuse_launder_folio(struct folio *folio)
> >  {
> >  	int err = 0;
> >  	if (folio_clear_dirty_for_io(folio)) {
> > -		struct inode *inode = folio->mapping->host;
> > -
> >  		/* Serialize with pending writeback for the same page */
> > -		fuse_wait_on_page_writeback(inode, folio->index);
> > +		folio_wait_writeback(folio);
>
> I think folio_wait_writeback() is unneeded after dropping the temp
> page copying. This was introduced in commit
> 3993382bb3198cc5e263c3519418e716bd57b056 ("fuse: launder page should
> wait for page writeback"), since .launder_page() could be called when
> the previous writeback of the same page had not completed yet. Since
> we now won't clear PG_writeback until the writeback completes,
> .launder_page() won't be called on the same page while the
> corresponding writeback IO is still inflight.
>

Nice catch, I'll remove this in v4.

Thanks,
Joanne

> --
> Thanks,
> Jingbo
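
P.S. For v4, I'd expect the fill helper to shrink to roughly the
following (untested sketch -- the descs bookkeeping is assumed from
the current code, and the per-bdi writeback stats accounting is
elided):

static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa,
					  struct folio *folio,
					  uint32_t folio_index)
{
	struct fuse_args_pages *ap = &wpa->ia.ap;

	/*
	 * No folio_copy() into a temp folio and no folio_get(): the
	 * page cache keeps the folio alive until PG_writeback is
	 * cleared on writeback completion.
	 */
	ap->folios[folio_index] = folio;
	ap->descs[folio_index].offset = 0;
	ap->descs[folio_index].length = PAGE_SIZE;
	/* ... writeback stats accounting unchanged ... */
}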