Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree

Joanne Koong <joannelkoong@xxxxxxxxx> · Fri, 15 Nov 2024 10:29:56 -0800

On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 11/15/24 2:19 AM, Joanne Koong wrote:
> > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 11/14/24 8:39 AM, Joanne Koong wrote:
> >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>>>> Currently, we allocate and copy data to a temporary folio when
> >>>>> handling writeback in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>   that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>   direct reclaim
> >>>>>
> >>>>> To work around this, we allocate a temporary folio and copy over the
> >>>>> original folio to the temporary folio so that writeback can be
> >>>>> immediately cleared on the original folio. This additionally requires us
> >>>>> to maintain an internal rb tree to keep track of writeback state on the
> >>>>> temporary folios.
> >>>>>
> >>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>>>> will prevent FUSE folios from running into the reclaim deadlock described
> >>>>> above) and removes the temporary folio + extra copying and the internal
> >>>>> rb tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>>         bs =  1k          4k            1M
> >>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>> % diff        -3%          23%         45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>>>
> >>>> I think there are some places checking or waiting for writeback could be
> >>>> reconsidered if they are still needed or not after we dropping the temp
> >>>> page design.
> >>>>
> >>>> As they are inherited from the original implementation, at least they
> >>>> are harmless.  I think they could be remained in this patch, and could
> >>>> be cleaned up later if really needed.
> >>>>
> >>>
> >>> Thank you for the thorough inspection!
> >>>
> >>>>
> >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>>>        * have writeback that extends beyond the lifetime of the folio.  So
> >>>>>        * make sure we read a properly synced folio.
> >>>>>        */
> >>>>> -     fuse_wait_on_folio_writeback(inode, folio);
> >>>>> +     folio_wait_writeback(folio);
> >>>>
> >>>> I doubt if wait-on-writeback is needed here, as now page cache won't be
> >>>> freed until the writeback IO completes.
> >>>>
> >>>> The routine attempts to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified from remote) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction) will wait for writeback completion (if
> >>>> any) before freeing page.
> >>>>
> >>>> Thus I don't think there's any possibility that .read_folio() or
> >>>> .readahead() will be called when the writeback has not completed.
> >>>>
> >>>
> >>> Great point. I'll remove this line and the comment above it too.
> >>>
> >>>>
> >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>>>       int err;
> >>>>>
> >>>>>       for (i = 0; i < ap->num_folios; i++)
> >>>>> -             fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>>>> +             folio_wait_writeback(ap->folios[i]);
> >>>>
> >>>> Ditto.
> >>
> >> Actually this is a typo and I originally meant that waiting for
> >> writeback in fuse_send_readpages() is unneeded as page cache won't be
> >> freed until the writeback IO completes.
> >>
> >>> -     wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> >>> +     filemap_fdatawait_range(inode->i_mapping, first, last);
> >>
> >
> > Gotcha and agreed. I'll remove this wait from readahead().
> >
> >>
> >> In fact the above waiting for writeback in fuse_send_write_pages() is
> >> needed.  The reason is as follows:
> >>
> >>>>
> >>>
> >>> Why did we need this fuse_wait_on_folio_writeback() even when we had
> >>> the temp pages? If I'm understanding it correctly,
> >>> fuse_send_write_pages() is only called for the writethrough case (by
> >>> fuse_perform_write()), so these folios would never even be under
> >>> writeback, no?
> >>
> >> I think mmap write could make the page dirty and the writeback could be
> >> triggered then.
> >>
> >
> > Ohhh I see, thanks for the explanation.
> >
> >>>
> >>>>
> >>>>
> >>>>>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>>>> -                                       struct folio *tmp_folio, uint32_t folio_index)
> >>>>> +                                       uint32_t folio_index)
> >>>>>  {
> >>>>>       struct inode *inode = folio->mapping->host;
> >>>>>       struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>>>
> >>>>> -     folio_copy(tmp_folio, folio);
> >>>>> -
> >>>>> -     ap->folios[folio_index] = tmp_folio;
> >>>>> +     folio_get(folio);
> >>>>
> >>>> I still think this folio_get() here is harmless but redundant.
> >>>>
> >>>> Ditto page cache won't be freed before writeback completes.
> >>>>
> >>>> Besides, other .writepages() implementaions e.g. iomap_writepages() also
> >>>> doen't get the refcount when constructing the writeback IO.
> >>>>
> >>>
> >>> Point taken. I'll remove this then, since other .writepages() also
> >>> don't obtain a refcount.
> >>>
> >>>>
> >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>>>       if (IS_ERR(folio))
> >>>>>               goto error;
> >>>>>
> >>>>> -     fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>>>> +     folio_wait_writeback(folio);
> >>>>
> >>>> I also doubt if wait_on_writeback() is needed here, as now there won't
> >>>> be duplicate writeback IOs for the same page.
> >>>
> >>> What prevents there from being a duplicate writeback write for the
> >>> same page? This is the path I'm looking at:
> >>>
> >>> ksys_write()
> >>>   vfs_write()
> >>>     new_sync_write()
> >>>        op->write_iter()
> >>>           fuse_file_write_iter()
> >>>             fuse_cache_write_iter()
> >>>                generic_file_write_iter()
> >>>                    __generic_file_write_iter()
> >>>                        generic_perform_write()
> >>>                            op->write_begin()
> >>>                                fuse_write_begin()
> >>>
> >>> but I'm not seeing where there's anything that prevents a duplicate
> >>> write from happening.
> >>
> >> I mean there won't be duplicate *writeback* rather than *write* for the
> >> same page.  You could write the page cache and make it dirty at the time
> >> when the writeback for the same page is still on going, as long as we
> >> can ensure that even when the page is dirtied again, there won't be a
> >> duplicate writeback IO for the same page when the previous writeback IO
> >> has not completed yet.
> >>
> >
> > I think we still need this folio_wait_writeback() since we're calling
> > fuse_do_readfolio() and removing the folio_wait_writeback() from
> > fuse_do_readfolio(). else we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to be right before the fuse_do_readfolio() call and
> > skip waiting on writeback if we hit the "success" gotos.
>
> IIUC if cache hit when fuse_write_begin() is called, the page is marked
> with PG_update which indicates that (.readpage()) the IO request reading
> data from disk to page is already done.  Thus fuse_write_begin() will do
> nothing and return directly.
>
> >       if (folio_test_uptodate(folio) || len >= folio_size(folio))
> >               goto success;
>
> Otherwise fuse_do_readfolio() is called to read data when cache miss,
> i.e. the page doesn't exist in the page cache before or it gets
> invalidated previously.  Since we can ensure that no page can be
> invalidated until the writeback IO completes, we can ensure that there's
> no in-flight writeback IO when fuse_do_readfolio() is called.
>
> Other folks can also correct me if I get wrong, since I also attempts to
> figure out all these details about the page cache management recently.
>

You're right, this writeback wait is unneeded. I'll remove this for v5.

Thanks,
Joanne
>
> --
> Thanks,
> Jingbo