Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback

On Fri, Dec 13, 2024 at 8:47 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> +Andrew
>
> On Fri, Dec 13, 2024 at 12:52:44PM +0100, Miklos Szeredi wrote:
> > On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > >
> > > The purpose of this patchset is to help make writeback-cache write
> > > performance in FUSE filesystems as fast as possible.
> > >
> > > In the current FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap")), a temp page is allocated for every dirty
> > > page to be written back, the contents of the dirty page are copied over to the
> > > temp page, and the temp page gets handed to the server to write back. This is
> > > done so that writeback may be immediately cleared on the dirty page, and this
> > > in turn is done for two reasons:
> > > a) in order to mitigate the following deadlock scenario that may arise if
> > > reclaim waits on writeback on the dirty page to complete (more details can be
> > > found in this thread [1]):
> > > * single-threaded FUSE server is in the middle of handling a request
> > >   that needs a memory allocation
> > > * memory allocation triggers direct reclaim
> > > * direct reclaim waits on a folio under writeback
> > > * the FUSE server can't write back the folio since it's stuck in
> > >   direct reclaim
> > > b) in order to unblock internal (eg sync, page compaction) waits on writeback
> > > without needing the server to complete writing back to disk, which may take
> > > an indeterminate amount of time.
> > >
> > > Allocating and copying dirty pages to temp pages is the biggest performance
> > > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > > that is needed to keep track of writeback status on the temp pages).
> > > Benchmarks show approximately a 20% improvement in throughput for 4k
> > > block-size writes and a 45% improvement for 1M block-size writes.
> > >
> > > With removing the temp page, writeback state is now only cleared on the dirty
> > > page after the server has written it back to disk. This may take an
> > > indeterminate amount of time. There is also the possibility of malicious
> > > or well-intentioned but buggy servers where writeback may, in the worst
> > > case, never complete. This means that any
> > > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> > > be carefully audited.
> > >
> > > In particular, these are the cases that need to be accounted for:
> > > * potentially deadlocking in reclaim, as mentioned above
> > > * potentially stalling sync(2)
> > > * potentially stalling page migration / compaction
> > >
> > > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > > filesystems may set on their inode mappings to indicate that writeback
> > > operations may take an indeterminate amount of time to complete. FUSE will set
> > > this flag on its mappings. This patchset adds checks to the critical parts of
> > > reclaim, sync, and page migration logic where writeback may be waited on.
> > >
> > > Please note the following:
> > > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
> > >   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
> > >   be written to disk by the time sync(2) returns (eg writeback is cleared on
> > >   the dirty page but the server may not have written out the temp page to disk
> > >   yet). If the caller wishes to ensure the data has actually been synced to
> > >   disk, they should use fsync(2)/fdatasync(2) instead.
> > > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
> > >   waited on when in writeback. There are some cases where the wait is
> > >   desirable. For example, for the sync_file_range() syscall, it is fine to
> > >   wait on the writeback since the caller passes in a fd for the operation.
> >
> > Looks good, thanks.
> >
> > Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> >
> > I think this should go via the mm tree.
>
> Andrew, can you please pick this series up, or should Joanne send an updated
> version with all Ack/Review tags collected? Let us know what you prefer.
>

Hi Andrew,

Could you let us know your preference or if there's anything else you
need from us to proceed?


Thanks,
Joanne

> Thanks,
> Shakeel
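
For readers following the thread, the flag check described in the quoted
cover letter can be pictured with a small userspace model. This is only a
sketch under stated assumptions, not code from the patchset: the struct
types, the MODEL_ prefix, and both helper names are hypothetical, and only
the flag name AS_WRITEBACK_INDETERMINATE comes from the cover letter. The
model covers a reclaim-style caller that must not block on writeback that
may never finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these are NOT the kernel's types or helpers. */
enum model_mapping_flags {
    /* Stands in for the AS_WRITEBACK_INDETERMINATE mapping flag. */
    MODEL_AS_WRITEBACK_INDETERMINATE = 1u << 0,
};

struct model_mapping {
    unsigned int flags;
};

struct model_folio {
    const struct model_mapping *mapping;
    bool under_writeback;
};

/* True if writeback on this mapping may take an indeterminate time. */
static bool model_writeback_indeterminate(const struct model_mapping *m)
{
    return m != NULL && (m->flags & MODEL_AS_WRITEBACK_INDETERMINATE) != 0;
}

/*
 * Reclaim-style decision: wait on writeback only when it is expected to
 * complete. For a flagged mapping the caller skips the folio instead of
 * blocking, which is how the deadlock from the cover letter is avoided.
 */
static bool model_reclaim_may_wait(const struct model_folio *f)
{
    assert(f != NULL);
    if (!f->under_writeback)
        return false; /* nothing to wait for */
    return !model_writeback_indeterminate(f->mapping);
}
```

In this model, a folio whose mapping carries the flag is never waited on by
the reclaim-style caller, while a folio on an unflagged mapping is. A caller
in the position of sync_file_range() would simply not consult the flag, since
the cover letter notes that waiting is acceptable there.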




