On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
>
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
>
> fuse_writepages_fill
>     alloc tmp_page
>     copy_highpage
>
> This is because of the FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which newly allocates a temp page for
> each dirty page to be written back, copies the content of the dirty
> page to the temp page, and then writes back the temp page instead.
> This special design is intentional, to avoid potential deadlocks due
> to a buggy or even malicious fuse user daemon.

I also noticed that, and I admit that I don't understand it yet. The
commit says

<quote>
The basic problem is that there can be no guarantee about the time in
which the userspace filesystem will complete a write. It may be buggy
or even malicious, and fail to complete WRITE requests. We don't want
unrelated parts of the system to grind to a halt in such cases.
</quote>

Timing - NFS/cifs/etc have the same issue? Even a local file system has
no guarantees how fast storage is?

Buggy - hmm, yeah, but then is it splice related only? I think the
splice feature was not introduced yet when fuse got mmap and writeback
in 2008? Without splice the pages are just copied into a userspace
buffer, so what can userspace do wrong with its copy?

Failure - why can't it do what nfs_mapping_set_error() does?

I guess I miss something, but so far I don't understand what that is.

> There was a proposal to remove this constraint for virtiofs [1], which
> is reasonable as users of virtiofs and the virtiofs daemon don't run
> on the same OS, and the virtiofs daemon is usually offered by cloud
> vendors that shall not be malicious. For the normal /dev/fuse
> interface, though, I don't think removing the constraint is
> acceptable.
>
> Come back to the writeback performance bottleneck.
> Another important factor is that (IIUC) only one kworker at a time is
> allowed to do writeback for each filesystem instance (if cgroup
> writeback is not enabled). The kworker is scheduled upon
> sb->s_bdi->wb.dwork, and the workqueue infrastructure guarantees that
> at most one *running* worker is allowed for one specific work
> (sb->s_bdi->wb.dwork) at any time. Thus the writeback is constrained
> to one CPU for each filesystem instance.
>
> I'm not sure if offloading the page copying and then the FUSE request
> sending to another worker (once a bunch of dirty pages have been
> collected) is a good idea or not, e.g.
>
> ```
> fuse_writepages_fill
>     if fuse_writepage_need_send:
>         # schedule a work
>
> # the worker
> for each dirty page in ap->pages[]:
>     copy_page
> fuse_writepages_send
> ```
>
> Any suggestion?
>
> This issue can be reproduced by:
>
> 1. ./libfuse/build/example/passthrough_ll -o cache=always -o writeback \
>        -o source=/run/ /mnt
>    ("/run/" is a tmpfs mount)
>
> 2. fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M \
>        --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
>    (at least two threads are needed; fio shows ~1800MiB/s buffered
>    write bandwidth)

That should quickly run out of tmpfs memory. I need to find time to
improve this a bit, but this should give you an easier test:
https://github.com/libfuse/libfuse/pull/807

> [1]
> https://lore.kernel.org/all/20231228123528.705-1-lege.wang@xxxxxxxxxxxxxxx/

Thanks,
Bernd