On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
>
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
>
> fuse_writepages_fill
>     alloc tmp_page
>     copy_highpage
>
> This is because of the FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which newly allocates a temp page for
> each dirty page to be written back, copies the content of the dirty
> page to the temp page, and then writes back the temp page instead.
> This special design is intentional, to avoid potential deadlocks due
> to a buggy or even malicious fuse user daemon.

I also noticed that, and I admit that I don't understand it yet. The
commit says

<quote>
The basic problem is that there can be no guarantee about the time in
which the userspace filesystem will complete a write. It may be buggy
or even malicious, and fail to complete WRITE requests. We don't want
unrelated parts of the system to grind to a halt in such cases.
</quote>

Timing - NFS/cifs/etc have the same issue? Even a local file system has
no guarantees how fast storage is?

Buggy - hmm, yeah, but then is it splice related only? I think the
splice feature was not introduced yet when fuse got mmap and writeback
in 2008? Without splice the pages are just copied into a userspace
buffer, so what can userspace do wrong with its copy?

Failure - why can't it do what nfs_mapping_set_error() does?

I guess I miss something, but so far I don't understand what that is.

> There was a proposal to remove this constraint for virtiofs [1], which
> is reasonable as users of virtiofs and the virtiofs daemon don't run
> on the same OS, and the virtiofs daemon is usually offered by cloud
> vendors that shall not be malicious. For the normal /dev/fuse
> interface, though, I don't think removing the constraint is
> acceptable.
>
> Come back to the writeback performance bottleneck.
> Another important factor is that (IIUC) only one kworker at a time is
> allowed to do writeback for each filesystem instance (if cgroup
> writeback is not enabled). The kworker is scheduled upon
> sb->s_bdi->wb.dwork, and the workqueue infrastructure guarantees that
> at most one *running* worker is allowed for one specific work
> (sb->s_bdi->wb.dwork) at any time. Thus the writeback is constrained
> to one CPU for each filesystem instance.
>
> I'm not sure if offloading the page copying and then the FUSE request
> sending to another worker (once a bunch of dirty pages have been
> collected) is a good idea or not, e.g.
>
> ```
> fuse_writepages_fill
>     if fuse_writepage_need_send:
>         # schedule a work
>
> # the worker
> for each dirty page in ap->pages[]:
>     copy_page
> fuse_writepages_send
> ```
>
> Any suggestion?
>
> This issue can be reproduced by:
>
> 1. ./libfuse/build/example/passthrough_ll -o cache=always -o writeback \
>        -o source=/run/ /mnt
>    ("/run/" is a tmpfs mount)
>
> 2. fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M \
>        --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
>    (at least two threads are needed; fio shows ~1800MiB/s buffered
>    write bandwidth)

That should quickly run out of tmpfs memory. I need to find time to
improve this a bit, but this should give you an easier test:
https://github.com/libfuse/libfuse/pull/807

> [1]
> https://lore.kernel.org/all/20231228123528.705-1-lege.wang@xxxxxxxxxxxxxxx/

Thanks,
Bernd