Hi Miklos,

We spotted a performance bottleneck in FUSE writeback: the writeback kworker consumes nearly 100% CPU, of which ~40% is spent in copy_page().

fuse_writepages_fill
    alloc tmp_page
    copy_highpage

This is due to the FUSE writeback design (see commit 3be5a52b30aa ("fuse: support writable mmap")), which allocates a temp page for each dirty page to be written back, copies the content of the dirty page into the temp page, and then writes back the temp page instead. This design is intentional, to avoid potential deadlocks caused by a buggy or even malicious FUSE user daemon.

There was a proposal to remove this constraint for virtiofs [1], which is reasonable since users of virtiofs and the virtiofs daemon don't run on the same OS, and the virtiofs daemon is usually offered by cloud vendors, which should not be malicious. For the normal /dev/fuse interface, however, I don't think removing the constraint is acceptable.

Coming back to the writeback performance bottleneck: another important factor is that (IIUC) only one kworker at a time is allowed to do writeback for each filesystem instance (if cgroup writeback is not enabled). The kworker is scheduled upon sb->s_bdi->wb.dwork, and the workqueue infrastructure guarantees that at most one *running* worker is allowed for one specific work item (sb->s_bdi->wb.dwork) at any time. Thus writeback is constrained to one CPU for each filesystem instance.

I'm not sure whether offloading the page copying, and then the sending of the FUSE requests, to another worker (once a batch of dirty pages has been collected) is a good idea or not, e.g.:

```
fuse_writepages_fill
    if fuse_writepage_need_send:
        # schedule a work

# the worker
for each dirty page in ap->pages[]:
    copy_page
fuse_writepages_send
```

Any suggestion?

This issue can be reproduced by:

1. ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o source=/run/ /mnt
   ("/run/" is a tmpfs mount)

2. fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
   (at least two threads are needed; fio shows ~1800MiB/s buffered write bandwidth)

[1] https://lore.kernel.org/all/20231228123528.705-1-lege.wang@xxxxxxxxxxxxxxx/

--
Thanks,
Jingbo
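
P.S. To make the offload idea above a bit more concrete, here is a rough, untested sketch. The fuse_wb_offload naming is hypothetical, and the fuse_fill_wb_data / fuse_writepages_send touchpoints are simplified from fs/fuse/file.c; this is only meant to illustrate the shape of the change, not an actual patch:

```c
/*
 * Hypothetical sketch only: the flusher kworker just collects the
 * batch of dirty pages; the copy_highpage() loop and the request
 * submission are pushed to system_unbound_wq so they can run on
 * another CPU.
 */
struct fuse_wb_offload {
	struct work_struct work;
	struct fuse_fill_wb_data *data;		/* the collected batch */
};

static void fuse_wb_offload_fn(struct work_struct *work)
{
	struct fuse_wb_offload *off =
		container_of(work, struct fuse_wb_offload, work);
	struct fuse_args_pages *ap = &off->data->wpa->ia.ap;
	unsigned int i;

	/* the page copying now runs off the flusher kworker's CPU */
	for (i = 0; i < ap->num_pages; i++)
		copy_highpage(ap->pages[i], off->data->orig_pages[i]);

	/* then send the request, as fuse_writepages_send() does today */
	fuse_writepages_send(off->data);
	kfree(off);
}

/* called from fuse_writepages_fill() once fuse_writepage_need_send() */
static int fuse_wb_offload_queue(struct fuse_fill_wb_data *data)
{
	struct fuse_wb_offload *off = kmalloc(sizeof(*off), GFP_NOFS);

	if (!off)
		return -ENOMEM;	/* caller falls back to the inline copy */
	off->data = data;
	INIT_WORK(&off->work, fuse_wb_offload_fn);
	queue_work(system_unbound_wq, &off->work);
	return 0;
}
```

One caveat I can already see with such a scheme: today the original dirty page can be released as soon as its temp page is populated, so with the copy deferred, the original pages would have to stay under writeback until the offloaded copy completes, and the existing per-inode ordering of queued write requests would still need to be preserved.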