Hi, Joanne and Miklos,

On 11/8/24 7:56 AM, Joanne Koong wrote:
> Currently, we allocate and copy data to a temporary folio when
> handling writeback in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback to complete:
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
>
> To work around this, we allocate a temporary folio and copy over the
> original folio to the temporary folio so that writeback can be
> immediately cleared on the original folio. This additionally requires us
> to maintain an internal rb tree to keep track of writeback state on the
> temporary folios.
>
> A recent change prevents reclaim logic from waiting on writeback for
> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> will prevent FUSE folios from running into the reclaim deadlock described
> above) and removes the temporary folio + extra copying and the internal
> rb tree.
>
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
>
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>
>         bs =  1k          4k            1M
> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> % diff        -3%          23%         45%
>
> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>

IIUC this patch seems to break commit
8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
page rewrites").

> -	/*
> -	 * Being under writeback is unlikely but possible.  For example direct
> -	 * read to an mmaped fuse file will set the page dirty twice; once when
> -	 * the pages are faulted with get_user_pages(), and then after the read
> -	 * completed.
> -	 */

In short, the target scenario is like:

```
# open a fuse file and mmap
fd1 = open("fuse-file-path", ...)
uaddr = mmap(fd1, ...)

# DIRECT read to the mmaped fuse file
fd2 = open("ext4-file-path", O_DIRECT, ...)
read(fd2, uaddr, ...)
    # get_user_pages() of uaddr, and triggers faultin
    # a_ops->dirty_folio() <--- mark PG_dirty

    # when DIRECT IO completed:
    # a_ops->dirty_folio() <--- mark PG_dirty
```

The auxiliary write request list was introduced to fix this.

I'm not sure if there's an alternative other than the auxiliary list to
fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
that the same folio won't get dirtied when the writeback has not
completed yet?

--
Thanks,
Jingbo
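
For anyone who wants to exercise the I/O pattern from userspace, the
scenario above could be sketched roughly as below. This is only a minimal
sketch, not a verified reproducer: the helper name read_into_mapping(),
its parameters and the paths in the usage note are made up for
illustration, and it assumes Linux, a mapped file at least `length` bytes
long, and a `length` that satisfies the O_DIRECT alignment rules of the
source filesystem. It only issues the mmap + O_DIRECT read; whether the
folio is actually under writeback at the second dirtying depends on
timing.

```python
import mmap
import os


def read_into_mapping(mapped_path, source_path, length, direct=True):
    """Read `length` bytes from source_path straight into a shared mapping
    of mapped_path. Hypothetical helper, for illustration only."""
    flags = os.O_RDONLY
    if direct:
        # O_DIRECT is Linux-specific; length and file offset must be
        # aligned as required by the source filesystem.
        flags |= os.O_DIRECT

    fd1 = os.open(mapped_path, os.O_RDWR)
    try:
        # Shared, writable mapping of the (FUSE) file. The file must already
        # be at least `length` bytes long. mmap memory is page-aligned.
        with mmap.mmap(fd1, length, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE) as uaddr:
            fd2 = os.open(source_path, flags)
            try:
                # readv() fills the mapping in place, so the kernel has to
                # fault in and pin the mapped pages for the read.
                nread = os.readv(fd2, [uaddr])
            finally:
                os.close(fd2)
            # msync the mapping so the dirty pages are written back.
            uaddr.flush()
            return nread
    finally:
        os.close(fd1)
```

With the mail's placeholder names, a call would look like
read_into_mapping("fuse-file-path", "ext4-file-path", 4096). Passing
direct=False performs the same copy through the page cache, which is
handy for checking the plumbing on filesystems that reject O_DIRECT.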