Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree

Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> · Mon, 11 Nov 2024 16:32:20 +0800

Hi, Joanne and Miklos,

On 11/8/24 7:56 AM, Joanne Koong wrote:
> Currently, we allocate and copy data to a temporary folio when
> handling writeback in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback to complete:
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> 
> To work around this, we allocate a temporary folio and copy over the
> original folio to the temporary folio so that writeback can be
> immediately cleared on the original folio. This additionally requires us
> to maintain an internal rb tree to keep track of writeback state on the
> temporary folios.
> 
> A recent change prevents reclaim logic from waiting on writeback for
> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> will prevent FUSE folios from running into the reclaim deadlock described
> above) and removes the temporary folio + extra copying and the internal
> rb tree.
> 
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
> 
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> 
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> 
>         bs =  1k          4k            1M
> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> % diff        -3%          23%         45%
> 
> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>

IIUC, this patch seems to reintroduce the problem fixed by commit
8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
page rewrites").

> -	/*
> -	 * Being under writeback is unlikely but possible.  For example direct
> -	 * read to an mmaped fuse file will set the page dirty twice; once when
> -	 * the pages are faulted with get_user_pages(), and then after the read
> -	 * completed.
> -	 */

In short, the scenario that commit targets looks like this:

```
# open a fuse file and mmap
fd1 = open("fuse-file-path", ...)
uaddr = mmap(fd1, ...)

# O_DIRECT read from another file into the mmapped fuse buffer
fd2 = open("ext4-file-path", O_DIRECT, ...)
read(fd2, uaddr, ...)
    # get_user_pages() on uaddr triggers fault-in
    # a_ops->dirty_folio() <--- marks PG_dirty (first time)

    # when the DIRECT IO completes:
    # a_ops->dirty_folio() <--- marks PG_dirty again
```

The auxiliary write request list was introduced to fix this.
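
For anyone who wants to poke at this from userspace, the sequence above
could be sketched roughly as follows. This is only an illustration, not a
tested reproducer: the helper name direct_read_into_mmap() and both paths
are made up, the fuse file is assumed to already be at least `len` bytes
long, and whether the second dirtying actually lands on a folio that is
still under writeback depends on timing.

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: O_DIRECT-read `len` bytes from src_path straight
 * into a shared mapping of fuse_path.  Returns the number of bytes read,
 * or -1 on error.  `len` should be a multiple of the source's block size.
 */
ssize_t direct_read_into_mmap(const char *fuse_path, const char *src_path,
                              size_t len)
{
    ssize_t ret = -1;
    void *uaddr;
    int fd1, fd2;

    /* open the fuse file and map it shared, so stores dirty its folios */
    fd1 = open(fuse_path, O_RDWR);
    if (fd1 < 0)
        return -1;

    uaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
    if (uaddr == MAP_FAILED)
        goto out_close;

    /* mmap() returns page-aligned addresses, which O_DIRECT relies on */
    assert((uintptr_t)uaddr % (uintptr_t)sysconf(_SC_PAGESIZE) == 0);

    /* O_DIRECT read from the other file, using the mapping as the buffer */
    fd2 = open(src_path, O_RDONLY | O_DIRECT);
    if (fd2 < 0)
        goto out_unmap;

    /* the kernel pins the mapped pages here and dirties them on completion */
    ret = read(fd2, uaddr, len);
    close(fd2);

out_unmap:
    munmap(uaddr, len);
out_close:
    close(fd1);
    return ret;
}
```

Calling it in a loop, e.g. direct_read_into_mmap("fuse-file-path",
"ext4-file-path", 4096), while something else forces writeback on the fuse
file should make the window easier to hit.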

I'm not sure whether there is an alternative to the auxiliary list for
fixing this. For example, could a_ops->dirty_folio() call
folio_wait_writeback(), so that a folio is never dirtied again while
its writeback has not completed yet?

-- 
Thanks,
Jingbo