Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF TOPIC] fuse uring communication

On 2/10/23 11:45, Miklos Szeredi wrote:
> On Sun, 5 Feb 2023 at 02:00, Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>> Hello,
>>
>> I've been working for some time on fuse uring-based communication that is
>> NUMA-aware and core-affine.
> 
> I might have mentioned this earlier, but one of the bigger issues with
> NUMA that I found was that having a single process with multiple
> threads serving queues of different NUMA nodes incurs a performance
> hit each time a server thread gets to run. This is due to having to
> update mm->cpu_bitmap, which indicates which CPUs the current
> process is running on.  This bitmap is shared by the address space,
> hence constantly updating it from different nodes means having to move
> it from one node to the other.

For our current usage we have entirely restricted the fuse daemon to run 
on one NUMA node only. Some years ago I had tested clone_fd - for some 
workloads we could see that a single fd ran into spin-lock contention 
and clone_fd solved that, but it didn't solve most of the other 
performance issues.
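
For reference, the pinning itself is done with libnuma at daemon startup; 
a minimal sketch (the node number is just an example and not tied to our 
actual setup, link with -lnuma):

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: restrict the daemon (CPUs and memory allocations) to one
     * NUMA node before any worker/ring threads are started. */
    static void pin_to_node(int node)
    {
            if (numa_available() < 0) {
                    fprintf(stderr, "libnuma not available\n");
                    exit(1);
            }
            if (numa_run_on_node(node) < 0) {       /* CPU affinity */
                    perror("numa_run_on_node");
                    exit(1);
            }
            numa_set_preferred(node);               /* memory locality */
    }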

> 
> My workaround was to use separate processes (address space is not
> shared) but use shared memory for common structures.  This complicates
> things quite a bit, so it would be nice to find some other way of
> fixing this issue.  For example it occurs to me that making this
> bitmap use different cachelines for CPUs that are on different nodes
> might actually help fix the issue.

With my uring approach you get a ring thread per core and basically 
no shared data structures - that should solve the issue? Well, 
'fuse_connection' is still shared, but it then has queues per ring 
(you actually remind me that I need to make the queues cache-line 
aligned, on the kernel and daemon side).

Well, struct fuse_conn holds a 'struct fuse_ring', currently marked with an
/* XXX: Move to struct fuse_dev? */ comment.

There are some design decisions that are certainly debatable and I have 
marked some of these with such XXX comments.
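
Roughly, the layout looks like the sketch below - simplified, and the 
field names are approximate rather than exactly what is in the branch:

    /* Simplified sketch of the fuse-uring branch layout; names are
     * approximate.  One fuse_ring per connection, one queue per core. */
    struct fuse_ring_queue {
            int                     qid;       /* core / numa-local index    */
            struct fuse_ring_ent   *entries;   /* 'queue depth' ring entries */
            /* ... */
    } ____cacheline_aligned_in_smp;            /* the alignment I still owe  */

    struct fuse_ring {
            /* XXX: Move to struct fuse_dev? */
            unsigned int            nr_queues;    /* typically one per core  */
            unsigned int            queue_depth;
            struct fuse_ring_queue *queues;
    };

    struct fuse_conn {
            /* ... existing fields ... */
            struct fuse_ring       *ring;
    };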


> 
>> In the current /dev/fuse based IO model requests are queued on lists
>> that are not core-affine or numa aware. For every request a round trip
>> between userspace and kernel is needed.
>> When we benchmarked our atomic-open patches (also still WIP), initially
>> confusing findings came up [1], which could be tracked down to multiple
>> threads reading from /dev/fuse. After switching to a single thread that
>> reads from /dev/fuse we got consistent and expected results.
>> Later we also figured out that adding a polling spin in fuse_dev_do_read()
>> before going into a waitq sleep when no request is available greatly
>> improved metadata benchmark performance [2].
>>
>> That made us think about the current communication and to look into a
>> ring based queuing model. Around that time IORING_OP_URING_CMD was added
>> to uring and the new userspace block device driver (ublk) is using that
>> command, to send requests from kernel to userspace.
>> I started to look at how ublk works and to adapt a similar model to
>> fuse. The state as of today is that it is basically working, but I'm still
>> fixing issues found by xfstests. Benchmarks and patch cleanup for
>> submission follow next.
>>
>> https://github.com/bsbernd/linux/tree/fuse-uring
>> https://github.com/bsbernd/libfuse/tree/uring
>> (these branches will _not_ be used for upstream submission, these are
>> purely for base development)
>>
>>
>> A fuse design documentation update will also be added with the 1st RFC
>> submission; basic details are as follows:
>>
>> - Initial mount setup goes over /dev/fuse
>> - fuse.ko queues FUSE_INIT in the existing /dev/fuse (background) queue
>> - User space sets up the ring and all queues with a new ioctl
>> - fuse.ko sets up the ring and allocates request queues/request memory
>> per queue/request
>> - Userspace mmaps these buffers and assigns them per queue/request
>> - Data are sent through these mmapped buffers, there is no kmap involved
>> (a difference to ublk)
> 
> How is the queue buffer filled?  Are requests packed or is the queue
> divided into equal parts for each request?

The latter, the queues are divided into equal parts - which gives the ring 
queue depth. I have further divided these with credits into pending and 
background. My reasoning is that background is basically anything 
page-cache related and we do not want to introduce latencies due to a 
queue filled up with background writes and read-ahead.
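
In rough pseudo-C the split looks like below - the constants and names 
are only illustrative, the real values live in the branch:

    /* Illustrative only - not the actual defaults or field names. */
    #define RING_QUEUE_DEPTH   64           /* entries per per-core queue  */
    #define RING_BG_CREDITS    16           /* reserved for background I/O */
    #define RING_FG_CREDITS    (RING_QUEUE_DEPTH - RING_BG_CREDITS)

    /* Each queue owns one contiguous mmap'ed region that is split into
     * equally sized per-request buffers, addressed by request tag: */
    static inline void *ring_req_buf(void *queue_base, size_t req_buf_size,
                                     unsigned int tag)
    {
            return (char *)queue_base + (size_t)tag * req_buf_size;
    }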

> 
> How are replies sent?  Do they use the same buffer?

Queues/requests use a shared memory buffer between kernel and daemon.
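
On the daemon side that is a plain mmap of the per-queue region the 
kernel allocated - something along these lines; the offset scheme is 
made up for illustration, the real one is established through the new 
ioctl:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <stddef.h>

    /* Sketch: map the shared request buffers of one queue into the
     * daemon.  'fuse_fd' is the /dev/fuse fd; the offset calculation
     * below is hypothetical. */
    static void *map_queue_buffers(int fuse_fd, unsigned int qid,
                                   size_t queue_buf_size, off_t base_off)
    {
            void *p = mmap(NULL, queue_buf_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fuse_fd,
                           base_off + (off_t)qid * queue_buf_size);

            return p == MAP_FAILED ? NULL : p;
    }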

> 
>> - Similar to ublk, user space first submits SQEs as
>> FUSE_URING_REQ_FETCH, then later as FUSE_URING_REQ_COMMIT_AND_FETCH -
>> commit results of the current request and fetch the next one.
>> - FUSE_URING_REQ_FETCH also takes the FUSE_INIT request; later these
>> lists are not checked anymore, as nothing is supposed to be on them
> 
> Which list?  If the FUSE_INIT is handled on /dev/fuse why handle it on
> the uring?

struct fuse_iqueue::pending

Yeah, we could leave FUSE_INIT with /dev/fuse IO, but using the ring 
for that is not so much more complicated, and FUSE_INIT is actually a 
nice startup test to check whether the ring basically works.
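
For reference, the daemon-side submission follows the ublk pattern, 
roughly like below (heavily simplified sketch; the cmd_op names are from 
my branch, the payload layout and helper names are just placeholders):

    #include <liburing.h>
    #include <stdbool.h>

    /* Sketch of one FETCH / COMMIT_AND_FETCH submission, modelled on
     * ublksrv.  The io_uring has to be set up with IORING_SETUP_SQE128;
     * FUSE_URING_REQ_* come from the branch's uapi header. */
    static int ring_submit_fetch(struct io_uring *ring, int fuse_fd,
                                 unsigned int tag, bool committing)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -1;

            io_uring_prep_nop(sqe);             /* initialize the sqe fields */
            sqe->opcode    = IORING_OP_URING_CMD;
            sqe->fd        = fuse_fd;           /* the /dev/fuse fd          */
            sqe->cmd_op    = committing ? FUSE_URING_REQ_COMMIT_AND_FETCH
                                        : FUSE_URING_REQ_FETCH;
            sqe->user_data = tag;               /* identifies the ring entry */
            /* queue id / tag / result go into sqe->cmd[] (SQE128 payload)   */

            return io_uring_submit(ring);
    }

The CQE for such an SQE then only comes back once fuse.ko has a request 
for that ring entry.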

> 
>> - The ring currently only handles fuse pending and background
>> requests (with credits assigned)
>> - Forget still requires libfuse to read /dev/fuse (handling will be added
>> to the ring later)
>> - In the WIP state request interrupts are not supported (yet)
>> - Userspace needs to send fuse notifications to /dev/fuse; this needs to be
>> handled by the ring as well (or maybe a separate ring)
>> - My goal was to keep compatibility with existing fuse file systems;
>> except for the so-far-missing interrupt handling, that should work.
> 
> Interrupts and notifications are used by very few fs.  So if it's
> easier, then we could leave one thread to handle legacy /dev/fuse
> requests for anything that's not performance sensitive.

Interrupts maybe, but our product that is currently in active 
development has a DLM and will be a heavy user of notifications. So 
easier, yes, but that would be a mismatch with our needs.


I'm still in the process of fixing issues I had overlooked; I hope to get 
that done today, so that I can work on clean patches for upstream, which 
will also explain things in commit messages and an updated 
Documentation/filesystems/fuse.rst. I really hope to have the first 
patches ready next week.


Thanks,
Bernd





