On 5/30/24 09:07, Amir Goldstein wrote:
> On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>> From: Bernd Schubert <bschubert@xxxxxxx>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using the opcode IORING_OP_URING_CMD. The basic
>> approach was taken from ublk. The patches are in RFC state,
>> some major changes are still to be expected.
>>
>> Motivation for these patches is to increase fuse performance.
>> With fuse-over-io-uring, requests avoid core switching (application
>> on core X, processing by the fuse server on a random core Y) and use
>> shared memory between kernel and userspace to transfer data.
>> Similar approaches have been taken by ZUFS and FUSE2, though
>> not over io-uring, but through ioctl IOs:
>>
>> https://lwn.net/Articles/756625/
>> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
>>
>> Avoiding cache line bouncing on NUMA systems had been discussed
>> between Amir and Miklos before, and Miklos posted part of that
>> private discussion here:
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
>>
>> I had also noticed waitq wake-up latencies in fuse before:
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/
>>
>> That spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server-side thread/core
>> utilization, spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring, requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and the
>> performance improvements are achieved without spinning.
>>
>> Splice/zero-copy is not supported yet. Ming Lei is working on
>> io-uring support for ublk_drv, but I think there is no final
>> agreement on the approach to be taken yet. Fuse-over-io-uring runs
>> significantly faster than reads/writes over /dev/fuse, even with
>> splice enabled, so missing zc should not be a blocking issue.
>>
>> The patches have been tested with multiple xfstest runs in a VM
>> (32 cores) with a kernel that has several debug options
>> enabled (like KASAN and MSAN).
>> For some tests xfstests reports that O_DIRECT is not supported;
>> I need to investigate that. The interesting part is that exactly
>> these tests also fail in plain /dev/fuse posix mode. I had to
>> disable generic/650, which is enabling/disabling cpu cores - given
>> that ring threads are bound to cores, issues with that are not
>> totally unexpected, but there are also (scheduler) kernel messages
>> that the core binding for these threads gets removed - this needs
>> to be investigated further.
>> A nice effect in io-uring mode is that tests run faster (e.g.
>> generic/522: ~2400s with /dev/fuse vs. ~1600s patched), though
>> still slow as this is with ASAN/leak-detection/etc.
>>
>> The corresponding libfuse patches are on my uring branch, but need
>> cleanup for submission - that will happen during the next days.
>> https://github.com/bsbernd/libfuse/tree/uring
>>
>> If it makes review easier, the patches posted here are also on
>> this branch:
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2
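Side note, since not everyone will have looked at the patches yet: the
daemon side of the communication is a plain IORING_OP_URING_CMD, very
similar to what ublk servers do. Roughly, per ring entry, the server
queues a command like in the sketch below and the kernel completes it
once a fuse request has been placed into that entry's slot of the
mmaped queue buffer. All constant/struct names in the sketch are made
up for illustration and do not match the actual patch UAPI, and the
setup (ioctl + mmap of the queue buffer) is omitted:

    #include <errno.h>
    #include <linux/types.h>
    #include <liburing.h>
    #include <string.h>

    #define FUSE_URING_REQ_FETCH   1        /* placeholder cmd_op value */

    struct fuse_ring_sqe_cmd {              /* placeholder payload layout */
            __u32 qid;                      /* queue index, one queue per core */
            __u32 tag;                      /* ring entry within the queue */
            __u64 flags;
    };

    /*
     * Queue one "fetch the next fuse request" command for a ring entry.
     * The ring has to be created with IORING_SETUP_SQE128 so that the
     * payload fits into the SQE.
     */
    static int fuse_ring_fetch(struct io_uring *ring, int fuse_dev_fd,
                               unsigned int qid, unsigned int tag)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct fuse_ring_sqe_cmd *cmd;

            if (!sqe)
                    return -EAGAIN;

            memset(sqe, 0, 2 * sizeof(*sqe));       /* 128-byte SQE */
            sqe->opcode = IORING_OP_URING_CMD;
            sqe->fd = fuse_dev_fd;                  /* fd of the fuse device */
            sqe->cmd_op = FUSE_URING_REQ_FETCH;
            sqe->user_data = ((__u64)qid << 32) | tag;

            cmd = (struct fuse_ring_sqe_cmd *)sqe->cmd;     /* inline payload */
            cmd->qid = qid;
            cmd->tag = tag;
            cmd->flags = 0;

            return 0;
    }

The CQE then tells the server that a request is waiting in the shared
buffer; sending the result back to the kernel goes through the same
uring_cmd mechanism.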
>>
>> TODO list for next RFC versions
>> - Let the ring configure ioctl return information, like
>>   mmap/queue-buf size
>> - Request kernel side address and len for a request - avoid
>>   calculation in userspace?
>> - multiple IO sizes per queue (avoiding a calculation in userspace
>>   is probably even more important)
>> - FUSE_INTERRUPT handling?
>> - Logging (adds fields in the ioctl and also ring-request), any
>>   mismatch between client and server is currently very hard to
>>   understand through error codes
>>
>> Future work
>> - notifications, probably on their own ring
>> - zero copy
>>
>> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
>> which resulted in some tuning patches (at the end of the patch
>> series).
>>
>> Some benchmark results
>> ======================
>>
>> The system used for the benchmarks is a 32 core (HyperThreading
>> enabled) Xeon E5-2650 system. I don't have local disks attached
>> that could do >5GB/s IOs, so for the paged and dio results a
>> patched version of passthrough-hp was used that bypasses the final
>> reads/writes.
>>
>> paged reads
>> -----------
>>              128K IO size               1024K IO size
>> jobs   /dev/fuse  uring  gain    /dev/fuse  uring  gain
>>    1        1117   1921  1.72         1902   1942  1.02
>>    2        2502   3527  1.41         3066   3260  1.06
>>    4        5052   6125  1.21         5994   6097  1.02
>>    8        6273  10855  1.73         7101  10491  1.48
>>   16        6373  11320  1.78         7660  11419  1.49
>>   24        6111   9015  1.48         7600   9029  1.19
>>   32        5725   7968  1.39         6986   7961  1.14
>>
>> dio reads (1024K)
>> -----------------
>> jobs   /dev/fuse  uring  gain
>>    1        2023   3998  2.42
>>    2        3375   7950  2.83
>>    4        3823  15022  3.58
>>    8        7796  22591  2.77
>>   16        8520  27864  3.27
>>   24        8361  20617  2.55
>>   32        8717  12971  1.55
>>
>> mmap reads (4K)
>> ---------------
>> (sequential; I probably should have made it random - sequential
>> exposes a rather interesting/weird 'optimized' memcpy issue,
>> sequential becomes a reversed-order 4K read)
>> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@xxxxxxxxxxx/
>>
>> jobs   /dev/fuse  uring  gain
>>    1         130    323  2.49
>>    2         219    538  2.46
>>    4         503   1040  2.07
>>    8        1472   2039  1.38
>>   16        2191   3518  1.61
>>   24        2453   4561  1.86
>>   32        2178   5628  2.58
>>
>> (Results on request; setting MAP_HUGETLB much improves performance
>> for both, io-uring mode then has only a slight advantage.)
>>
>> creates/s
>> ---------
>> threads   /dev/fuse   uring   gain
>>       1        3944   10121   2.57
>>       2        8580   24524   2.86
>>       4       16628   44426   2.67
>>       8       46746   56716   1.21
>>      16       79740  102966   1.29
>>      20       80284  119502   1.49
>>
>> (the gain drop with >=8 cores needs to be investigated)

Hi Amir,

> Hi Bernd,
>
> Those are impressive results!

Thank you!

> When approaching the FUSE uring feature from marketing POV,
> I think that putting the emphasis on metadata operations is the
> best approach.

I can add in some more results and probably need to redo at least the
metadata tests. I have all the results in google docs and in plain
text files, it is just a bit cumbersome (and maybe also spam) to post
all of it here.

> Not that dio reads are not important (I know that is part of your
> use case), but I imagine there are a lot more people out there
> waiting for improvement in metadata operation overhead.

I think the DIO use case is declining. My fuse work is now related to
the DDN Infina project, which has a DLM - this will all go via cache
and notifications (from/to client/server). I need to start to work on
that asap...
I'm also not too happy yet about cached writes/reads - I need to find
time to investigate where the limit is.
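Btw, one clarification regarding the tables above: "bypasses the final
reads/writes" in the patched passthrough-hp just means that the
read/write callbacks acknowledge the request without touching the
backing file, so that the numbers show fuse transport overhead and not
the disk. In lowlevel-API terms the write side is roughly the
following (illustration only, not the actual benchmark patch; the read
side would analogously reply from a prepared buffer instead of doing
the pread):

    #define FUSE_USE_VERSION 34
    #include <fuse_lowlevel.h>

    /* Illustration only: complete the write without the backend
     * pwrite(), so that only the fuse request path is measured. */
    static void bench_write_buf(fuse_req_t req, fuse_ino_t ino,
                                struct fuse_bufvec *bufv, off_t off,
                                struct fuse_file_info *fi)
    {
            (void)ino; (void)off; (void)fi;

            /* Pretend all bytes were written. */
            fuse_reply_write(req, fuse_buf_size(bufv));
    }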
> To me it helps to know what the current main pain points are
> for people using FUSE filesystems wrt performance.
>
> Although it may not be up to date, the most comprehensive
> study about FUSE performance overhead is this FAST17 paper:
>
> https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

Yeah, I had seen it. Checking it again, what is actually interesting
is their instrumentation branch:
https://github.com/sbu-fsl/fuse-kernel-instrumentation
This should be very useful upstream, in combination with Josef's fuse
tracepoints (btw, thanks for the tracepoint patch, Josef! I'm going to
look at it and test it tomorrow).

> In this paper, table 3 summarizes the different overheads observed
> per workload. According to this table, the workloads that degrade
> performance worst on an optimized passthrough fs over SSD are:
> - many file creates
> - many file deletes
> - many small file reads
> In all these workloads, it was millions of files over many
> directories.
> The highest performance regression reported was -83% on many
> small file creations.
>
> The moral of this long story is that it would be nice to know
> what performance improvement FUSE uring can aspire to.
> This is especially relevant for people that would be interested
> in combining the benefits of FUSE passthrough (for data) and
> FUSE uring (for metadata).

As written above, I can add a few more data points. But if possible I
wouldn't like to concentrate on benchmarking - that can be super time
consuming and doesn't help unless one investigates what is actually
limiting performance. Right now we see that io-uring helps; fixing the
other limits is then the next step, imho.

> What did passthrough_hp do in your patched version with creates?
> Did it actually create the files?

Yeah, it creates the files, I think on xfs (or ext4). I had tried
tmpfs first, but it had issues with seekdir/telldir until recently -
I will switch back to tmpfs for the next tests.

> In how many directories?
> Maybe the directory inode lock impeded performance improvement
> with >=8 threads?

I don't think the directory inode lock is an issue - this should be
one (or more) directories per thread. Basically:

/usr/lib64/openmpi/bin/mpirun \
    --mca btl self -n $i --oversubscribe \
    ./mdtest -F -n40000 -i1 \
    -d /scratch/dest -u -b2 | tee ${fname}-$i.out

(mdtest is really convenient for metadata operations, although it
requires MPI; recent versions are here (the initial LLNL project
merged with ior):
https://github.com/hpc/ior

  -F                  Perform test on files only (no directories).
  -n number_of_items  Every process will creat/stat/remove #
                      directories and files
  -i iterations       The number of iterations the test will run
  -u                  Create a unique working directory for each task
  -b branching_factor The branching factor of the hierarchical
                      directory structure [default: 1].

The older LLNL repo has a better mdtest README:
https://github.com/LLNL/mdtest)

Also, regarding metadata, I definitely need to find time to resume
work on atomic-open. Besides performance, there is another use case:
https://github.com/libfuse/libfuse/issues/945
Sweet Tea Dorminy / Josef also seem to need that.
>
>>
>> Remaining TODO list for RFCv3:
>> ------------------------------
>> 1) Let the ring configure ioctl return information,
>>    like mmap/queue-buf size
>>
>>    Right now libfuse and kernel have lots of duplicated setup code
>>    and any kind of pointer/offset mismatch results in a non-working
>>    ring that is hard to debug - probably better when the kernel
>>    does the calculations and returns the result to the server side.
>>
>> 2) In combination with 1, ring requests should retrieve their
>>    userspace address and length from the kernel side instead of
>>    calculating it through the mmaped queue buffer on their own.
>>    (Introduction of FUSE_URING_BUF_ADDR_FETCH)
>>
>> 3) Add a log buffer into the ioctl and ring-request
>>
>>    This is to provide better error messages (instead of just
>>    errno).
>>
>> 4) Multiple IO sizes per queue
>>
>>    Small IOs and metadata requests do not need large buffer sizes,
>>    we need multiple IO sizes per queue.
>>
>> 5) FUSE_INTERRUPT handling
>>
>>    These are not handled yet; the kernel side is probably not
>>    difficult anymore as ring entries take fuse requests through
>>    lists.
>>
>> Long term TODO:
>> ---------------
>> Notifications through io-uring, maybe with a separate ring,
>> but I'm not sure yet.
>
> Is that going to improve performance in any real-life workload?
>

I'm rather sure that we at DDN will need it for our project with the
DLM. I have other priorities for now - once it comes up, adding
notifications over uring shouldn't be difficult.


Thanks,
Bernd