On 3/28/24 17:46, Sweet Tea Dorminy wrote:
>
>
> On 3/7/24 17:06, Bernd Schubert wrote:
>> Hi Jingbo,
>>
>> On 3/7/24 03:16, Jingbo Xu wrote:
>>> Hi Bernd,
>>>
>>> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi Miklos,
>>>>>>>
>>>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>>>
>>>>>>>>>>> Increase the FUSE_MAX_MAX_PAGES limit, so that the maximum data size
>>>>>>>>>>> of a single request is increased.
>>>>>>>>>>
>>>>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>>>>> This needs to be thought through, since we are increasing the
>>>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>>>
>>>>>>>> Apart from the request size, the maximum number of background requests,
>>>>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>>>>> daemon), also limits the amount of memory that an unprivileged user
>>>>>>>> can pin. But yes, it indeed increases that amount proportionally by
>>>>>>>> increasing the maximum request size.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This optimizes write performance, especially when the optimal IO size
>>>>>>>>>>> of the backend store at the fuse daemon side is greater than the
>>>>>>>>>>> original maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES
>>>>>>>>>>> and 4096 PAGE_SIZE).
>>>>>>>>>>>
>>>>>>>>>>> Note that this only increases the upper limit of the maximum request
>>>>>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>>>>>>>> ---
>>>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>>>>>>>>> and saw a ~20% performance boost.
>>>>>>>>>>
>>>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>>>
>>>>>>>>> Yeah, I guess so.
>>>>>>>>>
>>>>>>>>>> It would be interesting to
>>>>>>>>>> see how the number of pages per request affects performance and why.
>>>>>>>>>
>>>>>>>>> To be honest, I'm not sure of the root cause of the performance boost
>>>>>>>>> in Bytedance's case.
>>>>>>>>>
>>>>>>>>> In our internal use scenario, the optimal IO size of the backend
>>>>>>>>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>>>>>>>>> throughput cannot be achieved with the current 256 pages per request.
>>>>>>>>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>>>>>>>>> optimal performance when the data is aligned at a 4MB boundary. I can
>>>>>>>>> ask my colleague who implements the fuse server to give more
>>>>>>>>> background info and the exact performance statistics.
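
(A quick aside on the accounting worry quoted above: the ceiling being
discussed is roughly max_background in-flight requests times the
per-request maximum. Below is a tiny illustrative calculation, assuming
the default max_background of 12 and 4KB pages mentioned in the thread;
it does not model the kernel's real accounting, it only shows the
proportional growth.)

/* Back-of-the-envelope sketch of the pinned-memory ceiling discussed
 * above: max_background in-flight requests, each up to max_pages pages.
 * Numbers are taken from the thread; this is illustrative only. */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;
	const unsigned long max_background = 12;   /* default */
	const unsigned long max_pages_old = 256;   /* current FUSE_MAX_MAX_PAGES */
	const unsigned long max_pages_new = 1024;  /* proposed limit */

	printf("per-request max: %lu KB -> %lu KB\n",
	       max_pages_old * page_size / 1024,
	       max_pages_new * page_size / 1024);
	printf("background ceiling: %lu MB -> %lu MB\n",
	       max_background * max_pages_old * page_size / (1024 * 1024),
	       max_background * max_pages_new * page_size / (1024 * 1024));
	return 0;
}

(With those defaults, the worst-case background ceiling would grow from
about 12MB to about 48MB per connection.)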
>>>>>>>>
>>>>>>>> Here are more details about our internal use case:
>>>>>>>>
>>>>>>>> We have a fuse server used in our internal cloud scenarios, and the
>>>>>>>> backend store is actually a distributed filesystem. That is, the fuse
>>>>>>>> server actually acts as the client of the remote distributed
>>>>>>>> filesystem. The fuse server forwards the fuse requests to the remote
>>>>>>>> backing store over the network, while the remote distributed
>>>>>>>> filesystem handles the IO requests, e.g. processes the data from/to
>>>>>>>> the persistent store.
>>>>>>>>
>>>>>>>> Then come the details of how the remote distributed filesystem
>>>>>>>> processes the requested data against the persistent store.
>>>>>>>>
>>>>>>>> [1] The remote distributed filesystem uses, e.g., an 8+3 mode EC
>>>>>>>> (erasure code), where each fixed-size piece of user data is split and
>>>>>>>> stored as 8 data blocks plus 3 extra parity blocks. For example, with
>>>>>>>> a 512-byte block size, each 4MB of user data is split and stored as
>>>>>>>> 8 (512-byte) data blocks plus 3 (512-byte) parity blocks.
>>>>>>>>
>>>>>>>> It also utilizes striping to boost performance: for example, there
>>>>>>>> are 8 data disks and 3 parity disks in the above 8+3 mode, and each
>>>>>>>> stripe consists of 8 data blocks and 3 parity blocks.
>>>>>>>>
>>>>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>>>>> request is received. Because of the EC described above, when the
>>>>>>>> write fuse request is not aligned on a 4MB (stripe size) boundary,
>>>>>>>> say it's 1MB in size, the other 3MB is read from the persistent store
>>>>>>>> first, then the 3 extra parity blocks are computed over the complete
>>>>>>>> 4MB stripe, and finally the 8 data blocks and 3 parity blocks are
>>>>>>>> written down.
>>>>>>>>
>>>>>>>> Thus the write amplification is non-negligible and is the performance
>>>>>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>>>>>
>>>>>>>> Here are some simple performance statistics with varying request
>>>>>>>> size. With a 4MB stripe size, there's ~3x bandwidth improvement when
>>>>>>>> the maximum request size is increased from 256KB to 3.9MB, and
>>>>>>>> another ~20% improvement when the request size is increased from
>>>>>>>> 3.9MB to 4MB.
>>>>>>
>>>>>> I sort of understand the issue, although my guess is that this could
>>>>>> be worked around in the client by coalescing writes. This could be
>>>>>> done by adding a small delay before sending a write request off to
>>>>>> the network.
>>>>>>
>>>>>> Would that work in your case?
>>>>>
>>>>> It's possible but I'm not sure. I've asked my colleagues who work on
>>>>> the fuse server and the backend store, though they have not replied
>>>>> yet. But I guess it's not as simple as directly increasing the maximum
>>>>> FUSE request size, and thus more complexity gets involved.
>>>>>
>>>>> I can also understand the concern that this may increase the risk of
>>>>> pinning a larger memory footprint, and a more generic usage scenario
>>>>> needs to be considered. I can make it a private patch for our internal
>>>>> product.
>>>>>
>>>>> Thanks for the suggestions and discussion.
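
(For a sense of scale of the read-modify-write penalty described in [2]
above, here is a purely illustrative calculation assuming the 4MB stripe
and 8+3 layout from that description; the real backend's bookkeeping may
of course differ.)

/* Illustrative only: with an 8+3 EC layout and a 4MB stripe, parity adds
 * 3/8 of the stripe, and any write smaller than the stripe has to read
 * the rest of the stripe before the parity can be recomputed. */
#include <stdio.h>

int main(void)
{
	const double stripe_mb = 4.0;                    /* 8 data blocks */
	const double parity_mb = stripe_mb * 3.0 / 8.0;  /* 3 parity blocks */
	const double writes[] = { 1.0, 4.0 };            /* unaligned vs aligned */

	for (int i = 0; i < 2; i++) {
		double user = writes[i];
		double read_extra = stripe_mb - user;    /* rest of the stripe */
		double written = stripe_mb + parity_mb;  /* full stripe + parity */

		printf("%.0fMB user write: read %.1fMB, write %.1fMB (%.2fx amplification)\n",
		       user, read_extra, written, written / user);
	}
	return 0;
}

(Under those assumptions a 1MB unaligned write costs a 3MB read plus 5.5MB
written, ~5.5x amplification, while a stripe-aligned 4MB write only pays
the 3/8 parity overhead, ~1.4x.)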
>>>>
>>>> It also gets kind of solved in my fuse-over-io-uring branch - as long
>>>> as there are enough free ring entries. I'm going to add a flag there
>>>> indicating that other CQEs might be follow-up requests. It's really
>>>> time to post a new version.
>>>
>>> Thanks for the information. I've not read the fuse-over-io-uring branch
>>> yet, but it sounds like it would be very helpful. Would there be a flag
>>> in the FUSE request indicating that it's one of the linked FUSE
>>> requests? Is this feature, i.e. linked FUSE requests, enabled only when
>>> io-uring is used underneath FUSE?
>>
>>
>> The current development branch is
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8
>> (It sometimes gets rebases/force pushes and incompatible changes - the
>> corresponding libfuse branch is also continuously updated.)
>>
>> Patches need cleanup before I can send the next RFC version. And I first
>> want to change the fixed single request size (it's not so nice to use
>> 1MB requests when 4K would be sufficient, for things like metadata and
>> small IO).
>>
>
> Let me know if there's something you'd like collaboration on --
> fuse_iouring sounds very exciting and I'd love to help out in any way
> that would be useful.

With pleasure, I'll take whatever help you offer. Right now I'm jumping
between different projects and I'm not too happy that I still haven't sent
out a new patch version yet. (And the atomic-open branch also needs
updates.)

>
> For our internal use case at Meta, the relevant backend store operates on
> 8M chunks, so I'm also very interested in the simplicity of just opting
> in to receiving 8M IOs from the kernel instead of needing to buffer our
> own 8MB IOs. But io_uring does seem like a plausible general-purpose
> improvement too, so either or both of these paths is interesting, and I'm
> working on gathering performance numbers on the relative merits.

Merging requests requires a bit of scanning through the CQEs on the
userspace side; they all arrive in random order. I haven't even tried to
merge requests yet; I have just seen while debugging that the ring queue
gets filled with requests that belong together.

Out of interest, are you using libfuse or your own kernel interface
library? I would be quite interested to know whether the fuse-uring
kernel/userspace interface, and then the libfuse interface, matches your
needs. For example, our next-gen DDN file system runs in an spdk reactor
context and I had to update our own code base and libfuse to support ring
polling. So that's one project outside of libfuse example/ that already
needed some changes... Another change I haven't implemented yet in libfuse
is ring request buffer registration with the file system (for network
rdma).

Btw, I just ran into a bug that came up with FUSE_CAP_WRITEBACK_CACHE - I
definitely don't claim that all code paths are perfectly tested already
(it's fixed now in the fuse-uring-for-6.8 branch).

Thanks,
Bernd
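
P.S. In case a concrete strawman helps the merging discussion above: below
is a very rough sketch of how userspace could batch-peek CQEs with liburing
and group contiguous writes. The struct ring_ent layout and the user_data
mapping are hypothetical placeholders, not the real fuse-over-io-uring
structures from the fuse-uring-for-6.8 branch or its libfuse counterpart,
so please treat it purely as an illustration.

/* Hedged sketch only: peek a batch of CQEs and group FUSE writes that are
 * contiguous on the same node, so userspace could merge them into one
 * larger backend IO.  'struct ring_ent' and the user_data mapping are
 * hypothetical placeholders. */
#include <liburing.h>
#include <linux/fuse.h>
#include <stdint.h>

struct ring_ent {		/* hypothetical per-ring-entry state */
	uint32_t opcode;	/* e.g. FUSE_WRITE */
	uint64_t nodeid;
	uint64_t offset;
	uint32_t size;
};

void scan_and_merge(struct io_uring *ring)
{
	struct io_uring_cqe *cqes[64];
	unsigned n = io_uring_peek_batch_cqe(ring, cqes, 64);
	struct ring_ent *prev = NULL;

	for (unsigned i = 0; i < n; i++) {
		struct ring_ent *ent =
			(struct ring_ent *)(uintptr_t)cqes[i]->user_data;

		if (prev && prev->opcode == FUSE_WRITE &&
		    ent->opcode == FUSE_WRITE &&
		    ent->nodeid == prev->nodeid &&
		    ent->offset == prev->offset + prev->size) {
			/* contiguous with the previous write: fold it into
			 * one larger candidate backend request */
			prev->size += ent->size;
		} else {
			/* start a new merge candidate; real code would
			 * dispatch the previous one here */
			prev = ent;
		}
	}
	io_uring_cq_advance(ring, n);
}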