Hi Jingbo,

On 3/7/24 03:16, Jingbo Xu wrote:
> Hi Bernd,
>
> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>
>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>
>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Miklos,
>>>>>
>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>
>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>
>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> Increase the FUSE_MAX_MAX_PAGES limit, so that the maximum data
>>>>>>>>> size of a single request is increased.
>>>>>>>>
>>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>>> This needs to be thought through, since we are increasing the
>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>
>>>>>> Apart from the request size, the maximum number of background
>>>>>> requests, i.e. max_background (12 by default, and configurable by
>>>>>> the fuse daemon), also limits the amount of memory that an
>>>>>> unprivileged user can pin. But yes, increasing the maximum request
>>>>>> size does increase that amount proportionally.
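(To put rough numbers on the pinning concern, assuming 4K pages, the
default max_background of 12, and that each in-flight background request
pins up to the full request size:

    current limit:  12 * 256 pages * 4K = 12MB
    proposed limit: 12 * 1024 pages * 4K = 48MB

These are only ballpark upper bounds; a daemon that raises
max_background scales them up accordingly.)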
>>>>>>>>>
>>>>>>>>> This optimizes the write performance, especially when the optimal
>>>>>>>>> IO size of the backend store at the fuse daemon side is greater
>>>>>>>>> than the original maximum request size (i.e. 1MB with 256
>>>>>>>>> FUSE_MAX_MAX_PAGES and 4096 PAGE_SIZE).
>>>>>>>>>
>>>>>>>>> Note that this only increases the upper limit of the maximum
>>>>>>>>> request size, while the real maximum request size relies on the
>>>>>>>>> FUSE_INIT negotiation with the fuse daemon.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>> Bytedance folks seem to have increased the maximum request size
>>>>>>>>> to 8M and seen a ~20% performance boost.
>>>>>>>>
>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>
>>>>>>> Yeah, I guess so.
>>>>>>>
>>>>>>>> It would be interesting to see how the number of pages per request
>>>>>>>> affects performance and why.
>>>>>>>
>>>>>>> To be honest, I'm not sure about the root cause of the performance
>>>>>>> boost in Bytedance's case.
>>>>>>>
>>>>>>> In our internal scenario, the optimal IO size of the backend store
>>>>>>> at the fuse server side is e.g. 4MB, and thus the maximum
>>>>>>> throughput cannot be achieved with the current 256 pages per
>>>>>>> request. IOW the backend store, e.g. a distributed parallel
>>>>>>> filesystem, gets optimal performance when the data is aligned at
>>>>>>> 4MB boundaries. I can ask my colleague who implements the fuse
>>>>>>> server to give more background info and the exact performance
>>>>>>> statistics.
>>>>>>
>>>>>> Here are more details about our internal use case:
>>>>>>
>>>>>> We have a fuse server used in our internal cloud scenarios, where
>>>>>> the backend store is actually a distributed filesystem. That is,
>>>>>> the fuse server actually acts as the client of the remote
>>>>>> distributed filesystem. The fuse server forwards the fuse requests
>>>>>> to the remote backing store over the network, while the remote
>>>>>> distributed filesystem handles the IO requests, e.g. processes the
>>>>>> data from/to the persistent store.
>>>>>>
>>>>>> Now to the details of how the remote distributed filesystem
>>>>>> processes the requested data with the persistent store.
>>>>>>
>>>>>> [1] The remote distributed filesystem uses e.g. an 8+3 mode EC
>>>>>> (ErasureCode), where each fixed-size piece of user data is split
>>>>>> and stored as 8 data blocks plus 3 extra parity blocks. For
>>>>>> example, with a 512KB block size, each 4MB of user data is split
>>>>>> and stored as 8 (512KB) data blocks plus 3 (512KB) parity blocks.
>>>>>>
>>>>>> It also utilizes striping to boost performance. For example, there
>>>>>> are 8 data disks and 3 parity disks in the above 8+3 mode example,
>>>>>> and each stripe consists of 8 data blocks and 3 parity blocks.
>>>>>>
>>>>>> [2] To avoid data corruption on power loss, the remote distributed
>>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>>> request is received. Because of the EC described above, when the
>>>>>> write fuse request is not aligned on the 4MB (stripe size)
>>>>>> boundary, say it's 1MB in size, the other 3MB is read from the
>>>>>> persistent store first, then the 3 parity blocks are computed over
>>>>>> the complete 4MB stripe, and finally the 8 data blocks and 3 parity
>>>>>> blocks are written down.
>>>>>>
>>>>>> Thus the write amplification is non-negligible and is the
>>>>>> performance bottleneck when the fuse request size is less than the
>>>>>> stripe size.
>>>>>>
>>>>>> Here are some simple performance statistics with varying request
>>>>>> sizes. With a 4MB stripe size, there's ~3x bandwidth improvement
>>>>>> when the maximum request size is increased from 256KB to 3.9MB, and
>>>>>> another ~20% improvement when the request size is increased from
>>>>>> 3.9MB to 4MB.
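(To make the amplification above concrete, here is a toy model, assuming
a 4MB stripe of 8 data + 3 parity blocks and that any partial-stripe
write triggers a read-modify-write of the whole stripe; the numbers are
illustrative, not measured:

#include <stdio.h>

#define STRIPE_SIZE (4UL << 20) /* 4MB of user data per stripe */

/* bytes physically written for a req_size write: every touched stripe
 * is rewritten in full, and data + parity makes that 11/8 of it */
static unsigned long ec_bytes_written(unsigned long req_size)
{
        unsigned long stripes = (req_size + STRIPE_SIZE - 1) / STRIPE_SIZE;

        return stripes * STRIPE_SIZE * 11 / 8;
}

int main(void)
{
        /* 1MB request -> 5632 KB hit the disks, ~5.5x amplification */
        printf("1MB: %lu KB written\n", ec_bytes_written(1UL << 20) >> 10);
        /* 4MB aligned request -> also 5632 KB, but only ~1.4x */
        printf("4MB: %lu KB written\n", ec_bytes_written(4UL << 20) >> 10);
        return 0;
}

On top of that, the unaligned 1MB case first has to read the missing 3MB
from the persistent store, which the model above does not even count.)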
>>>>
>>>> I sort of understand the issue, although my guess is that this could
>>>> be worked around in the client by coalescing writes. This could be
>>>> done by adding a small delay before sending a write request off to
>>>> the network.
>>>>
>>>> Would that work in your case?
>>>
>>> It's possible, but I'm not sure. I've asked my colleagues working on
>>> the fuse server and the backend store, though I haven't received a
>>> reply yet. But I guess it's not as simple as directly increasing the
>>> maximum FUSE request size, and thus more complexity gets involved.
>>>
>>> I can also understand the concern that this may increase the risk of
>>> pinning a larger memory footprint, and that a more generic usage
>>> scenario needs to be considered. I can make it a private patch for
>>> our internal product.
>>>
>>> Thanks for the suggestions and discussion.
>>
>> It also gets kind of solved in my fuse-over-io-uring branch - as long
>> as there are enough free ring entries. I'm going to add a flag there
>> indicating that other CQEs might be follow-up requests. Really time to
>> post a new version.
>
> Thanks for the information. I've not read the fuse-over-io-uring branch
> yet, but it sounds like it would be very helpful. Would there be a flag
> in the FUSE request indicating that it's one of the linked FUSE
> requests? Is this feature, say linked FUSE requests, enabled only when
> io-uring is used with FUSE?

The current development branch is
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8 (it sometimes
gets rebased/force-pushed with incompatible changes - the corresponding
libfuse branch is updated accordingly). The patches need cleanup before
I can send the next RFC version. And I first want to change the fixed
single request size (it is not so nice to use 1MB requests when 4K would
be sufficient, for things like metadata and small IO).

I just checked, struct fuse_write_in has a write_flags field:

/**
 * WRITE flags
 *
 * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
 * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
 * FUSE_WRITE_KILL_SUIDGID: kill suid and sgid bits
 */
#define FUSE_WRITE_CACHE	(1 << 0)
#define FUSE_WRITE_LOCKOWNER	(1 << 1)
#define FUSE_WRITE_KILL_SUIDGID	(1 << 2)

I guess we could extend that and add a flag saying that more pages are
available and will come in the next request - that would avoid guessing
and timeouts on the daemon/server side. With uring that would be helpful
as well, though with uring one can also just look through the available
CQEs and see if they belong together. I don't think there is much
control right now on the kernel side to submit multiple requests
together, but even without that I have seen consecutive requests in a
CQE completion round.

Bernd
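P.S. A rough sketch of what I mean by the extra write flag - the flag
name, its value, and the daemon-side helper are all made up here,
nothing is agreed on yet:

#include <stdint.h>
#include <stdio.h>

/* existing flags from include/uapi/linux/fuse.h: */
#define FUSE_WRITE_CACHE	(1 << 0)
#define FUSE_WRITE_LOCKOWNER	(1 << 1)
#define FUSE_WRITE_KILL_SUIDGID	(1 << 2)
/* hypothetical: more data for the same file follows in the next WRITE
 * request, so the daemon may delay submission and coalesce */
#define FUSE_WRITE_MORE_DATA	(1 << 3)

/* daemon-side decision: flush the buffered data now, or keep buffering
 * until a full stripe can be written to the backend in one go */
static int should_flush(uint32_t write_flags, size_t buffered,
			size_t stripe)
{
	if (buffered >= stripe)
		return 1;	/* full stripe collected, write it out */
	if (write_flags & FUSE_WRITE_MORE_DATA)
		return 0;	/* more data announced, keep buffering */
	return 1;		/* last piece, flush the partial stripe */
}

int main(void)
{
	/* 1MB buffered, more announced -> wait (prints 0) */
	printf("%d\n", should_flush(FUSE_WRITE_MORE_DATA, 1 << 20, 4 << 20));
	/* 4MB buffered -> flush even though more was announced (prints 1) */
	printf("%d\n", should_flush(FUSE_WRITE_MORE_DATA, 4 << 20, 4 << 20));
	return 0;
}

With something like this the daemon would not need timeout heuristics to
decide when to stop waiting for more data.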