On 3/6/24 14:32, Jingbo Xu wrote:
>
>
> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi Miklos,
>>>
>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>
>>>>>>> Increase the FUSE_MAX_MAX_PAGES limit, so that the maximum data size
>>>>>>> of a single request is increased.
>>>>>>
>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>> This needs to be thought through, since we are increasing the
>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>
>>>> Apart from the request size, the maximum number of background requests,
>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>> daemon), also limits the amount of memory that an unprivileged user
>>>> can pin. But yes, increasing the maximum request size does increase
>>>> that amount proportionally.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> This optimizes the write performance, especially when the optimal IO
>>>>>>> size of the backend store at the fuse daemon side is greater than the
>>>>>>> original maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES
>>>>>>> and 4096 PAGE_SIZE).
>>>>>>>
>>>>>>> Note that this only increases the upper limit of the maximum request
>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>> negotiation with the fuse daemon.
>>>>>>>
>>>>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>>>> ---
>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>> Bytedance folks seem to have increased the maximum request size to
>>>>>>> 8M and seen a ~20% performance boost.
>>>>>>
>>>>>> The 20% is against the 256 pages, I guess.
>>>>>
>>>>> Yeah, I guess so.
>>>>>
>>>>>> It would be interesting to
>>>>>> see how the number of pages per request affects performance and
>>>>>> why.
>>>>>
>>>>> To be honest, I'm not sure about the root cause of the performance
>>>>> boost in Bytedance's case.
>>>>>
>>>>> In our internal use scenario, the optimal IO size of the backend
>>>>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>>>>> throughput cannot be achieved with the current 256 pages per request.
>>>>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>>>>> optimal performance when the data is aligned at a 4MB boundary. I can
>>>>> ask my colleague who implements the fuse server to give more
>>>>> background info and the exact performance statistics.
>>>>
>>>> Here are more details about our internal use case:
>>>>
>>>> We have a fuse server used in our internal cloud scenarios, while the
>>>> backend store is actually a distributed filesystem. That is, the fuse
>>>> server actually acts as a client of the remote distributed
>>>> filesystem. The fuse server forwards the fuse requests to the remote
>>>> backing store over the network, while the remote distributed
>>>> filesystem handles the IO requests, e.g. processes the data from/to
>>>> the persistent store.
>>>>
>>>> Here are the details of how the remote distributed filesystem
>>>> processes the requested data against the persistent store:
>>>>
>>>> [1] The remote distributed filesystem uses, e.g., an 8+3 EC
>>>> (ErasureCode) mode, where each fixed-size piece of user data is split
>>>> and stored as 8 data blocks plus 3 extra parity blocks. For example,
>>>> with a 512-byte block size, each 4MB piece of user data is split and
>>>> stored as 8 (512-byte) data blocks with 3 (512-byte) parity blocks.
>>>>
>>>> It also utilizes striping to boost performance; for example, there
>>>> are 8 data disks and 3 parity disks in the above 8+3 mode example, in
>>>> which each stripe consists of 8 data blocks and 3 parity blocks.
>>>>
>>>> [2] To avoid data corruption on power off, the remote distributed
>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>> request is received. Because of the EC scheme described above, when
>>>> the write fuse request is not aligned on a 4MB (the stripe size)
>>>> boundary, say it's 1MB in size, the other 3MB is read from the
>>>> persistent store first, then the extra 3 parity blocks are computed
>>>> over the complete 4MB stripe, and finally the 8 data blocks and 3
>>>> parity blocks are written down.
>>>>
>>>> Thus the write amplification is non-negligible and is the performance
>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>
>>>> Here are some simple performance statistics with varying request
>>>> sizes. With a 4MB stripe size, there's a ~3x bandwidth improvement
>>>> when the maximum request size is increased from 256KB to 3.9MB, and
>>>> another ~20% improvement when the request size is increased from
>>>> 3.9MB to 4MB.
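To put rough numbers on the read-modify-write cost described in [2]: a sub-stripe O_SYNC write forces the missing part of the stripe to be read back and the whole stripe plus parity to be rewritten. The sketch below is editorial, not from the thread; it assumes the 8+3 geometry and 4MB stripe from the description above, and a request starting on a stripe boundary.

/* Editorial sketch: bytes moved to persist one O_SYNC write into an
 * 8+3 EC stripe, per [1]/[2] above.  Assumes a 4MB stripe of user data
 * and a request starting on a stripe boundary; illustration only. */
#include <stdio.h>

#define STRIPE_SIZE  (4ULL << 20)   /* 4MB of user data per stripe */
#define PARITY_RATIO (3.0 / 8.0)    /* 3 parity blocks per 8 data blocks */

static unsigned long long rmw_bytes(unsigned long long req)
{
    unsigned long long tail = req % STRIPE_SIZE;
    /* Partial stripe: the missing part is read back first. */
    unsigned long long read_back = tail ? STRIPE_SIZE - tail : 0;
    /* The whole (data + parity) stripe set is then written out. */
    unsigned long long data = req + read_back;
    unsigned long long written = data + (unsigned long long)(data * PARITY_RATIO);
    return read_back + written;
}

int main(void)
{
    /* 1MB request: 3MB read back, 5.5MB written -> ~8.5x amplification */
    printf("1MB -> %llu bytes moved\n", rmw_bytes(1ULL << 20));
    /* 4MB request: nothing read back, 5.5MB written -> ~1.4x */
    printf("4MB -> %llu bytes moved\n", rmw_bytes(4ULL << 20));
    return 0;
}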
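For reference, the FUSE_INIT negotiation mentioned in the patch description is where a daemon would actually request the larger size. A minimal sketch using libfuse 3's low-level init callback (the callback name is hypothetical, and the claim that libfuse derives max_pages from max_write is an assumption about recent libfuse versions; the kernel still clamps the result to its FUSE_MAX_MAX_PAGES-derived limit, which is why the patch raises that limit):

/* Editorial sketch: a fuse daemon opting in to larger writes at
 * FUSE_INIT time via libfuse 3's low-level API. */
#define FUSE_USE_VERSION 35
#include <fuse_lowlevel.h>

static void my_init(void *userdata, struct fuse_conn_info *conn)
{
    (void)userdata;
    /* Match the backend's 4MB optimal IO size from the example above.
     * On kernels that support it, libfuse derives the max_pages value
     * sent back in the FUSE_INIT reply from max_write; the kernel
     * clamps it to its FUSE_MAX_MAX_PAGES-based limit. */
    conn->max_write = 4 * 1024 * 1024;
}

/* Wired up as the .init member of struct fuse_lowlevel_ops. */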
>> I sort of understand the issue, although my guess is that this could
>> be worked around in the client by coalescing writes. This could be
>> done by adding a small delay before sending a write request off to
>> the network.
>>
>> Would that work in your case?
>
> It's possible, but I'm not sure. I've asked my colleagues who work on
> the fuse server and the backend store, though I haven't received a
> reply yet. But I guess it's not as simple as directly increasing the
> maximum FUSE request size, and more complexity would be involved.
>
> I can also understand the concern that this may increase the risk of
> pinning a larger memory footprint, and that more generic usage
> scenarios need to be considered. I can make it a private patch for our
> internal product.
>
> Thanks for the suggestions and discussion.

It also gets kind of solved in my fuse-over-io-uring branch - as long
as there are enough free ring entries. I'm going to add a flag there
indicating that other CQEs might be follow-up requests. Really time to
post a new version.

Bernd
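As a footnote, the coalescing workaround Miklos suggests could look roughly like the sketch below: buffer contiguous sub-stripe writes in the fuse server and flush once a full stripe accumulates or after a short delay. All names and the fixed 4MB stripe size are assumptions for illustration; this is not from the thread or any posted patch.

/* Editorial sketch of the write coalescing suggested above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define STRIPE_SIZE (4u << 20)     /* backend's optimal IO size */
#define FLUSH_DELAY_MS 5           /* the "small delay" before sending */

struct coalesce_buf {
    uint64_t offset;               /* file offset of buffered data */
    uint32_t len;                  /* bytes currently buffered */
    struct timespec first_write;   /* age of the oldest buffered byte */
    char data[STRIPE_SIZE];
};

/* Stub standing in for the real network send; hypothetical. */
static void backend_write(uint64_t off, const void *buf, uint32_t len)
{
    (void)buf;
    printf("send %u bytes at offset %llu\n", len, (unsigned long long)off);
}

static void flush(struct coalesce_buf *cb)
{
    if (cb->len) {
        backend_write(cb->offset, cb->data, cb->len);
        cb->len = 0;
    }
}

/* Called for each incoming FUSE write request. */
static void coalesced_write(struct coalesce_buf *cb, uint64_t off,
                            const void *buf, uint32_t len)
{
    /* Non-contiguous write, or buffer would overflow: flush first. */
    if (cb->len && (cb->offset + cb->len != off ||
                    cb->len + len > STRIPE_SIZE))
        flush(cb);

    if (len >= STRIPE_SIZE) {      /* already stripe-sized: send as is */
        backend_write(off, buf, len);
        return;
    }
    if (!cb->len) {
        cb->offset = off;
        clock_gettime(CLOCK_MONOTONIC, &cb->first_write);
    }
    memcpy(cb->data + cb->len, buf, len);
    cb->len += len;

    if (cb->len == STRIPE_SIZE)    /* full stripe: no need to wait */
        flush(cb);
    /* Otherwise a timer (not shown) calls flush() once the buffer is
     * FLUSH_DELAY_MS old, so a lone write is not delayed forever. */
}

static struct coalesce_buf cb;     /* static: the buffer is 4MB */

int main(void)
{
    char chunk[1024] = { 0 };
    coalesced_write(&cb, 0, chunk, sizeof(chunk));             /* buffered */
    coalesced_write(&cb, sizeof(chunk), chunk, sizeof(chunk)); /* merged */
    flush(&cb);                    /* timer path, forced here */
    return 0;
}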