On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Miklos,
>
> On 1/26/24 2:29 PM, Jingbo Xu wrote:
> >
> >
> > On 1/24/24 8:47 PM, Jingbo Xu wrote:
> >>
> >>
> >> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
> >>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
> >>>>
> >>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
> >>>> single request is increased.
> >>>
> >>> The only worry is about where this memory is getting accounted to.
> >>> This needs to be thought through, since the we are increasing the
> >>> possible memory that an unprivileged user is allowed to pin.
> >
> > Apart from the request size, the maximum number of background requests,
> > i.e. max_background (12 by default, and configurable by the fuse
> > daemon), also limits the size of the memory that an unprivileged user
> > can pin. But yes, it indeed increases the number proportionally by
> > increasing the maximum request size.
> >
> >
> >
> >>
> >>>
> >>>
> >>>
> >>>>
> >>>> This optimizes the write performance especially when the optimal IO size
> >>>> of the backend store at the fuse daemon side is greater than the original
> >>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
> >>>> 4096 PAGE_SIZE).
> >>>>
> >>>> Be noted that this only increases the upper limit of the maximum request
> >>>> size, while the real maximum request size relies on the FUSE_INIT
> >>>> negotiation with the fuse daemon.
> >>>>
> >>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
> >>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
> >>>> ---
> >>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
> >>>> Bytedance floks seems to had increased the maximum request size to 8M
> >>>> and saw a ~20% performance boost.
> >>>
> >>> The 20% is against the 256 pages, I guess.
> >>
> >> Yeah I guess so.
> >>
> >>
> >>> It would be interesting to
> >>> see the how the number of pages per request affects performance and
> >>> why.
> >>
> >> To be honest, I'm not sure the root cause of the performance boost in
> >> bytedance's case.
> >>
> >> While in our internal use scenario, the optimal IO size of the backend
> >> store at the fuse server side is, e.g. 4MB, and thus if the maximum
> >> throughput can not be achieved with current 256 pages per request. IOW
> >> the backend store, e.g. a distributed parallel filesystem, get optimal
> >> performance when the data is aligned at 4MB boundary. I can ask my folk
> >> who implements the fuse server to give more background info and the
> >> exact performance statistics.
> >
> > Here are more details about our internal use case:
> >
> > We have a fuse server used in our internal cloud scenarios, while the
> > backend store is actually a distributed filesystem. That is, the fuse
> > server actually plays as the client of the remote distributed
> > filesystem. The fuse server forwards the fuse requests to the remote
> > backing store through network, while the remote distributed filesystem
> > handles the IO requests, e.g. process the data from/to the persistent store.
> >
> > Then it comes the details of the remote distributed filesystem when it
> > process the requested data with the persistent store.
> >
> > [1] The remote distributed filesystem uses, e.g. a 8+3 mode, EC
> > (ErasureCode), where each fixed sized user data is split and stored as 8
> > data blocks plus 3 extra parity blocks. For example, with 512 bytes
> > block size, for each 4MB user data, it's split and stored as 8 (512
> > bytes) data blocks with 3 (512 bytes) parity blocks.
> >
> > It also utilize the stripe technology to boost the performance, for
> > example, there are 8 data disks and 3 parity disks in the above 8+3 mode
> > example, in which each stripe consists of 8 data blocks and 3 parity
> > blocks.
> >
> > [2] To avoid data corruption on power off, the remote distributed
> > filesystem commit a O_SYNC write right away once a write (fuse) request
> > received. Since the EC described above, when the write fuse request is
> > not aligned on 4MB (the stripe size) boundary, say it's 1MB in size, the
> > other 3MB is read from the persistent store first, then compute the
> > extra 3 parity blocks with the complete 4MB stripe, and finally write
> > the 8 data blocks and 3 parity blocks down.
> >
> >
> > Thus the write amplification is un-neglectable and is the performance
> > bottleneck when the fuse request size is less than the stripe size.
> >
> > Here are some simple performance statistics with varying request size.
> > With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
> > request size is increased from 256KB to 3.9MB, and another ~20%
> > improvement when the request size is increased to 4MB from 3.9MB.

I sort of understand the issue, although my guess is that this could
be worked around in the client by coalescing writes. This could be
done by adding a small delay before sending a write request off to the
network.

Would that work in your case?

Thanks,
Miklos
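
To make the coalescing idea a bit more concrete, here is a rough sketch
in Python. It is not taken from any existing fuse server; the names
WriteCoalescer, flush, max_delay and tick are hypothetical, and the 5ms
delay is an arbitrary placeholder. It assumes writes arrive mostly
sequentially and that the 4MB stripe size from the example above is
known to the client. Contiguous writes are buffered and sent as one
request once they reach a stripe boundary; anything left over is sent
when the delay expires. Whether such a delay is acceptable given the
O_SYNC behaviour described in [2] is a separate question that the
sketch does not try to answer.

```python
MB = 1024 * 1024
STRIPE_SIZE = 4 * MB  # stripe size from the example above


class WriteCoalescer:
    """Hypothetical helper: merge contiguous writes up to a stripe boundary."""

    def __init__(self, flush, stripe_size=STRIPE_SIZE, max_delay=0.005):
        self.flush = flush              # callback(offset, data): sends one write
        self.stripe_size = stripe_size
        self.max_delay = max_delay      # seconds to wait for more data (assumed)
        self.offset = 0                 # file offset of the first buffered byte
        self.buf = bytearray()
        self.first_seen = 0.0           # arrival time of the oldest buffered data

    def _emit(self, n):
        # Send the first n buffered bytes as a single write request.
        self.flush(self.offset, bytes(self.buf[:n]))
        del self.buf[:n]
        self.offset += n

    def write(self, offset, data, now):
        # A non-contiguous write cannot be merged; send what is pending first.
        if self.buf and offset != self.offset + len(self.buf):
            self._emit(len(self.buf))
        if not self.buf:
            self.offset, self.first_seen = offset, now
        self.buf += data
        # Send every chunk that reaches the next stripe boundary. For an
        # aligned offset this is exactly one full stripe.
        while self.buf:
            boundary = (self.offset // self.stripe_size + 1) * self.stripe_size
            n = boundary - self.offset
            if len(self.buf) < n:
                break
            self._emit(n)
            self.first_seen = now       # whatever remains came from this write

    def tick(self, now):
        # Called periodically: stop waiting once the delay has expired.
        if self.buf and now - self.first_seen >= self.max_delay:
            self._emit(len(self.buf))


if __name__ == "__main__":
    c = WriteCoalescer(lambda off, data: print(off, len(data)))
    for i in range(4):                  # four 1MB writes, one 4MB request out
        c.write(i * MB, b"\0" * MB, now=0.0)
```

With four sequential 1MB writes the backend sees a single stripe-sized
request, so it would not need to read the other 3MB back before
computing parity. A trailing partial stripe still goes out on its own
after the delay, so the read-modify-write cost remains for that case.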