Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

Miklos Szeredi <miklos@xxxxxxxxxx> · Tue, 5 Mar 2024 15:26:26 +0100

On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Miklos,
>
> On 1/26/24 2:29 PM, Jingbo Xu wrote:
> >
> >
> > On 1/24/24 8:47 PM, Jingbo Xu wrote:
> >>
> >>
> >> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
> >>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
> >>>>
> >>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
> >>>> single request is increased.
> >>>
> >>> The only worry is about where this memory is getting accounted to.
> >>> This needs to be thought through, since we are increasing the
> >>> possible memory that an unprivileged user is allowed to pin.
> >
> > Apart from the request size, the maximum number of background requests,
> > i.e. max_background (12 by default, and configurable by the fuse
> > daemon), also limits the size of the memory that an unprivileged user
> > can pin.  But yes, it indeed increases the number proportionally by
> > increasing the maximum request size.
> >
> >
> >>
> >>>
> >>>
> >>>
> >>>>
> >>>> This optimizes the write performance especially when the optimal IO size
> >>>> of the backend store at the fuse daemon side is greater than the original
> >>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
> >>>> 4096 PAGE_SIZE).
> >>>>
> >>>> Note that this only increases the upper limit of the maximum request
> >>>> size, while the real maximum request size relies on the FUSE_INIT
> >>>> negotiation with the fuse daemon.
> >>>>
> >>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
> >>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
> >>>> ---
> >>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
> >>>> Bytedance folks seem to have increased the maximum request size to 8M
> >>>> and seen a ~20% performance boost.
> >>>
> >>> The 20% is against the 256 pages, I guess.
> >>
> >> Yeah I guess so.
> >>
> >>
> >>> It would be interesting to
> >>> see how the number of pages per request affects performance and
> >>> why.
> >>
> >> To be honest, I'm not sure about the root cause of the performance
> >> boost in bytedance's case.
> >>
> >> In our internal use scenario, the optimal IO size of the backend store
> >> at the fuse server side is, e.g., 4MB, and thus the maximum throughput
> >> cannot be achieved with the current 256 pages per request. IOW the
> >> backend store, e.g. a distributed parallel filesystem, gets optimal
> >> performance when the data is aligned on a 4MB boundary.  I can ask my
> >> colleague who implements the fuse server to give more background info
> >> and the exact performance statistics.
> >
> > Here are more details about our internal use case:
> >
> > We have a fuse server used in our internal cloud scenarios, while the
> > backend store is actually a distributed filesystem.  That is, the fuse
> > server actually acts as the client of the remote distributed
> > filesystem.  The fuse server forwards the fuse requests to the remote
> > backing store over the network, while the remote distributed filesystem
> > handles the IO requests, e.g. processing the data from/to the
> > persistent store.
> >
> > Here are the details of how the remote distributed filesystem processes
> > the requested data with the persistent store.
> >
> > [1] The remote distributed filesystem uses EC (erasure coding) in, e.g.,
> > 8+3 mode, where each fixed-size chunk of user data is split and stored
> > as 8 data blocks plus 3 extra parity blocks. For example, with a 512KB
> > block size, each 4MB of user data is split and stored as 8 (512KB) data
> > blocks with 3 (512KB) parity blocks.
> >
> > It also uses striping to boost performance. For example, there are 8
> > data disks and 3 parity disks in the above 8+3 mode example, and each
> > stripe consists of 8 data blocks and 3 parity blocks.
> >
> > [2] To avoid data corruption on power off, the remote distributed
> > filesystem commits an O_SYNC write right away once a write (fuse)
> > request is received.  Because of the EC described above, when the write
> > fuse request is not aligned on a 4MB (the stripe size) boundary, say
> > it's 1MB in size, the other 3MB is read from the persistent store first,
> > then the 3 extra parity blocks are computed over the complete 4MB
> > stripe, and finally the 8 data blocks and 3 parity blocks are written
> > down.
> >
> >
> > Thus the write amplification is non-negligible and is the performance
> > bottleneck when the fuse request size is less than the stripe size.
> >
> > Here are some simple performance statistics with varying request size.
> > With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
> > request size is increased from 256KB to 3.9MB, and another ~20%
> > improvement when the request size is increased to 4MB from 3.9MB.

I sort of understand the issue, although my guess is that this could
be worked around in the client by coalescing writes.  This could be
done by adding a small delay before sending a write request off to the
network.
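
To make the idea concrete, here is a rough sketch of what such
coalescing might look like in the fuse server.  It is only an
illustration, not fuse or kernel code: the WriteCoalescer class, the
5ms max_delay and the explicit "now" argument are all made up for the
example, and the 4MB stripe size is taken from the description above.
Contiguous writes are merged until the run ends on a stripe boundary
or the delay expires, and only then sent on.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024
STRIPE_SIZE = 4 * MB  # stripe size from the use case described in the thread


@dataclass
class WriteCoalescer:
    """Hypothetical helper: merge contiguous writes before sending them on."""

    max_delay: float = 0.005                   # assumed delay budget, in seconds
    sent: list = field(default_factory=list)   # (offset, length) pairs sent to the backend
    _start: int = -1                           # offset of the pending run, -1 if none
    _length: int = 0                           # bytes accumulated in the pending run
    _deadline: float = 0.0                     # time by which the pending run must be sent

    def flush(self):
        """Send the pending run (if any) to the backend as one request."""
        if self._start >= 0:
            self.sent.append((self._start, self._length))
            self._start, self._length = -1, 0

    def tick(self, now):
        """Send the pending run once it has waited for max_delay."""
        if self._start >= 0 and now >= self._deadline:
            self.flush()

    def write(self, offset, length, now):
        """Queue one incoming write request that arrived at time `now`."""
        self.tick(now)
        contiguous = self._start >= 0 and offset == self._start + self._length
        if self._start >= 0 and not contiguous:
            self.flush()                       # a gap or overlap ends the current run
        if self._start < 0:
            self._start, self._length = offset, 0
            self._deadline = now + self.max_delay
        self._length += length
        # Once the run ends on a stripe boundary there is nothing left to wait for.
        if (self._start + self._length) % STRIPE_SIZE == 0:
            self.flush()


# Four 1MB requests arriving 1ms apart leave as a single 4MB request.
coalescer = WriteCoalescer()
for i in range(4):
    coalescer.write(i * MB, MB, now=i * 0.001)
print(coalescer.sent)
```

A real server would drive tick() from a timer rather than from an
explicit "now", and would have to decide how the added delay interacts
with the O_SYNC commit described above.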

Would that work in your case?

Thanks,
Miklos