Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> · Mon, 26 Feb 2024 12:00:34 +0800

Hi Miklos,

On 1/26/24 2:29 PM, Jingbo Xu wrote:
> 
> 
> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>
>>
>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>
>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>> single request is increased.
>>>
>>> The only worry is about where this memory is getting accounted to.
>>> This needs to be thought through, since we are increasing the
>>> possible memory that an unprivileged user is allowed to pin.
> 
> Apart from the request size, the maximum number of background requests,
> i.e. max_background (12 by default, and configurable by the fuse
> daemon), also limits the size of the memory that an unprivileged user
> can pin.  But yes, increasing the maximum request size does increase
> that amount proportionally.
> 
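To put rough numbers on this, here is a back-of-the-envelope sketch.  It
assumes the pinned memory is bounded simply by max_background times the
maximum request size, which ignores foreground requests and any other
accounting; the function name is made up for illustration.

```python
PAGE_SIZE = 4096       # bytes, as in the patch description
MAX_BACKGROUND = 12    # default max_background mentioned above
MiB = 1 << 20


def background_pin_bound(max_pages, max_background=MAX_BACKGROUND,
                         page_size=PAGE_SIZE):
    """Rough upper bound (bytes) on data held by in-flight background requests.

    Simplifying assumption: every background slot holds one request of the
    maximum size; nothing else is counted.
    """
    return max_background * max_pages * page_size


if __name__ == "__main__":
    for max_pages in (256, 1024):
        bound = background_pin_bound(max_pages)
        print(f"max_pages={max_pages}: {bound // MiB} MiB")
```

Under these assumptions the bound scales linearly with the page limit, and
the fuse daemon can scale it further by raising max_background.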
> 
>>
>>>
>>>
>>>
>>>>
>>>> This optimizes the write performance especially when the optimal IO size
>>>> of the backend store at the fuse daemon side is greater than the original
>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>> 4096 PAGE_SIZE).
>>>>
>>>> Note that this only increases the upper limit of the maximum request
>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>> negotiation with the fuse daemon.
>>>>
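As a minimal sketch of the point above: the kernel-side limit only caps what
the fuse daemon asks for during FUSE_INIT.  The helper names below are
hypothetical, and the clamp to at least one page is my assumption about how
the negotiation behaves, not a copy of the kernel code.

```python
PAGE_SIZE = 4096   # bytes, as in the patch description
KiB = 1 << 10
MiB = 1 << 20


def effective_max_pages(daemon_max_pages, kernel_limit):
    """Pages per request after negotiation: clamp the daemon's value
    to the range [1, kernel_limit] (assumed behaviour)."""
    return min(kernel_limit, max(daemon_max_pages, 1))


def max_request_bytes(daemon_max_pages, kernel_limit, page_size=PAGE_SIZE):
    """Maximum data size of a single request, in bytes."""
    return effective_max_pages(daemon_max_pages, kernel_limit) * page_size


if __name__ == "__main__":
    # A daemon asking for 1024 pages is capped at 1 MiB by a limit of 256 ...
    print(max_request_bytes(1024, kernel_limit=256) // MiB, "MiB")
    # ... and gets 4 MiB once the limit is raised to 1024.
    print(max_request_bytes(1024, kernel_limit=1024) // MiB, "MiB")
    # A daemon asking for 128 pages is unaffected by the higher limit.
    print(max_request_bytes(128, kernel_limit=1024) // KiB, "KiB")
```

So daemons that do not ask for more than 256 pages see no change in request
size from raising the limit.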
>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>> ---
>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>> and saw a ~20% performance boost.
>>>
>>> The 20% is against the 256 pages, I guess. 
>>
>> Yeah I guess so.
>>
>>
>>> It would be interesting to
>>> see how the number of pages per request affects performance and
>>> why.
>>
>> To be honest, I'm not sure of the root cause of the performance boost in
>> Bytedance's case.
>>
>> In our internal use scenario, the optimal IO size of the backend
>> store at the fuse server side is, e.g. 4MB, and thus the maximum
>> throughput cannot be achieved with the current 256 pages per request.
>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>> optimal performance when the data is aligned on a 4MB boundary.  I can
>> ask my colleague who implements the fuse server to give more background
>> info and the exact performance statistics.
> 
> Here are more details about our internal use case:
> 
> We have a fuse server used in our internal cloud scenarios, while the
> backend store is actually a distributed filesystem.  That is, the fuse
> server actually acts as the client of the remote distributed
> filesystem.  The fuse server forwards the fuse requests to the remote
> backing store over the network, while the remote distributed filesystem
> handles the IO requests, e.g. moving the data from/to the persistent store.
> 
> Here are the details of how the remote distributed filesystem
> processes the requested data with the persistent store.
> 
> [1] The remote distributed filesystem uses EC (erasure coding), e.g. in
> an 8+3 mode, where each fixed-size chunk of user data is split and stored
> as 8 data blocks plus 3 extra parity blocks.  For example, with a 512KB
> block size, each 4MB of user data is split and stored as 8 (512KB) data
> blocks with 3 (512KB) parity blocks.
> 
> It also uses striping to boost performance.  For example, there are
> 8 data disks and 3 parity disks in the above 8+3 mode example, and each
> stripe consists of 8 data blocks and 3 parity blocks.
> 
> [2] To avoid data corruption on power loss, the remote distributed
> filesystem commits an O_SYNC write right away once a write (fuse) request
> is received.  Because of the EC described above, when the write fuse
> request is not aligned on a 4MB (the stripe size) boundary, say it's 1MB
> in size, the other 3MB is read from the persistent store first, then the
> 3 extra parity blocks are computed over the complete 4MB stripe, and
> finally the 8 data blocks and 3 parity blocks are written down.
> 
> 
> Thus the write amplification is non-negligible and is the performance
> bottleneck when the fuse request size is less than the stripe size.
> 
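To make the amplification concrete, here is a simplified model of the
read-modify-write described in [2].  It assumes a single request that starts
on a stripe boundary and fits within one 4MB stripe, derives the block size
as stripe size / 8, and only counts bytes moved to and from the persistent
store; it is not how the backend is actually implemented.

```python
MiB = 1 << 20
STRIPE = 4 * MiB                     # stripe size from the description above
DATA_BLOCKS, PARITY_BLOCKS = 8, 3    # the 8+3 EC mode
BLOCK = STRIPE // DATA_BLOCKS        # assumed: stripe split evenly over data blocks


def backend_io(request_bytes):
    """Return (bytes_read, bytes_written) at the persistent store for one
    write request that covers the first request_bytes of a stripe."""
    if not 0 < request_bytes <= STRIPE:
        raise ValueError("request must fit within one stripe")
    # The part of the stripe not covered by the request is read back first.
    bytes_read = STRIPE - request_bytes
    # Then all data blocks and all parity blocks are written down.
    bytes_written = (DATA_BLOCKS + PARITY_BLOCKS) * BLOCK
    return bytes_read, bytes_written


def amplification(request_bytes):
    """Backend bytes moved per byte of user data written."""
    bytes_read, bytes_written = backend_io(request_bytes)
    return (bytes_read + bytes_written) / request_bytes


if __name__ == "__main__":
    for size in (1 * MiB, 2 * MiB, 4 * MiB):
        print(f"{size // MiB} MiB request: {amplification(size):.3f}x")
```

In this model a 1MB request moves 8.5 bytes of backend IO per user byte,
while a full 4MB stripe write moves 1.375, i.e. only the parity overhead.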
> Here are some simple performance statistics with varying request size.
> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
> request size is increased from 256KB to 3.9MB, and another ~20%
> improvement when the request size is increased to 4MB from 3.9MB.
> 

gentle ping ...

I'm not sure if our usage scenario described above sounds reasonable to
you.  Let me know if there are any other concerns.

-- 
Thanks,
Jingbo