Hi Jingbo,

On 3/7/24 03:16, Jingbo Xu wrote:
> Hi Bernd,
>
> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>
>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>
>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Miklos,
>>>>>
>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>
>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>
>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> Increase the FUSE_MAX_MAX_PAGES limit, so that the maximum data
>>>>>>>>> size of a single request is increased.
>>>>>>>>
>>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>>> This needs to be thought through, since we are increasing the
>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>
>>>>>> Apart from the request size, the maximum number of background
>>>>>> requests, i.e. max_background (12 by default, and configurable by
>>>>>> the fuse daemon), also limits the amount of memory that an
>>>>>> unprivileged user can pin. But yes, increasing the maximum request
>>>>>> size does increase that amount proportionally.
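(To put rough numbers on the pinning concern, assuming 4K pages, the
default max_background of 12, and that each in-flight background request
pins up to the full request size:

    current limit:  12 * 256 pages * 4K = 12MB
    proposed limit: 12 * 1024 pages * 4K = 48MB

These are only ballpark upper bounds; a daemon that raises
max_background scales them up accordingly.)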
>>>>>>>>>
>>>>>>>>> This optimizes the write performance, especially when the optimal
>>>>>>>>> IO size of the backend store at the fuse daemon side is greater
>>>>>>>>> than the original maximum request size (i.e. 1MB with 256
>>>>>>>>> FUSE_MAX_MAX_PAGES and 4096 PAGE_SIZE).
>>>>>>>>>
>>>>>>>>> Note that this only increases the upper limit of the maximum
>>>>>>>>> request size, while the real maximum request size relies on the
>>>>>>>>> FUSE_INIT negotiation with the fuse daemon.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>> Bytedance folks seem to have increased the maximum request size
>>>>>>>>> to 8M and seen a ~20% performance boost.
>>>>>>>>
>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>
>>>>>>> Yeah, I guess so.
>>>>>>>
>>>>>>>> It would be interesting to see how the number of pages per request
>>>>>>>> affects performance and why.
>>>>>>>
>>>>>>> To be honest, I'm not sure about the root cause of the performance
>>>>>>> boost in Bytedance's case.
>>>>>>>
>>>>>>> In our internal scenario, the optimal IO size of the backend store
>>>>>>> at the fuse server side is e.g. 4MB, and thus the maximum
>>>>>>> throughput cannot be achieved with the current 256 pages per
>>>>>>> request. IOW the backend store, e.g. a distributed parallel
>>>>>>> filesystem, gets optimal performance when the data is aligned at
>>>>>>> 4MB boundaries. I can ask my colleague who implements the fuse
>>>>>>> server to give more background info and the exact performance
>>>>>>> statistics.
>>>>>>
>>>>>> Here are more details about our internal use case:
>>>>>>
>>>>>> We have a fuse server used in our internal cloud scenarios, where
>>>>>> the backend store is actually a distributed filesystem. That is,
>>>>>> the fuse server actually acts as the client of the remote
>>>>>> distributed filesystem. The fuse server forwards the fuse requests
>>>>>> to the remote backing store over the network, while the remote
>>>>>> distributed filesystem handles the IO requests, e.g. processes the
>>>>>> data from/to the persistent store.
>>>>>>
>>>>>> Now to the details of how the remote distributed filesystem
>>>>>> processes the requested data with the persistent store.
>>>>>>
>>>>>> [1] The remote distributed filesystem uses e.g. an 8+3 mode EC
>>>>>> (ErasureCode), where each fixed-size piece of user data is split
>>>>>> and stored as 8 data blocks plus 3 extra parity blocks. For
>>>>>> example, with a 512KB block size, each 4MB of user data is split
>>>>>> and stored as 8 (512KB) data blocks plus 3 (512KB) parity blocks.
>>>>>>
>>>>>> It also utilizes striping to boost performance. For example, there
>>>>>> are 8 data disks and 3 parity disks in the above 8+3 mode example,
>>>>>> and each stripe consists of 8 data blocks and 3 parity blocks.
>>>>>>
>>>>>> [2] To avoid data corruption on power loss, the remote distributed
>>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>>> request is received. Because of the EC described above, when the
>>>>>> write fuse request is not aligned on the 4MB (stripe size)
>>>>>> boundary, say it's 1MB in size, the other 3MB is read from the
>>>>>> persistent store first, then the 3 parity blocks are computed over
>>>>>> the complete 4MB stripe, and finally the 8 data blocks and 3 parity
>>>>>> blocks are written down.
>>>>>>
>>>>>> Thus the write amplification is non-negligible and is the
>>>>>> performance bottleneck when the fuse request size is less than the
>>>>>> stripe size.
>>>>>>
>>>>>> Here are some simple performance statistics with varying request
>>>>>> sizes. With a 4MB stripe size, there's ~3x bandwidth improvement
>>>>>> when the maximum request size is increased from 256KB to 3.9MB, and
>>>>>> another ~20% improvement when the request size is increased from
>>>>>> 3.9MB to 4MB.
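(To make the amplification above concrete, here is a toy model, assuming
a 4MB stripe of 8 data + 3 parity blocks and that any partial-stripe
write triggers a read-modify-write of the whole stripe; the numbers are
illustrative, not measured:

#include <stdio.h>

#define STRIPE_SIZE (4UL << 20) /* 4MB of user data per stripe */

/* bytes physically written for a req_size write: every touched stripe
 * is rewritten in full, and data + parity makes that 11/8 of it */
static unsigned long ec_bytes_written(unsigned long req_size)
{
        unsigned long stripes = (req_size + STRIPE_SIZE - 1) / STRIPE_SIZE;

        return stripes * STRIPE_SIZE * 11 / 8;
}

int main(void)
{
        /* 1MB request -> 5632 KB hit the disks, ~5.5x amplification */
        printf("1MB: %lu KB written\n", ec_bytes_written(1UL << 20) >> 10);
        /* 4MB aligned request -> also 5632 KB, but only ~1.4x */
        printf("4MB: %lu KB written\n", ec_bytes_written(4UL << 20) >> 10);
        return 0;
}

On top of that, the unaligned 1MB case first has to read the missing 3MB
from the persistent store, which the model above does not even count.)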
>>>>
>>>> I sort of understand the issue, although my guess is that this could
>>>> be worked around in the client by coalescing writes. This could be
>>>> done by adding a small delay before sending a write request off to
>>>> the network.
>>>>
>>>> Would that work in your case?
>>>
>>> It's possible, but I'm not sure. I've asked my colleagues working on
>>> the fuse server and the backend store, though I haven't received a
>>> reply yet. But I guess it's not as simple as directly increasing the
>>> maximum FUSE request size, and thus more complexity gets involved.
>>>
>>> I can also understand the concern that this may increase the risk of
>>> pinning a larger memory footprint, and that a more generic usage
>>> scenario needs to be considered. I can make it a private patch for
>>> our internal product.
>>>
>>> Thanks for the suggestions and discussion.
>>
>> It also gets kind of solved in my fuse-over-io-uring branch - as long
>> as there are enough free ring entries. I'm going to add a flag there
>> indicating that other CQEs might be follow-up requests. Really time to
>> post a new version.
>
> Thanks for the information. I've not read the fuse-over-io-uring branch
> yet, but it sounds like it would be very helpful. Would there be a flag
> in the FUSE request indicating that it's one of the linked FUSE
> requests? Is this feature, say linked FUSE requests, enabled only when
> io-uring is used with FUSE?

The current development branch is
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8 (it sometimes
gets rebased/force-pushed with incompatible changes - the corresponding
libfuse branch is updated accordingly). The patches need cleanup before
I can send the next RFC version. And I first want to change the fixed
single request size (it is not so nice to use 1MB requests when 4K would
be sufficient, for things like metadata and small IO).

I just checked, struct fuse_write_in has a write_flags field:

/**
 * WRITE flags
 *
 * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
 * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
 * FUSE_WRITE_KILL_SUIDGID: kill suid and sgid bits
 */
#define FUSE_WRITE_CACHE	(1 << 0)
#define FUSE_WRITE_LOCKOWNER	(1 << 1)
#define FUSE_WRITE_KILL_SUIDGID	(1 << 2)

I guess we could extend that and add a flag saying that more pages are
available and will come in the next request - that would avoid guessing
and timeouts on the daemon/server side. With uring that would be helpful
as well, though with uring one can also just look through the available
CQEs and see if they belong together. I don't think there is much
control right now on the kernel side to submit multiple requests
together, but even without that I have seen consecutive requests in a
CQE completion round.

Bernd
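P.S. A rough sketch of what I mean by the extra write flag - the flag
name, its value, and the daemon-side helper are all made up here,
nothing is agreed on yet:

#include <stdint.h>
#include <stdio.h>

/* existing flags from include/uapi/linux/fuse.h: */
#define FUSE_WRITE_CACHE	(1 << 0)
#define FUSE_WRITE_LOCKOWNER	(1 << 1)
#define FUSE_WRITE_KILL_SUIDGID	(1 << 2)
/* hypothetical: more data for the same file follows in the next WRITE
 * request, so the daemon may delay submission and coalesce */
#define FUSE_WRITE_MORE_DATA	(1 << 3)

/* daemon-side decision: flush the buffered data now, or keep buffering
 * until a full stripe can be written to the backend in one go */
static int should_flush(uint32_t write_flags, size_t buffered,
			size_t stripe)
{
	if (buffered >= stripe)
		return 1;	/* full stripe collected, write it out */
	if (write_flags & FUSE_WRITE_MORE_DATA)
		return 0;	/* more data announced, keep buffering */
	return 1;		/* last piece, flush the partial stripe */
}

int main(void)
{
	/* 1MB buffered, more announced -> wait (prints 0) */
	printf("%d\n", should_flush(FUSE_WRITE_MORE_DATA, 1 << 20, 4 << 20));
	/* 4MB buffered -> flush even though more was announced (prints 1) */
	printf("%d\n", should_flush(FUSE_WRITE_MORE_DATA, 4 << 20, 4 << 20));
	return 0;
}

With something like this the daemon would not need timeout heuristics to
decide when to stop waiting for more data.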