Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

On 2024-01-26 01:29, Jingbo Xu wrote:
On 1/24/24 8:47 PM, Jingbo Xu wrote:


On 1/24/24 8:23 PM, Miklos Szeredi wrote:
On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:

From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>

Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
single request is increased.

The only worry is about where this memory is getting accounted to.
This needs to be thought through, since we are increasing the
possible memory that an unprivileged user is allowed to pin.

Apart from the request size, the maximum number of background requests,
i.e. max_background (12 by default, and configurable by the fuse
daemon), also limits the amount of memory that an unprivileged user can
pin.  But yes, it indeed increases that amount proportionally when the
maximum request size is increased (e.g. with the default max_background
of 12, going from 1MB to 4MB requests raises the worst case from roughly
12MB to 48MB per connection).

This improves write performance, especially when the optimal IO size of
the backend store on the fuse daemon side is greater than the original
maximum request size (i.e. 1MB with FUSE_MAX_MAX_PAGES of 256 and a
4096-byte PAGE_SIZE).

Note that this only increases the upper limit of the maximum request
size; the actual maximum request size still depends on the FUSE_INIT
negotiation with the fuse daemon.
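
(For illustration only: a fuse daemon typically picks its preferred
max_write in its init handler during FUSE_INIT.  The sketch below assumes
the libfuse 3 lowlevel API; the effective value is still clamped by the
kernel to its FUSE_MAX_MAX_PAGES-derived ceiling.)

#include <fuse_lowlevel.h>

/* Sketch of a daemon opting in to larger requests during FUSE_INIT.
 * The effective max_write is the minimum of what the daemon, libfuse
 * and the kernel (FUSE_MAX_MAX_PAGES * PAGE_SIZE) each allow. */
static void example_init(void *userdata, struct fuse_conn_info *conn)
{
	(void)userdata;
	conn->max_write = 4 * 1024 * 1024;	/* ask for 4MB writes */
}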

Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
---
I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
Bytedance folks seem to have increased the maximum request size to 8M
and saw a ~20% performance boost.

The 20% is against the 256 pages, I guess.

Yeah I guess so.


It would be interesting to
see how the number of pages per request affects performance and
why.

To be honest, I'm not sure about the root cause of the performance
boost in Bytedance's case.

In our internal use scenario, the optimal IO size of the backend
store at the fuse server side is, e.g., 4MB, and thus the maximum
throughput cannot be achieved with the current 256 pages per request.
IOW the backend store, e.g. a distributed parallel filesystem, gets
optimal performance when the data is aligned on a 4MB boundary.  I can
ask my colleague who implements the fuse server to provide more
background info and the exact performance statistics.

Here are more details about our internal use case:

We have a fuse server used in our internal cloud scenarios, whose
backend store is actually a distributed filesystem.  That is, the fuse
server acts as the client of the remote distributed filesystem.  The
fuse server forwards the fuse requests to the remote backing store over
the network, while the remote distributed filesystem handles the IO
requests, e.g. processes the data from/to the persistent store.

Here are the details of how the remote distributed filesystem processes
the requested data with the persistent store.

[1] The remote distributed filesystem uses, e.g., an 8+3 mode of EC
(ErasureCode), where each fixed-size piece of user data is split and
stored as 8 data blocks plus 3 extra parity blocks.  For example, with a
512KB block size, each 4MB of user data is split and stored as 8 data
blocks (512KB each) plus 3 parity blocks (512KB each).

It also utilizes striping to boost performance: for example, there are
8 data disks and 3 parity disks in the above 8+3 mode, and each stripe
consists of 8 data blocks and 3 parity blocks.
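
(Purely illustrative, not the real layout: a toy mapping from a file
offset to its stripe and data block under the 8+3 geometry above,
assuming 512KB blocks.)

#include <stdio.h>

#define DATA_DISKS	8
#define BLOCK_SIZE	(512UL << 10)			/* 512KB EC block */
#define STRIPE_SIZE	(DATA_DISKS * BLOCK_SIZE)	/* 4MB user data per stripe */

int main(void)
{
	unsigned long offset = 5UL << 20;		/* e.g. file offset 5MB */
	unsigned long stripe = offset / STRIPE_SIZE;
	unsigned long block  = (offset % STRIPE_SIZE) / BLOCK_SIZE;

	printf("offset %luMB -> stripe %lu, data block %lu of %d\n",
	       offset >> 20, stripe, block, DATA_DISKS);
	return 0;
}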

[2] To avoid data corruption on power off, the remote distributed
filesystem commits an O_SYNC write right away once a write (fuse)
request is received.  Because of the EC described above, when the write
fuse request is not aligned on a 4MB (stripe size) boundary, say it's
1MB in size, the other 3MB is first read from the persistent store, then
the 3 parity blocks are computed over the complete 4MB stripe, and
finally the 8 data blocks and 3 parity blocks are written down.


Thus the write amplification is non-negligible and becomes the
performance bottleneck when the fuse request size is smaller than the
stripe size.
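
(A back-of-the-envelope model of that read-modify-write cost, just to
illustrate the numbers; it assumes a 4MB stripe of 8x512KB data blocks
plus 3x512KB parity blocks and a single-stripe write, not the real
server code.)

#include <stdio.h>

#define STRIPE_SIZE	(4UL << 20)		/* 8 data blocks of 512KB */
#define PARITY_BYTES	(3UL * (512 << 10))	/* 3 parity blocks per stripe */

int main(void)
{
	unsigned long req_sizes[] = { 256UL << 10, 1UL << 20, 4UL << 20 };

	for (int i = 0; i < 3; i++) {
		unsigned long req = req_sizes[i];
		/* Sub-stripe writes must read back the rest of the stripe
		 * before the parity can be recomputed. */
		unsigned long read_back = STRIPE_SIZE - req;
		unsigned long written = STRIPE_SIZE + PARITY_BYTES;

		printf("%4luKB request: read %4luKB, write %4luKB, IO amplification %.1fx\n",
		       req >> 10, read_back >> 10, written >> 10,
		       (double)(read_back + written) / req);
	}
	return 0;
}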

Here are some simple performance statistics with varying request sizes.
With a 4MB stripe size, there's a ~3x bandwidth improvement when the
maximum request size is increased from 256KB to 3.9MB, and another ~20%
improvement when the request size is increased from 3.9MB to 4MB.

To add my own performance statistics in a microbenchmark:

I tested on both a small VM and large hardware, with a suitably large
FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write handler
did basically nothing but read the data buffers (memcmp() each 8 bytes
of the provided data against a variable).  I ran fio with a 128M
blocksize, end_fsync=1, and the psync IO engine, multiple runs each with
4 parallel jobs.  Throughput over varying write_size was as follows, in
MB/s:

write_size  machine1  machine2
32M         1071      6425
16M         1002      6445
8M          890       6443
4M          713       6342
2M          557       6290
1M          404       6201
512K        268       6041
256K        156       5782

Even on the fast machine, increasing the buffer size to 8M is worth 3.9% over keeping it at 1M, and it is worth over 2x on the small VM. We are striving to improve ingestion speed in particular, as we have seen it be a limiting factor on some machines, and there is a clear plateau around 8M. While most fuse servers would likely not benefit from this, and others would benefit more from fuse passthrough, it does seem like a performance win.
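
(For reference, a minimal sketch of the kind of do-almost-nothing write
handler described above, assuming the libfuse 3 lowlevel API; not the
actual test filesystem.)

#include <fuse_lowlevel.h>
#include <string.h>

/* Only touches the incoming payload (comparing each 8 bytes against a
 * fixed pattern), so the benchmark measures request transport cost
 * rather than any real storage work. */
static void bench_write(fuse_req_t req, fuse_ino_t ino, const char *buf,
			size_t size, off_t off, struct fuse_file_info *fi)
{
	static const unsigned long long pattern = 0x0123456789abcdefULL;
	size_t matches = 0;

	(void)ino; (void)off; (void)fi;

	for (size_t i = 0; i + sizeof(pattern) <= size; i += sizeof(pattern))
		matches += !memcmp(buf + i, &pattern, sizeof(pattern));

	(void)matches;
	fuse_reply_write(req, size);	/* report the whole write as done */
}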

Perhaps, in analogy to soft and hard limits on pipe size, FUSE_MAX_MAX_PAGES could be increased and treated as the maximum possible hard limit for max_write; and the default hard limit could stay at 1M, thereby allowing folks to opt into the new behavior if they care about the performance more than the memory?
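
(A rough sketch of what that could look like on the kernel side,
analogous to /proc/sys/fs/pipe-max-size; the names and values here are
hypothetical, not an actual patch.)

/* Hypothetical: raise the compile-time ceiling, but keep today's 1MB as
 * the default runtime limit so larger requests are opt-in. */
#define FUSE_MAX_MAX_PAGES		1024	/* absolute ceiling */
#define FUSE_DEFAULT_MAX_PAGES_LIMIT	256	/* 1MB with 4K pages */

/* Tunable by an admin (e.g. via a sysctl), clamped to the ceiling. */
static unsigned int fuse_max_pages_limit = FUSE_DEFAULT_MAX_PAGES_LIMIT;

/* Applied wherever FUSE_INIT negotiates max_pages with the daemon. */
static unsigned int fuse_clamp_max_pages(unsigned int requested)
{
	return requested > fuse_max_pages_limit ?
			fuse_max_pages_limit : requested;
}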

Sweet Tea



