On 4/8/24 01:32, Sweet Tea Dorminy wrote:
>
> On 2024-01-26 01:29, Jingbo Xu wrote:
>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>
>>>
>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>
>>>>> Increase the FUSE_MAX_MAX_PAGES limit, so that the maximum data
>>>>> size of a single request is increased.
>>>>
>>>> The only worry is about where this memory is getting accounted to.
>>>> This needs to be thought through, since we are increasing the
>>>> possible memory that an unprivileged user is allowed to pin.
>>
>> Apart from the request size, the maximum number of background
>> requests, i.e. max_background (12 by default, and configurable by the
>> fuse daemon), also limits the amount of memory that an unprivileged
>> user can pin. But yes, increasing the maximum request size does
>> increase that amount proportionally.
>>
>>
>>>
>>>> It would be interesting to see how the number of pages per request
>>>> affects performance and why.
>>>
>>> To be honest, I'm not sure of the root cause of the performance
>>> boost in bytedance's case.
>>>
>>> In our internal use scenario, the optimal IO size of the backend
>>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>>> throughput can not be achieved with the current 256 pages per
>>> request. IOW the backend store, e.g. a distributed parallel
>>> filesystem, gets optimal performance when the data is aligned on a
>>> 4MB boundary. I can ask my colleague who implements the fuse server
>>> to give more background info and the exact performance statistics.
>>
>> Here are more details about our internal use case:
>>
>> We have a fuse server used in our internal cloud scenarios, where the
>> backend store is actually a distributed filesystem. That is, the fuse
>> server acts as the client of the remote distributed filesystem. The
>> fuse server forwards the fuse requests to the remote backing store
>> over the network, while the remote distributed filesystem handles the
>> IO requests, e.g. processes the data from/to the persistent store.
>>
>> Here are the details of how the remote distributed filesystem
>> processes the requested data against the persistent store.
>>
>> [1] The remote distributed filesystem uses, e.g., an 8+3 mode EC
>> (Erasure Code), where each fixed-size piece of user data is split and
>> stored as 8 data blocks plus 3 extra parity blocks. For example, with
>> a 512KB block size, each 4MB of user data is split and stored as 8
>> (512KB) data blocks plus 3 (512KB) parity blocks.
>>
>> It also utilizes striping to boost performance: in the above 8+3 mode
>> example there are 8 data disks and 3 parity disks, and each stripe
>> consists of 8 data blocks and 3 parity blocks.
>>
>> [2] To avoid data corruption on power loss, the remote distributed
>> filesystem commits an O_SYNC write right away once a write (fuse)
>> request is received. Because of the EC described above, when the
>> write fuse request is not aligned on the 4MB (stripe size) boundary,
>> say it's 1MB in size, the other 3MB is read from the persistent store
>> first, then the 3 extra parity blocks are computed over the complete
>> 4MB stripe, and finally the 8 data blocks and 3 parity blocks are
>> written down.
>>
>>
>> Thus the write amplification is non-negligible and is the performance
>> bottleneck when the fuse request size is less than the stripe size.
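To make the write amplification above concrete, here is a rough sketch
of the arithmetic for one O_SYNC write against an 8+3 EC store with a
4MB stripe. This is illustrative only; the helper and the numbers are
my own, not taken from the actual server.

/*
 * Bytes physically written for one O_SYNC fuse write, assuming the
 * write starts on a stripe boundary and every touched stripe is read,
 * re-encoded and rewritten in full (data plus parity).
 */
#include <stdio.h>

#define STRIPE_SIZE	(4UL << 20)	/* 8 data blocks x 512KB */
#define DATA_BLOCKS	8
#define PARITY_BLOCKS	3

static unsigned long bytes_written(unsigned long write_size)
{
	/* stripes touched, rounding a partial stripe up to a whole one */
	unsigned long stripes = (write_size + STRIPE_SIZE - 1) / STRIPE_SIZE;

	/* each touched stripe is rewritten in full, parity included */
	return stripes * STRIPE_SIZE / DATA_BLOCKS * (DATA_BLOCKS + PARITY_BLOCKS);
}

int main(void)
{
	unsigned long sizes[] = { 256UL << 10, 1UL << 20, 4UL << 20 };

	for (int i = 0; i < 3; i++)
		printf("%4lu KB write -> %lu KB on disk (%.1fx amplification)\n",
		       sizes[i] >> 10, bytes_written(sizes[i]) >> 10,
		       (double)bytes_written(sizes[i]) / sizes[i]);
	return 0;
}

A 256KB request ends up rewriting a full 5.5MB stripe (data plus
parity), while a stripe-aligned 4MB request only pays the intrinsic
11/8 EC overhead, which is where the bottleneck for sub-stripe requests
comes from.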
>>
>> Here are some simple performance statistics with varying request
>> size. With a 4MB stripe size, there's ~3x bandwidth improvement when
>> the maximum request size is increased from 256KB to 3.9MB, and
>> another ~20% improvement when the request size is increased from
>> 3.9MB to 4MB.
>
> To add my own performance statistics from a microbenchmark:
>
> Tested on both a small VM and large hardware, with a suitably large
> FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write
> handlers did basically nothing but read the data buffers (memcmp()
> each 8 bytes of data provided against a variable), I ran fio with a
> 128M blocksize, end_fsync=1, the psync IO engine, and repeated runs
> of 4 parallel jobs each. Throughput was as follows over varying
> write_size, in MB/s:
>
> write_size   machine1   machine2
> 32M              1071       6425
> 16M              1002       6445
> 8M                890       6443
> 4M                713       6342
> 2M                557       6290
> 1M                404       6201
> 512K              268       6041
> 256K              156       5782
>
> Even on the fast machine, increasing the buffer size to 8M is worth
> 3.9% over keeping it at 1M, and it is worth over 2x on the small VM.
> We are striving to improve the ingestion speed in particular, as we
> have seen it be a limiting factor on some machines, and there's a
> clear plateau reached around 8M. While most fuse servers would likely
> not benefit from this, and others would benefit from fuse passthrough
> instead, it does seem like a performance win.
>
> Perhaps, in analogy to the soft and hard limits on pipe size,
> FUSE_MAX_MAX_PAGES could be increased and treated as the maximum
> possible hard limit for max_write, while the default hard limit stays
> at 1M, thereby allowing folks to opt into the new behavior if they
> care about the performance more than the memory?
>
> Sweet Tea

As I recall, the concern about increased message sizes is that it gives
a process the ability to allocate non-trivial amounts of kernel memory.
Perhaps the limits could be expanded only if the server has
CAP_SYS_ADMIN.
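For illustration, a minimal sketch of what gating the larger limit on
CAP_SYS_ADMIN could look like on the kernel side. The constant names
and the 8MB figure below are hypothetical, not a proposed patch:

#include <linux/capability.h>

#define FUSE_DEFAULT_MAX_MAX_PAGES	256	/* existing 1MB ceiling */
#define FUSE_PRIV_MAX_MAX_PAGES		2048	/* hypothetical 8MB ceiling */

/*
 * Pick the per-request page limit used to clamp the max_write value
 * negotiated in the INIT reply; current is the fuse daemon at that
 * point, so capable() checks the server's credentials.
 */
static unsigned int fuse_max_pages_limit(void)
{
	if (capable(CAP_SYS_ADMIN))
		return FUSE_PRIV_MAX_MAX_PAGES;

	return FUSE_DEFAULT_MAX_MAX_PAGES;
}

An unprivileged server would keep today's behavior, while a privileged
one could opt into the larger requests that the benchmarks above show
paying off up to roughly 8M.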