Hi Bernd, Thanks for the comment. On 7/5/24 7:50 PM, Bernd Schubert wrote: > > > On 7/5/24 12:04, Jingbo Xu wrote: >> Sometimes the file offset alignment needs to be opt-in to achieve the >> optimum performance at the backend store. >> >> For example when ErasureCode [1] is used at the backend store, the >> optimum write performance is achieved when the WRITE request is aligned >> with the stripe size of ErasureCode. Otherwise a non-aligned WRITE >> request needs to be split at the stripe size boundary. It is quite >> costly to handle these split partial requests, as firstly the whole >> stripe to which the split partial request belongs needs to be read out, >> then overwrite the read stripe buffer with the request, and finally write >> the whole stripe back to the persistent storage. >> >> Thus the backend store can suffer severe performance degradation when >> WRITE requests can not fit into one stripe exactly. The write performance >> can be 10x slower when the request is 256KB in size given 4MB stripe size. >> Also there can be 50% performance degradation in theory if the request >> is not stripe boundary aligned. >> >> Besides, the conveyed test indicates that, the non-alignment issue >> becomes more severe when decreasing fuse's max_ratio, maybe partly >> because the background writeback now is more likely to run parallelly >> with the dirtier. >> >> fuse's max_ratio ratio of aligned WRITE requests >> ---------------- ------------------------------- >> 70 99.9% >> 40 74% >> 20 45% >> 10 20% >> >> With the patched version, which makes the alignment constraint opt-in >> when constructing WRITE requests, the ratio of aligned WRITE requests >> increases to 98% (previously 20%) when fuse's max_ratio is 10. >> >> [1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@xxxxxxxxxxxxxxxxx/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4 >> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> >> >> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> >> --- >> fs/fuse/file.c | 4 ++++ >> fs/fuse/fuse_i.h | 6 ++++++ >> fs/fuse/inode.c | 9 +++++++++ >> include/uapi/linux/fuse.h | 9 ++++++++- >> 4 files changed, 27 insertions(+), 1 deletion(-) >> >> diff --git a/fs/fuse/file.c b/fs/fuse/file.c >> index f39456c65ed7..f9b477016c2e 100644 >> --- a/fs/fuse/file.c >> +++ b/fs/fuse/file.c >> @@ -2246,6 +2246,10 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page, >> if (ap->num_pages == data->max_pages && !fuse_pages_realloc(data)) >> return true; >> >> + /* Reached alignment */ >> + if (fc->opt_alignment && !(page->index % fc->opt_alignment_pages)) >> + return true; > > I link the idea, but couldn't it just do that with > fc->max_pages? I'm asking because fc->opt_alignment_pages > takes another uint32_t in fuse_init_out and there is not so much > space left anymore. I'm okay with resuing max_pages as the alignment constraint. They are the same in our internal scenarios. But I'm not sure if it is the case in other scenarios. >> /* >> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >> index 99e44ea7d875..9266b22cce8e 100644 >> --- a/fs/fuse/inode.c >> +++ b/fs/fuse/inode.c >> @@ -1331,6 +1331,15 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, >> } >> if (flags & FUSE_NO_EXPORT_SUPPORT) >> fm->sb->s_export_op = &fuse_export_fid_operations; >> + >> + /* fallback to default if opt_alignment <= PAGE_SHIFT */ >> + if (flags & FUSE_OPT_ALIGNMENT) { >> + if (arg->opt_alignment > PAGE_SHIFT) { >> + fc->opt_alignment = 1; >> + fc->opt_alignment_pages = 1 << >> + (arg->opt_alignment - PAGE_SHIFT); > > opt_alignment is the number of bits required for alignment? Not > very user friendly, from my point of view would be better to have > this in a byte or kb unit. > Actually I referred to fuse_init_out.map_alignment, which is also log2(byte alignment). Anyway I'm okay making it a more understandable name. -- Thanks, Jingbo