On Thu, 27 Jul 2023 at 18:58, Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi,
>
> On 2023/07/27 17:35, Miklos Szeredi wrote:
> > On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <jaco@xxxxxxxxx> wrote:
> >> This patch does not mess with the caching infrastructure like the
> >> previous one, which we believe caused excessive CPU usage and broke
> >> directory listings in some cases.
> >>
> >> This version only affects the uncached read, which during parsing adds
> >> one entry at a time to the cached structures by way of copying, and as
> >> such we believe this should be sufficient.
> >>
> >> We're still seeing cases where getdents64 takes ~10s (this was the case
> >> without this patch as well; the difference is that we now get ~500
> >> entries for that time rather than the 14-18 we got previously).  We
> >> believe that latency is introduced on the glusterfs side and it is
> >> under separate discussion with the glusterfs developers.
> >>
> >> This is still a compile-time option, but a working one compared to the
> >> previous patch.  For now this works, but it's not recommended for merge
> >> (as per email discussion).
> >>
> >> This still uses alloc_pages rather than kvmalloc/kvfree.
> >>
> >> Signed-off-by: Jaco Kroon <jaco@xxxxxxxxx>
> >> ---
> >>  fs/fuse/Kconfig   | 16 ++++++++++++++++
> >>  fs/fuse/readdir.c | 18 ++++++++++++------
> >>  2 files changed, 28 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> >> index 038ed0b9aaa5..0783f9ee5cd3 100644
> >> --- a/fs/fuse/Kconfig
> >> +++ b/fs/fuse/Kconfig
> >> @@ -18,6 +18,22 @@ config FUSE_FS
> >>  	  If you want to develop a userspace FS, or if you want to use
> >>  	  a filesystem based on FUSE, answer Y or M.
> >>
> >> +config FUSE_READDIR_ORDER
> >> +	int
> >> +	range 0 5
> >> +	default 5
> >> +	help
> >> +	  readdir performance varies greatly depending on the size of the
> >> +	  read.  Larger buffers result in larger reads, thus fewer reads
> >> +	  and higher performance.
> >> +
> >> +	  You may want to reduce this value on seriously memory-constrained
> >> +	  systems where a 128KiB buffer (assuming 4KiB pages) is not ideal.
> >> +
> >> +	  This value represents the order of the number of pages to
> >> +	  allocate (ie, the shift value).  A value of 0 is thus 1 page
> >> +	  (4KiB), whereas 5 is 32 pages (128KiB).
> >> +
> >>  config CUSE
> >>  	tristate "Character device in Userspace support"
> >>  	depends on FUSE_FS
> >> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> >> index dc603479b30e..47cea4d91228 100644
> >> --- a/fs/fuse/readdir.c
> >> +++ b/fs/fuse/readdir.c
> >> @@ -13,6 +13,12 @@
> >>  #include <linux/pagemap.h>
> >>  #include <linux/highmem.h>
> >>
> >> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> >> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> >> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> >> +
> >>  static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> >>  {
> >>  	struct fuse_conn *fc = get_fuse_conn(dir);
> >> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> >>  	struct fuse_mount *fm = get_fuse_mount(inode);
> >>  	struct fuse_io_args ia = {};
> >>  	struct fuse_args_pages *ap = &ia.ap;
> >> -	struct fuse_page_desc desc = { .length = PAGE_SIZE };
> >> +	struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> >
> > Does this really work?
> > I would've thought we are relying on single page lengths somewhere.
>
> Based on my testing, yes.  Getting just under 500 entries per
> getdents64() call from userspace vs the 14-18 before, I guess the
> answer is yes.
>
> >>  	u64 attr_version = 0;
> >>  	bool locked;
> >>
> >> -	page = alloc_page(GFP_KERNEL);
> >> +	page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> >>  	if (!page)
> >>  		return -ENOMEM;
> >>
> >>  	plus = fuse_use_readdirplus(inode, ctx);
> >>  	ap->args.out_pages = true;
> >> -	ap->num_pages = 1;
> >> +	ap->num_pages = READDIR_PAGES;
> >
> > No.  This is the array length, which is 1.  This is the hack, I guess,
> > which makes the above trick work.
>
> Oh?  So the page referenced above isn't an array of pages?  It's
> actually a single piece of contiguous memory?

Yes.

> > Better use kvmalloc, which might have slightly worse performance
> > than a large page, but definitely not worse than the current single
> > page.
>
> Which returns a void *, not a struct page * - guess this can be
> converted using __virt_to_page (iirc)?

No, it cannot be converted to a page or a page array; use it as just a
piece of kernel memory.  Which means:

 - don't set ->args.out_pages and ->num_pages

 - instead set ->args.out_args[0].value to the allocated pointer

and that should be it (fingers crossed).  There's a rough sketch at the
end of this mail.

> > If we want to optimize the overhead of kvmalloc (and it's a big if)
> > then the parse_dir*file() functions would need to be converted to
> > using a page array instead of a plain kernel pointer, which would add
> > some complexity for sure.
>
> Sorry, I read the above as "I'm surprised this works at all and you're
> not kernel panicking all over the show"; this is probably the most
> ambitious kernel patch I've attempted to date.

Good start, you'll learn.

Thanks,
Miklos
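
PS: here is roughly what I have in mind.  Completely untested, not even
compile tested, so take it as a sketch rather than a finished patch.
READDIR_BUF_SIZE is a made-up placeholder for whatever buffer size
policy you end up with (fixed, Kconfig, runtime tunable, ...):

static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
{
	int plus;
	ssize_t res;
	/* Placeholder name: PAGE_SIZE today, something bigger later. */
	size_t bufsize = READDIR_BUF_SIZE;
	void *buf;
	struct inode *inode = file_inode(file);
	struct fuse_mount *fm = get_fuse_mount(inode);
	struct fuse_io_args ia = {};
	u64 attr_version = 0;
	bool locked;

	/* One virtually contiguous buffer; no struct page array needed. */
	buf = kvmalloc(bufsize, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	plus = fuse_use_readdirplus(inode, ctx);
	if (plus) {
		attr_version = fuse_get_attr_version(fm->fc);
		fuse_read_args_fill(&ia, file, ctx->pos, bufsize,
				    FUSE_READDIRPLUS);
	} else {
		fuse_read_args_fill(&ia, file, ctx->pos, bufsize,
				    FUSE_READDIR);
	}
	/*
	 * Instead of out_pages + num_pages, hand the buffer to the
	 * request directly.  fuse_read_args_fill() has already set
	 * out_args[0].size to bufsize.
	 */
	ia.ap.args.out_args[0].value = buf;

	locked = fuse_lock_inode(inode);
	res = fuse_simple_request(fm, &ia.ap.args);
	fuse_unlock_inode(inode, locked);
	if (res >= 0) {
		if (!res) {
			struct fuse_file *ff = file->private_data;

			if (ff->open_flags & FOPEN_CACHE_DIR)
				fuse_readdir_cache_end(file, ctx->pos);
		} else if (plus) {
			res = parse_dirplusfile(buf, res, file, ctx,
						attr_version);
		} else {
			res = parse_dirfile(buf, res, file, ctx);
		}
	}

	kvfree(buf);
	fuse_invalidate_atime(inode);
	return res;
}

The parse_dirfile()/parse_dirplusfile() helpers already take a plain
kernel pointer (they were being fed page_address(page) before), so
nothing else should need to change.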