On Mon, Mar 22, 2021 at 02:57:18PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 22, 2021 at 06:03:21PM +0900, Sergey Senozhatsky wrote:
> > On (21/03/22 08:15), Matthew Wilcox wrote:
> > >
> > > What's the scenario for which your allocator performs better than slub
> > >
> >
> > IIRC request and reply buffers can be up to 4M in size. So this stuff
> > just allocates a number of fat buffers and keeps them around so that
> > it doesn't have to vmalloc(4M) for every request and every response.
>
> Do we have any data suggesting it is faster than vmalloc?

Oh, I have no trouble believing it's faster than vmalloc.  Here's
the fast(!) path that always has memory available, never does retries.
I'm calling out the things I perceive as expensive on the right hand side.
Also, I'm taking the 4MB size as the example.

vmalloc()
  __vmalloc_node()
    __vmalloc_node_range()
      __get_vm_area_node()
                                [allocates vm_struct]
        alloc_vmap_area()
                                [allocates vmap_area]
                                [takes free_vmap_area_lock]
          __alloc_vmap_area()
            find_vmap_lowest_match
                                [walks free_vmap_area_root]
                                [takes vmap_area_lock]
      __vmalloc_area_node()
                                ... array_size is 8KiB, we call __vmalloc_node
        __vmalloc_node
                                [everything we did above, all over again,
                                 two more allocations, two more lock
                                 acquisitions]
        alloc_pages_node(), 1024 times
        vmap_pages_range_noflush()
          vmap_range_noflush()
                                [allocate at least two pages for PTEs]

There's definitely some low-hanging fruit here.  __vmalloc_area_node()
should probably call kvmalloc_node() instead of __vmalloc_node() for
table sizes > 4KiB.  But a lot of this is inherent to how vmalloc works,
and we need to put a cache in front of it.  Just not this one.
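
For reference, the "1024 times" and "array_size is 8KiB" figures in the
trace follow from simple arithmetic.  The sketch below is plain userspace
C, not kernel code.  It assumes 4KiB pages and 8-byte pointers (a typical
64-bit configuration, not stated in the mail), and the sketch_* names are
hypothetical helpers made up for illustration.

```c
#include <stdbool.h>

/* Assumptions, not taken from the mail: 4 KiB pages, 8-byte pointers. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PTR_SIZE  8UL

/* Number of pages backing an allocation of `size` bytes, rounded up. */
static unsigned long sketch_nr_pages(unsigned long size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Bytes needed for the array holding one page pointer per page. */
static unsigned long sketch_page_array_bytes(unsigned long size)
{
	return sketch_nr_pages(size) * SKETCH_PTR_SIZE;
}

/*
 * True when the page-pointer array is larger than a single page.
 * This is the "table sizes > 4KiB" case from the mail, where the
 * trace shows the array itself being allocated via __vmalloc_node().
 */
static bool sketch_array_exceeds_page(unsigned long size)
{
	return sketch_page_array_bytes(size) > SKETCH_PAGE_SIZE;
}
```

For a 4MiB buffer these helpers give 1024 pages and an 8KiB pointer
array, matching the trace.  Under the same assumptions, any request
larger than 2MiB ends up with a pointer array bigger than one page.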