On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching; this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of
>> memory. A few years ago percpu depopulation was implemented, so that
>> limits the amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common-sized
>> allocations so that they can be recycled, so that we don't have to
>> think too hard about the fragmentation that can occur if we populate
>> these pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except it hands out per-cpu bufs. So far it has sizes
> 4, 8, 16, 32 and 64, and you can act as if you are mallocing in that
> size.
>
> It scales perfectly fine of course, since it caches objs per-CPU, but
> there is some waste and I have no idea how it compares to what Linux
> is doing on that front.
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity better than global, but less than per-CPU, would
> be a sweet spot for scalability vs memory waste.
>
> That said...
>
>> Also, as you've pointed out, it wasn't just the percpu allocation
>> being the bottleneck, but percpu_counter's global lock too, for
>> hotplug support. I'm hazarding a guess most use cases of percpu might
>> have additional locking requirements too, such as percpu_counter.
>>
>
> The True Fix(tm) is a longer story.
>
> Maybe let's sort out this patchset first, whichever way. :)
>

So I found the discussion around the original patch, with a perf
regression report:

https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/

The reporter suggests dodging the problem by only allocating per-cpu
counters when the process goes multithreaded. Given that there are
still plenty of forever-single-threaded procs out there, I think that
does sound like a great plan regardless of what happens with this
patchset.

Almost all access is already done using dedicated routines, so this
should be an afternoon of churn to sort out, unless I missed a
showstopper. (Maybe there is no good place to stuff a flag or some
other indicator of the state of the counters?)

That said, I'll look into it some time this or next week.

>> Thanks,
>> Dennis
>>
>> > Thoughts?
>> >
>> > Mateusz Guzik (2):
>> >   pcpcntr: add group allocation/free
>> >   fork: group allocation of per-cpu counters for mm struct
>> >
>> >  include/linux/percpu_counter.h | 19 ++++++++---
>> >  kernel/fork.c                  | 13 ++------
>> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>> >  3 files changed, 60 insertions(+), 33 deletions(-)
>> >
>> > --
>> > 2.39.2
>> >

--
Mateusz Guzik <mjguzik gmail.com>
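
P.S. To make the size-class idea concrete, here is roughly the shape I
mean -- a sketch only, not a patch. All the names here
(pcpu_cached_alloc, pcpu_caches, the PCPU_CACHE_* constants) are made
up; only __alloc_percpu_gfp() and the get_cpu_ptr()/put_cpu_ptr()
helpers are existing API, and the free path (which would push bufs back
onto the per-CPU stack) is elided:

#include <linux/percpu.h>
#include <linux/log2.h>

#define PCPU_CACHE_CLASSES	5	/* size classes 4, 8, 16, 32, 64 */
#define PCPU_CACHE_DEPTH	16	/* recycled bufs kept per class */

struct pcpu_size_cache {
	unsigned int	nr;
	void __percpu	*objs[PCPU_CACHE_DEPTH];
};

static DEFINE_PER_CPU(struct pcpu_size_cache, pcpu_caches[PCPU_CACHE_CLASSES]);

static void __percpu *pcpu_cached_alloc(size_t size, gfp_t gfp)
{
	/* Round up to the nearest class: 4 -> 0, 8 -> 1, ..., 64 -> 4. */
	int class = order_base_2(max_t(size_t, size, 4)) - 2;
	struct pcpu_size_cache *c;
	void __percpu *p = NULL;

	if (class >= PCPU_CACHE_CLASSES)
		return __alloc_percpu_gfp(size, size, gfp);

	/* get_cpu_ptr() disables preemption, so the cache stays ours. */
	c = get_cpu_ptr(&pcpu_caches[class]);
	if (c->nr)
		p = c->objs[--c->nr];
	put_cpu_ptr(&pcpu_caches[class]);

	if (!p)	/* cache miss: fall back to the real allocator */
		p = __alloc_percpu_gfp(4U << class, 4U << class, gfp);
	return p;
}

The point being that the common path only touches this CPU's stack of
recycled pointers; the allocator's global state is reached only on a
cache miss.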
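
And the single-threaded dodge could look something along these lines --
again just a sketch: the rss_counters_percpu flag, the rss_stat_plain
array and the helper names are invented, while percpu_counter_init()
and friends are the real API:

#include <linux/mm_types.h>
#include <linux/percpu_counter.h>

/*
 * Hot path: single-threaded processes use a plain atomic; upgraded
 * mms use the per-cpu counter.  The flag is written exactly once,
 * before the mm is ever shared, so a racy READ_ONCE() is fine.
 */
static inline void mm_counter_add(struct mm_struct *mm, int member, long value)
{
	if (READ_ONCE(mm->rss_counters_percpu))
		percpu_counter_add(&mm->rss_stat[member], value);
	else
		atomic_long_add(value, &mm->rss_stat_plain[member]);
}

/*
 * Called when the mm is about to become shared (CLONE_VM) and is
 * still visible to one thread only; seeds the per-cpu counters
 * with the values accumulated so far.
 */
static int mm_counters_upgrade(struct mm_struct *mm)
{
	int i, err;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		err = percpu_counter_init(&mm->rss_stat[i],
				atomic_long_read(&mm->rss_stat_plain[i]),
				GFP_KERNEL);
		if (err)
			return err;	/* error unwinding elided */
	}
	WRITE_ONCE(mm->rss_counters_percpu, true);
	return 0;
}

Since the upgrade happens before the mm can be seen by a second thread,
forever-single-threaded procs never pay for the per-cpu alloc at all.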