On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote: >> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote: >> > With this out of the way I'll be looking at some form of caching to >> > eliminate these allocs as a problem. >> > >> >> I'm not against caching, this is just my first thought. Caching will >> have an impact on the backing pages of percpu. All it takes is 1 >> allocation on a page for the current allocator to pin n pages of memory. >> A few years ago percpu depopulation was implemented so that limits the >> amount of resident backing pages. >> > > I'm painfully aware. > >> Maybe the right thing to do is preallocate pools of common sized >> allocations so that way they can be recycled such that we don't have to >> think too hard about fragmentation that can occur if we populate these >> pools over time? >> > > This is what I was going to suggest :) > > FreeBSD has a per-cpu allocator which pretends to be the same as the > slab allocator, except handing out per-cpu bufs. So far it has sizes 4, > 8, 16, 32 and 64 and you can act as if you are mallocing in that size. > > Scales perfectly fine of course since it caches objs per-CPU, but there > is some waste and I have 0 idea how it compares to what Linux is doing > on that front. > > I stress though that even if you were to carve out certain sizes, a > global lock to handle ops will still kill scalability. > > Perhaps granularity better than global, but less than per-CPU would be a > sweet spot for scalabability vs memory waste. > > That said... > >> Also as you've pointed out, it wasn't just the percpu allocation being >> the bottleneck, but percpu_counter's global lock too for hotplug >> support. I'm hazarding a guess most use cases of percpu might have >> additional locking requirements too such as percpu_counter. >> > > True Fix(tm) is a longer story. > So I had a meh sleep and I think I figured it out. Key insight is that with some hackery we can move counters *in active use* to different pagesets. More, I found the crux of the hackery is already used by percpu refcounting so I don't think it would be seen as too ugly (keyword: RCU). :) The word "allocate" is way overloaded for this discussion, so I'm going to use different terms for each use case. These would be: backing percpu machinery -- it *assigns* memory for use by caches caching layer -- it *hands out* memory to final consumers, memory is *cached* if not handed out Also the collection of per-cpu pages would be referred to as "pageset" (already used above). Say caches are per-cpu and are able to satisfy all requests as is. This means there is no contention on the backing percpu store. Mass import/export to/from caches can also handle the hotplug list -- everything cached or handed out can reside on it. Then handing out/returning a counter does not have to synchronize anything with other CPUs. While this obviously scales, there is the problem of memory waste you mentioned. For cached objects things are easy -- they can be unassigned *or* moved elsewhere, for example trade 2 counters holding a pageset hostage for 2 counters in a pageset which is already in heavy use. Nothing new in the above and for the sake of argument let's say this is not enough when it comes to fighting memory waste. Say we got really unlucky and there is a bunch of processes where each one happens to have counters in an otherwise empty pageset. So that's cpucount * processes pages spent when they all perhaps could fit in one * cpucount and that's not acceptable. The idea boils down to temporarily disabling per-cpu operation, waiting for everyone to observe the new state, changing the pointer to percpu stuff and re-enabling. A simplified remark is that this can be done with RCU, wrapping the regular consumers with rcu_read_lock and invoking synchronize_rcu to wait for them (there is more to it though, some fences may be needed). To sum up, it is very much possible to avoid any memory waste *and* have perfectly scalable percpu counter init/destroy, while adding next to no overhead to consumers of said counters. That said, I don't know if the complexity is worth it assuming the "only use percpu counters if going multithreaded" idea gets implemented. -- Mateusz Guzik <mjguzik gmail.com>