Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer

On Mon, Feb 24, 2025 at 10:02:09AM -0800, Shakeel Butt wrote:
> What about pre-memcg-charged sheaves? We had to disable memcg charging
> of some kernel allocations and I think sheaves can help in reenabling
> it.

It has been several months since I last looked at memcg, so the details
are fuzzy and I don't have time to refresh everything.

However, if memory serves right, the primary problem was the irq on/off
trip associated with charging (sometimes happening twice, the second
time in refill_obj_stock()).

I think the real fix(tm) would be to recognize that only some
allocations need interrupt safety -- as in, some slabs should not be
allowed to be used outside of process context. This is somewhat what
sheaves is doing, but it could be applied without fronting the current
kmem caching mechanism. This may be a tough sell, and even then it
turns into whack-a-mole, patching up all the consumers.

Suppose it is not an option.

Then there are two approaches I considered.

The easier one splits memcg accounting between irq and process level --
similar to what the localtry machinery is doing. This would only cost a
preemption off/on trip in the common case plus a branch on the current
state.
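To make it concrete, a minimal sketch in C; task_stock, irq_stock and
pick_stock() are invented names for illustration, not existing code:

/*
 * Hypothetical: one stock per context, so the process-level path never
 * needs to disable interrupts just to keep the irq-level path out.
 */
static DEFINE_PER_CPU(struct memcg_stock_pcp, task_stock);
static DEFINE_PER_CPU(struct memcg_stock_pcp, irq_stock);

static struct memcg_stock_pcp *pick_stock(void)
{
	/* the branch on the current state; caller has preemption disabled */
	if (likely(in_task()))
		return this_cpu_ptr(&task_stock);
	return this_cpu_ptr(&irq_stock);
}

But suppose this is a no-go as well.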

My primary idea was to use hand-rolled sequence counters and a local
8-byte cmpxchg (*without* the lock prefix, and not to be confused with
the 16-byte variant used by the current slub fast path). Should this
work, it would be significantly faster than the irq trips.

The irq toggling is only there so that several fields can be updated,
or the memcg itself replaced, atomically with respect to process vs
interrupt context.

The observation is that all the values updated on the fast path are 4
bytes. An additional 4-byte counter can then be placed next to each
one, so that an 8-byte cmpxchg fails should an irq swoop in and change
things from under us.
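
Roughly like this; stock_word and stock_word_update() are made-up
names, and I'm assuming x86-64, where this_cpu_cmpxchg() compiles to a
cmpxchg without the lock prefix (the single instruction cannot be
interrupted, which is all per-cpu data needs):

/*
 * Each 4-byte fast-path value shares 8 bytes with a sequence counter;
 * the cmpxchg covers both halves at once.
 */
struct stock_word {
	union {
		struct {
			u32 val;	/* e.g. cached nr_bytes */
			u32 seq;	/* bumped by any slow-path update */
		};
		u64 word;
	};
};

/* Succeeds only if neither val nor seq changed since they were read. */
static bool stock_word_update(struct stock_word __percpu *sw,
			      u32 old_val, u32 seq, u32 new_val)
{
	u64 old = (u64)seq << 32 | old_val;
	u64 new = (u64)seq << 32 | new_val;

	return this_cpu_cmpxchg(sw->word, old, new) == old;
}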

The percpu state would have a sequence counter associated with the
assigned memcg_stock_pcp. The memcg_stock_pcp object would have the
same value replicated inside, next to every variable which can be
updated on the fast path.

Then the fast path would only succeed if the sequence counter read off
from per-cpu still matches what is stored in the stock object.
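
For the cached bytes that could look like the following; stock_seq and
consume_stock_fast() are again invented, and I'm assuming
memcg_stock_pcp grew a struct stock_word member as sketched above:

static DEFINE_PER_CPU(u32, stock_seq);

static bool consume_stock_fast(struct memcg_stock_pcp __percpu *stock,
			       unsigned int nr_bytes)
{
	u32 seq = this_cpu_read(stock_seq);
	u32 cached = this_cpu_read(stock->bytes.val);

	if (cached < nr_bytes)
		return false;	/* fall back to the irq-disabling slow path */

	/*
	 * If anything changed the stock between the reads above and
	 * here, the cmpxchg fails and we fall back as well.
	 */
	return stock_word_update(&stock->bytes, cached, seq,
				 cached - nr_bytes);
}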

Any change to memcg_stock_pcp (e.g., rolling up bytes after passing the
page size threshold) would disable interrupts and modify all these
counters.
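
That side could be as simple as the following sketch, same caveat about
invented names:

/*
 * Any slow-path modification bumps the per-cpu sequence counter and
 * re-replicates it into every fast-path word, failing the cmpxchg of
 * any fast path which raced with us.
 */
static void stock_update_slow(struct memcg_stock_pcp *stock)
{
	unsigned long flags;
	u32 seq;

	local_irq_save(flags);
	seq = __this_cpu_inc_return(stock_seq);
	stock->bytes.val = 0;	/* e.g. roll the cached bytes up */
	stock->bytes.seq = seq;
	/* ... same for every other fast-path word ... */
	local_irq_restore(flags);
}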

There is some more work needed to make sure the stock object can be
safely swapped out for a new one without accidentally ending up with a
sequence value which lines up with the previous one; I don't remember
what I had in mind for that (and yes, I recognize a 4-byte value will
invariably roll over, so *in principle* a conflict is possible).

This is a rough outline, written down since Vlasta keeps prodding me
about it.

That said, maybe someone will have a better idea. The above is up for
grabs if someone wants to do it; I can't commit to looking at it.



