On Wed, Jan 17, 2024 at 2:20 PM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> On Wed, Jan 17, 2024 at 01:02:19PM -0800, Shakeel Butt wrote:
> > On Wed, Jan 17, 2024 at 12:21 PM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 17 Jan 2024 at 11:39, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > > >
> > > > That's a good point. If the microbenchmark isn't likely to be even
> > > > remotely realistic, maybe we should just revert the revert until if/when
> > > > somebody shows a real world impact.
> > > >
> > > > Linus, any objections to that?
> > >
> > > We use SLAB_ACCOUNT for much more common allocations like queued
> > > signals, so I would tend to agree with Jeff that it's probably just
> > > some not very interesting microbenchmark that shows any file locking
> > > effects from SLAB_ALLOC, not any real use.
> > >
> > > That said, those benchmarks do matter. It's very easy to say "not
> > > relevant in the big picture" and then the end result is that
> > > everything is a bit of a pig.
> > >
> > > And the regression was absolutely *ENORMOUS*. We're not talking "a few
> > > percent". We're talking a 33% regression that caused the revert:
> > >
> > > https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
> > >
> > > I wish our SLAB_ACCOUNT wasn't such a pig. Rather than account every
> > > single allocation, it would be much nicer to account at a bigger
> > > granularity, possibly by having per-thread counters first before
> > > falling back to the obj_cgroup_charge. Whatever.
> > >
> > > It's kind of stupid to have a benchmark that just allocates and
> > > deallocates a file lock in quick succession spend lots of time
> > > incrementing and decrementing cgroup charges for that repeated
> > > alloc/free.
> > >
> > > However, that problem with SLAB_ACCOUNT is not the fault of file
> > > locking, but more of a slab issue.
> > >
> > > End result: I think we should bring in Vlastimil and whoever else is
> > > doing SLAB_ACCOUNT things, and have them look at that side.
> > >
> > > And then just enable SLAB_ACCOUNT for file locks. But very much look
> > > at silly costs in SLAB_ACCOUNT first, at least for trivial
> > > "alloc/free" patterns..
> > >
> > > Vlastimil? Who would be the best person to look at that SLAB_ACCOUNT
> > > thing? See commit 3754707bcc3e (Revert "memcg: enable accounting for
> > > file lock caches") for the history here.
> > >
> >
> > Roman last looked into optimizing this code path. I suspect
> > mod_objcg_state() to be more costly than obj_cgroup_charge(). I will
> > try to measure this path and see if I can improve it.
>
> It's roughly an equal split between mod_objcg_state() and obj_cgroup_charge().
> And each is comparable (by order of magnitude) to the slab allocation cost
> itself. On the free() path a significant cost comes simple from reading
> the objcg pointer (it's usually a cache miss).
>
> So I don't see how we can make it really cheap (say, less than 5% overhead)
> without caching pre-accounted objects.
>
> I thought about merging of charge and stats handling paths, which _maybe_ can
> shave off another 20-30%, but there still will be a double-digit% accounting
> overhead.
>
> I'm curious to hear other ideas and suggestions.
>
> Thanks!

I profiled (perf record -a) the same benchmark, i.e. lock1_processes, on
an Ice Lake machine with 72 cores and got the following results:

  12.72%  lock1_processes  [kernel.kallsyms]  [k] mod_objcg_state
  10.89%  lock1_processes  [kernel.kallsyms]  [k] kmem_cache_free
   8.40%  lock1_processes  [kernel.kallsyms]  [k] slab_post_alloc_hook
   8.36%  lock1_processes  [kernel.kallsyms]  [k] kmem_cache_alloc
   5.18%  lock1_processes  [kernel.kallsyms]  [k] refill_obj_stock
   5.18%  lock1_processes  [kernel.kallsyms]  [k] _copy_from_user

Annotating mod_objcg_state() shows the following irq-disabling
instructions taking about 30% of its time:

   6.64 │ pushfq
  10.26 │ popq   -0x38(%rbp)
   6.05 │ mov    -0x38(%rbp),%rcx
   7.60 │ cli

For kmem_cache_free() and kmem_cache_alloc(), the following instruction,
which corresponds to __update_cpu_freelist_fast(), was expensive:

  16.33 │ cmpxchg16b %gs:(%rsi)

For slab_post_alloc_hook(), the cost is spread all over the function, and
refill_obj_stock() looks very similar to mod_objcg_state().

I will dig more in the next couple of days.
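
To make the per-thread counter idea from Linus's mail above a bit more
concrete, here is a rough userspace sketch in C. It is not kernel code:
struct group_counter, struct local_stock, CHARGE_BATCH and all function
names are made up for illustration, and the atomics only stand in for the
shared charge path. The point is just that a tight alloc/free loop can be
served from a thread-local stock of pre-charged bytes, which in principle
needs no irq disabling because no other context touches it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bytes pre-charged per trip to the shared counter; arbitrary value. */
#define CHARGE_BATCH 4096

/* Hypothetical stand-in for the shared per-cgroup charge counter. */
struct group_counter {
	atomic_long charged;    /* bytes currently charged to the group */
	atomic_long slow_calls; /* how often the shared counter was touched */
};

/* Hypothetical per-thread stock of bytes already charged to the group. */
struct local_stock {
	struct group_counter *grp;
	long cached;            /* pre-charged bytes not backing any object */
};

/* Slow path: the only place that touches shared state. */
static void slow_charge(struct group_counter *g, long bytes)
{
	atomic_fetch_add(&g->charged, bytes);
	atomic_fetch_add(&g->slow_calls, 1);
}

/* Fast path: consume from the local stock, refill in a batch on a miss. */
static void local_charge(struct local_stock *s, long bytes)
{
	assert(bytes > 0);
	if (s->cached < bytes) {
		long refill = bytes > CHARGE_BATCH ? bytes : CHARGE_BATCH;

		slow_charge(s->grp, refill);
		s->cached += refill;
	}
	s->cached -= bytes;
}

/* Fast path: return bytes to the local stock, trim it if it grows too big. */
static void local_uncharge(struct local_stock *s, long bytes)
{
	assert(bytes > 0);
	s->cached += bytes;
	if (s->cached > 2 * CHARGE_BATCH) {
		long excess = s->cached - CHARGE_BATCH;

		slow_charge(s->grp, -excess);
		s->cached -= excess;
	}
}

/* Give everything back to the group, e.g. when the thread exits. */
static void local_drain(struct local_stock *s)
{
	if (s->cached) {
		slow_charge(s->grp, -s->cached);
		s->cached = 0;
	}
}
```

With a loop of paired local_charge(&s, 200) / local_uncharge(&s, 200)
calls, only the very first charge reaches the shared counter, and
local_drain() hands the remainder back afterwards. Whether something like
this is workable in the kernel (allocations from irq context, objects
freed by a different thread, stats updates) is a separate question.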