On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > The problem is that calculate_order() is called a number of times
> >> > before secondary CPUs are booted and it returns 1 instead of 224.
> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >
> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> >> > the regression disappears:
> >> >
> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?

Uuuh, I think powerpc doesn't have cmpxchg_double?

"vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
<https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
says under "POWERPC": "no DW LL/SC"

So powerpc is probably hitting the page-bitlock-based implementation
all the time for stuff like __slab_free()? Do you have detailed
profiling results from "perf top" or something like that?

(I actually have some WIP patches and a design document for getting
rid of cmpxchg_double in struct page that I hacked together in the
last couple days; I'm currently in the process of sending them over to
some other folks in the company who hopefully have cycles to
review/polish/benchmark them so that they can be upstreamed, assuming
that those folks think they're important enough. I don't have the
cycles for it...)
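
For reference, the slub_min_objects arithmetic quoted above can be checked with a small userspace sketch. This is not the kernel code: fls_sketch() and min_objects_for() are hypothetical names, and fls_sketch() is only a portable stand-in that assumes the usual fls() semantics (1-based index of the highest set bit, 0 for an input of 0).

```c
/*
 * Userspace sketch of the quoted formula 4 * (fls(num_online_cpus()) + 1).
 * Both helpers are illustrative stand-ins, not functions from mm/slub.c.
 */

/* Assumed fls() semantics: 1-based position of the highest set bit, 0 if x == 0. */
static int fls_sketch(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Minimum objects per slab for a given online CPU count, per the quoted formula. */
static unsigned int min_objects_for(unsigned int online_cpus)
{
	return 4 * (fls_sketch(online_cpus) + 1);
}
```

With 224 CPUs online, fls_sketch(224) is 8, so min_objects_for(224) is 36, matching the "slub_min_objects=36" value in the quoted report. With only the boot CPU online, min_objects_for(1) is 8, which illustrates how much smaller the result is when the calculation runs before the secondary CPUs are up.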