On Tue, Feb 07, 2023 at 12:18:01AM -0300, Leonardo Brás wrote:
> On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> > Hi Leonardo!
> 
> Hello Roman,
> Thanks a lot for replying!
> 
> > > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > > contention, for a "sometimes we hit spinlock contention".
> > > 
> > > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > > is:
> > > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > > 2 - wait a cache hit for the local per-cpu cacheline
> > > 
> > > With what is currently implemented (the schedule_work_on() approach), on the
> > > local cpu side there is *always* this contention:
> > > 1 - wait for a context switch,
> > > 2 - wait a cache hit from its local per-cpu cacheline,
> > > 3 - wait a complete <percpu cache drain process>,
> > > 4 - then wait for a new context switch back to the current thread.
> > 
> > I think both Michal and me are thinking of a more generic case in which the cpu
> > is not exclusively consumed by 1 special process, so that the draining work can
> > be executed during an idle time. In this case the work is basically free.
> 
> Oh, it makes sense.
> But in such scenarios, wouldn't the same happen to spinlocks?
> 
> I mean, most of the contention with spinlocks only happens if the remote cpu is
> trying to drain the cache while the local cpu happens to be draining/charging,
> which is quite rare due to how fast the local cpu operations are.
> 
> Also, if the cpu has some idle time, using a little more of it on a possible
> spinlock contention should not be a problem. Right?
> 
> > And the introduction of a spin_lock() on the hot path is what we are concerned
> > about. I agree that on some hardware platforms it won't be that expensive,
> 
> IIRC most multicore hardware platforms supported by the kernel should have
> the same behavior, since it's better to rely on cache coherence than locking the
> memory bus.
> 
> For instance, the other popular architectures supported by Linux use the LR/SC
> strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the
> LoadReserve slow part waits for cacheline exclusivity, which this per-CPU
> structure already has.
> 
> > but in general not having any spinlocks is so much better.
> 
> I agree that spinlocks may bring contention, which is not ideal in many cases.
> In this case, though, it may not be a big issue, due to very rare remote access
> in the structure, for the usual case (non-pre-OOMCG)

Hi Leonardo!

I made a very simple test: replaced the pcp local_lock with a spinlock and ran
a very simple kernel memory accounting benchmark (attached below) on my
desktop PC (Intel Core i9-7940X).

Original (3 attempts):
81341 us
80973 us
81258 us

Patched (3 attempts):
99918 us
100220 us
100291 us

This is without any contention and without CONFIG_LOCKDEP.

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49f67176a1a2..bafd3cde4507 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6563,6 +6563,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_alloc_test(struct seq_file *m, void *v)
+{
+	unsigned long i, j;
+	void **ptrs;
+	ktime_t start, end;
+	s64 delta, min_delta = LLONG_MAX;
+
+	ptrs = kvmalloc(sizeof(void *) * 1024 * 1024, GFP_KERNEL);
+	if (!ptrs)
+		return -ENOMEM;
+
+	for (j = 0; j < 100; j++) {
+		start = ktime_get();
+		for (i = 0; i < 1024 * 1024; i++)
+			ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
+		end = ktime_get();
+
+		delta = ktime_us_delta(end, start);
+		if (delta < min_delta)
+			min_delta = delta;
+
+		for (i = 0; i < 1024 * 1024; i++)
+			kfree(ptrs[i]);
+	}
+
+	kvfree(ptrs);
+	seq_printf(m, "%lld us\n", min_delta);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -6758,6 +6789,10 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "test",
+		.seq_show = memory_alloc_test,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",