On Tue, Feb 07, 2023 at 12:18:01AM -0300, Leonardo Brás wrote:
> On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> > Hi Leonardo!
> 
> Hello Roman,
> Thanks a lot for replying!
> 
> > > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > > contention, for a "sometimes we hit spinlock contention".
> > > 
> > > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > > is:
> > > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > > 2 - wait a cache hit for the local per-cpu cacheline
> > > 
> > > With what is currently implemented (the schedule_work_on() approach), on the
> > > local cpu side there is *always* this contention:
> > > 1 - wait for a context switch,
> > > 2 - wait a cache hit from its local per-cpu cacheline,
> > > 3 - wait a complete <percpu cache drain process>,
> > > 4 - then wait for a new context switch back to the current thread.
> > 
> > I think both Michal and me are thinking of a more generic case in which the cpu
> > is not exclusively consumed by 1 special process, so that the draining work can
> > be executed during an idle time. In this case the work is basically free.
> 
> Oh, it makes sense.
> But in such scenarios, wouldn't the same happen to spinlocks?
> 
> I mean, most of the contention with spinlocks only happens if the remote cpu is
> trying to drain the cache while the local cpu happens to be draining/charging,
> which is quite rare due to how fast the local cpu operations are.
> 
> Also, if the cpu has some idle time, using a little more of it on a possible
> spinlock contention should not be a problem. Right?
> 
> > And the introduction of a spin_lock() on the hot path is what we are concerned
> > about. I agree that on some hardware platforms it won't be that expensive,
> 
> IIRC most multicore hardware platforms supported by the kernel should have
> the same behavior, since it's better to rely on cache coherence than locking the
> memory bus.
> 
> For instance, the other popular architectures supported by Linux use the LR/SC
> strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the
> LoadReserve slow part waits for cacheline exclusivity, which this per-CPU
> structure already has.
> 
> > but in general not having any spinlocks is so much better.
> 
> I agree that spinlocks may bring contention, which is not ideal in many cases.
> In this case, though, it may not be a big issue, due to very rare remote access
> in the structure, for the usual case (non-pre-OOMCG)

Hi Leonardo!

I made a very simple test: replaced the pcp local_lock with a spinlock and ran
a very simple kernel memory accounting benchmark (attached below) on my
desktop PC (Intel Core i9-7940X).

Original (3 attempts):
81341 us
80973 us
81258 us

Patched (3 attempts):
99918 us
100220 us
100291 us

This is without any contention and without CONFIG_LOCKDEP.

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49f67176a1a2..bafd3cde4507 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6563,6 +6563,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_alloc_test(struct seq_file *m, void *v)
+{
+	unsigned long i, j;
+	void **ptrs;
+	ktime_t start, end;
+	s64 delta, min_delta = LLONG_MAX;
+
+	ptrs = kvmalloc(sizeof(void *) * 1024 * 1024, GFP_KERNEL);
+	if (!ptrs)
+		return -ENOMEM;
+
+	for (j = 0; j < 100; j++) {
+		start = ktime_get();
+		for (i = 0; i < 1024 * 1024; i++)
+			ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
+		end = ktime_get();
+
+		delta = ktime_us_delta(end, start);
+		if (delta < min_delta)
+			min_delta = delta;
+
+		for (i = 0; i < 1024 * 1024; i++)
+			kfree(ptrs[i]);
+	}
+
+	kvfree(ptrs);
+	seq_printf(m, "%lld us\n", min_delta);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -6758,6 +6789,10 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "test",
+		.seq_show = memory_alloc_test,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",