Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> Hi Leonardo!

Hello Roman,
Thanks a lot for replying!

> 
> > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > contention, for a "sometimes we hit spinlock contention".
> > 
> > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > is:
> > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > 2 - wait a cache hit for local per-cpu cacheline 
> > 
> > What is current implemented (schedule_work_on() approach), for the local
> > cpu side there is *always* this contention:
> > 1 - wait for a context switch,
> > 2 - wait a cache hit from it's local per-cpu cacheline,
> > 3 - wait a complete <percpu cache drain process>, 
> > 4 - then for a new context switch to the current thread.
> 
> I think both Michal and me are thinking of a more generic case in which the cpu
> is not exclusively consumed by 1 special process, so that the draining work can
> be executed during an idle time. In this case the work is basically free.

Oh, that makes sense.
But in such a scenario, wouldn't the same apply to spinlocks?

I mean, most of the spinlock contention only happens when the remote cpu tries
to drain the cache while the local cpu happens to be draining/charging, which is
quite rare given how fast the local cpu operations are.

Also, if the cpu has some idle time, spending a little of it on possible
spinlock contention should not be a problem, right?
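Just so we are looking at the same thing, here is roughly how I picture the
locking on both paths. This is only a pared-down sketch, not the actual patch:
field and function names (lock, drain_remote_stock(), ...) are illustrative,
and irq/percpu details are omitted.

struct memcg_stock_pcp {
	spinlock_t lock;		/* protects cached and nr_pages */
	struct mem_cgroup *cached;
	unsigned int nr_pages;
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

/* Local (hot) path: the lock is almost always uncontended and the
 * cacheline is already exclusive to this cpu, so the cost is ~zero. */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
	bool ret = false;

	spin_lock(&stock->lock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}
	spin_unlock(&stock->lock);
	return ret;
}

/* Remote path: drain another cpu's stock directly instead of
 * scheduling work on that cpu. */
static void drain_remote_stock(int cpu)
{
	struct memcg_stock_pcp *stock = per_cpu_ptr(&memcg_stock, cpu);

	spin_lock(&stock->lock);
	/* ... return stock->nr_pages to stock->cached's page counter ... */
	stock->nr_pages = 0;
	spin_unlock(&stock->lock);
}

The only way the local cpu ever waits is if it enters consume_stock() exactly
while a remote drain holds the lock, which is the rare case described above.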

> 
> And the introduction of a spin_lock() on the hot path is what we're are concerned
> about. I agree, that on some hardware platforms it won't be that expensive, 
> 

IIRC most multicore hardware platforms supported by the kernel should behave the
same way here, since it's better to rely on cache coherence than on locking the
memory bus.

For instance, the other popular architectures supported by Linux use the LR/SC
strategy for atomic operations (tested on ARM, POWER, RISCV), and IIRC the
Load-Reserve slow part is waiting for cacheline exclusivity, which is already
the case for this per-CPU structure.
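To illustrate the point with a toy example (userspace C11, not kernel code): a
trylock like the one below compiles to a load-reserved/store-conditional loop on
ARM, POWER and RISC-V, and the load-reserved only has to fetch the cacheline
into the exclusive state if it is not already there -- which, for the local
per-cpu stock, it normally is.

#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin_trylock(): on LL/SC architectures the compare-exchange
 * becomes a load-reserved / store-conditional pair. */
static bool toy_trylock(atomic_int *lock)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}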


> but in general not having any spinlocks is so much better.

I agree that spinlocks may bring contention, which is not ideal in many cases.
In this case, though, it may not be a big issue, since remote access to the
structure is very rare in the usual (non-pre-OOM memcg) case.

> 
> > 
> > So moving from schedule_work_on() to spinlocks will save 2 context switches per
> > cpu every time drain_all_stock() is called.
> > 
> > On the remote cpu side, my tests point that doing the remote draining is faster
> > than scheduling a local draining, so it's also a gain.
> > 
> > Also, IIUC the possible contention in the spinlock approach happens only on
> > page-faulting and syscalls, versus the schedule_work_on() approach that can
> > interrupt user workload at any time. 
> > 
> > In fact, not interrupting the user workload in isolated cpus is just a bonus of
> > using spinlocks.
> 
> I believe it significantly depends on the preemption model: you're right regarding
> fully preemptive kernels, but with voluntary/none preemption it's exactly opposite:
> the draining work will be executed at some point later (probably with 0 cost),

So that's the case of voluntary/none preemption with some free cpu time.

> while the remote access from another cpu will potentially cause delays on the
> spin lock as well as a need to refill the stock.

But if there is some free CPU time, what is the issue with some (potential)
delay due to spinlock contention?

I am probably missing the whole picture, but when I think of performance
improvement, I think of doing more with the same cputime. If we can use free
cputime to do stuff later, it's only fair to also use it in case of contention,
right?

I know there are some cases that need to be more predictable (mostly RT), but
when I think of memory allocation, I don't expect it to always take the same
time (since there are caches, pre-OOM handling, and so on).

Also, as previously discussed, in case of a busy cpu, the spinlock approach will
probably allow more work to be done.
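To make the "more work done" point concrete, this is the shape of the current
remote path, heavily simplified from drain_all_stock() (the should-drain check
is folded into a hypothetical stock_should_be_drained() helper, and the
flags/rcu handling is omitted):

for_each_online_cpu(cpu) {
	struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

	if (!stock_should_be_drained(stock, root_memcg))
		continue;

	if (cpu == curcpu)
		drain_local_stock(&stock->work);
	else
		/* two extra context switches on the remote cpu */
		schedule_work_on(cpu, &stock->work);
}

With the spinlock, the schedule_work_on() branch becomes a direct
drain_remote_stock(cpu) call (as in the sketch above), so on a busy cpu the
drain does not preempt the running workload at all.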

> 
> Overall I'd expect a noticeable performance regression from an introduction of
> spin locks and remote draining. Maybe not on all platforms, but at least on some.
> That's my main concern.
> 

I see.
For the platform I have tested (x86) I noticed better overall performance with
spinlocks than with the upstream solution. For other popular platforms, I have
briefly read some documentation on locking/atomicity and I think the
performance gains should hold.

But to be sure, I could rerun the tests on other platforms, such as ARM, POWER,
and RISC-V, or run any extra tests you suggest.

With that info, would you feel less concerned about a possible change in the
memcg pcp cache locking scheme?


>  And I don't think the problem we're aiming to solve here
> justifies this potential regression.
> 

Strictly speaking, the isolated cpu scheduling problem is already fixed by the
housekeeping patch (with some limitations).

At this point, I am trying to bring focus to a (possible) performance
improvement on the memcg pcp cache locking system.


> Thanks!
> 

Thank you for helping me better understand your arguments and concerns.
I really appreciate it!

Best regards,
Leo
