On 2021-12-07 11:55:38 [-0500], Johannes Weiner wrote:
> On Tue, Dec 07, 2021 at 04:52:08PM +0100, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >
> > MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> > requirements. This includes:
> > - relying on disabled interrupts from spin_lock_irqsave() locking for
> >   something not related to lock itself (like the per-CPU counter).
>
> If memory serves me right, this is the VM_BUG_ON() in workingset.c:
>
> 	VM_WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */
>
> This isn't memcg specific. This is the serialization model of the
> generic MM page counters. They can be updated from process and irq
> context, and need to avoid preemption (and corruption) during RMW.
>
> !CONFIG_MEMCG:
>
> 	static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> 						 int val)
> 	{
> 		struct page *page = virt_to_head_page(p);
>
> 		mod_node_page_state(page_pgdat(page), idx, val);
> 	}
>
> which does:
>
> 	void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> 				 long delta)
> 	{
> 		unsigned long flags;
>
> 		local_irq_save(flags);
> 		__mod_node_page_state(pgdat, item, delta);
> 		local_irq_restore(flags);
> 	}
>
> If this breaks PREEMPT_RT, it's broken without memcg too.

mod_node_page_state() looks fine. But if disabled interrupts are what
protects the RMW operation, then this has to be used everywhere and
cannot be assumed to be inherited from spin_lock_irq(). Also, none of
this code should be invoked from IRQ context on PREEMPT_RT.
If the locking scope is known, then local_irq_disable() could be
replaced with a local_lock_t to avoid other things that appear
unrelated, like the memcg_check_events() invocation in uncharge_batch().
The problematic part here is mem_cgroup_tree_per_node::lock, which
cannot be acquired with disabled interrupts on PREEMPT_RT. The "locking
scope" is not always clear to me.
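To make the suggestion concrete, a local_lock_t variant of
mod_node_page_state() could look roughly like the sketch below. This is
a hypothetical, untested sketch, not a patch; the lock name vmstat_lock
is made up here, and the real protection scope of __mod_node_page_state()
would need auditing first:

	/* Hypothetical sketch: a per-CPU lock that names the protection
	 * scope explicitly instead of relying on disabled interrupts. */
	static DEFINE_PER_CPU(local_lock_t, vmstat_lock) =
		INIT_LOCAL_LOCK(vmstat_lock);

	void mod_node_page_state(struct pglist_data *pgdat,
				 enum node_stat_item item, long delta)
	{
		/* On !PREEMPT_RT this still disables interrupts; on
		 * PREEMPT_RT it becomes a per-CPU sleeping lock. */
		local_lock_irq(&vmstat_lock);
		__mod_node_page_state(pgdat, item, delta);
		local_unlock_irq(&vmstat_lock);
	}

Every other updater of the same per-CPU state would have to take the
same lock, which is exactly the "has to be used everywhere" point above.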
Also, if it is _just_ the counter, then we might solve this differently.

> > - explicitly disabling interrupts and acquiring a spinlock_t based lock
> >   like in memcg_check_events() -> eventfd_signal().
>
> Similar problem to the above: we disable interrupts to protect RMW
> sequences that can (on non-preemptrt) be initiated through process
> context as well as irq context.
>
> IIUC, the PREEMPT_RT construct for handling exactly that scenario is
> the "local lock". Is that correct?

On !PREEMPT_RT, this_cpu_inc() can be used in hard-IRQ and task context
equally, while __this_cpu_inc() is "optimized" to be used in IRQ context
or any context where it cannot be interrupted during its operation.
local_irq_save() and spin_lock_irq() both disable interrupts here.

On PREEMPT_RT, chances are high that the code never runs with disabled
interrupts: local_irq_save() disables interrupts, yes, but
spin_lock_irq() does not. Therefore a per-object lock, say
address_space::i_pages, cannot be used to protect otherwise unrelated
per-CPU data, a global DEFINE_PER_CPU(). The reason is that you can
acquire address_space::i_pages and get preempted in the middle of
__this_cpu_inc(). Then another task on the same CPU can acquire
address_space::i_pages of another struct address_space and perform
__this_cpu_inc() on the very same per-CPU data. There is your
interruption of a RMW operation.

local_lock_t is a per-CPU lock which can be used to synchronize access
to per-CPU variables which are otherwise unprotected / rely on disabled
preemption / interrupts. So yes, it could be used as a substitute in
situations where !PREEMPT_RT needs to manually disable interrupts.

So this:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	__this_cpu_add(counter, 1);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_irq_disable();
|	__this_cpu_add(counter, 1);
|	local_irq_enable();
|}

construct breaks on PREEMPT_RT.
With local_lock_t that would be:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	local_lock(&counter_lock);
|	__this_cpu_add(counter, 1);
|	local_unlock(&counter_lock);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_lock_irq(&counter_lock);
|	__this_cpu_add(counter, 1);
|	local_unlock_irq(&counter_lock);
|}

Ideally you would attach counter_lock to the same struct where the
counter is defined so the protection scope is obvious. As you can see,
local_irq_disable() was substituted with local_lock_irq(), but a
local_lock() was also added to func1().

> It appears Ingo has already fixed the LRU cache, which for non-rt also
> relies on irq disabling:
>
> 	commit b01b2141999936ac3e4746b7f76c0f204ae4b445
> 	Author: Ingo Molnar <mingo@xxxxxxxxxx>
> 	Date:   Wed May 27 22:11:15 2020 +0200
>
> 	    mm/swap: Use local_lock for protection
>
> The memcg charge cache should be fixable the same way.
>
> Likewise, if you fix the generic vmstat counters like this, the memcg
> implementation can follow suit.

The vmstat counters are already fixed since commit c68ed7945701a
("mm/vmstat: protect per cpu variables with preempt disable on RT"),
again by Ingo. We need to agree on how to proceed with these counters,
and then we can tackle what is left :)
It should be enough to disable preemption during the update, since on
PREEMPT_RT that update does not happen in IRQ context.

Sebastian