On 1/25/22 17:43, Sebastian Andrzej Siewior wrote:
> The members of the per-CPU structure memcg_stock_pcp are protected
> either by disabling interrupts or by disabling preemption if the
> invocation occurred in process context.
> Disabling interrupts protects most of the structure excluding task_obj
> while disabling preemption protects only task_obj.
> This schema is incompatible with PREEMPT_RT because it creates atomic
> context in which actions are performed which require preemptible
> context. One example is obj_cgroup_release().
>
> The IRQ-disable and preempt-disable sections can be replaced with
> local_lock_t which preserves the explicit disabling of interrupts
> while keeping the code preemptible on PREEMPT_RT.
>
> The task_obj has been added for performance reasons on non-preemptible
> kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
> model preempt_disable() is always implemented. Also there are no
> memory allocations in in_irq() context and softirqs are processed in
> (preemptible) process context. Therefore it makes sense to avoid using
> task_obj.
>
> Don't use task_obj on PREEMPT_RT and replace manual disabling of
> interrupts with a local_lock_t. This change requires some factoring:
>
> - drain_obj_stock() drops a reference on obj_cgroup which leads to an
>   invocation of obj_cgroup_release() if it is the last object. This
>   in turn leads to recursive locking of the local_lock_t. To avoid
>   this, obj_cgroup_release() is invoked outside of the locked section.
>
> - drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock
>   has been acquired (instead of the task_obj_lock) to avoid recursive
>   locking later in refill_stock().

Looks like this was maybe true in some previous version, but now
drain_obj_stock() gets a bool parameter that is passed on to
obj_cgroup_uncharge_pages(). But drain_local_stock() passes NULL or a
stock_pcp pointer as that bool parameter, which is weird.

> - drain_all_stock() disables preemption via get_cpu() and then invokes
>   drain_local_stock() if it is the local CPU to avoid scheduling a
>   worker (which invokes the same function). Disabling preemption here
>   is problematic due to the sleeping locks in drain_local_stock().
>   This can be avoided by always scheduling a worker, even for the
>   local CPU. Using cpus_read_lock() stabilizes cpu_online_mask which
>   ensures that no worker is scheduled for an offline CPU. Since there
>   is no flush_work(), it is still possible that a worker is invoked on
>   the wrong CPU but it is okay since it always operates on the
>   local-CPU data.
>
> - drain_local_stock() is always invoked as a worker so it can be
>   optimized by removing in_task() (it is always true) and avoiding the
>   "irq_save" variant because interrupts are always enabled here.
>   Operating on task_obj first allows acquiring the local_lock_t
>   without lockdep complaints.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>

The problem is that this pattern, where get_obj_stock() sets a
stock_lock_acquried bool that is then passed down and acted upon
elsewhere, is a well-known massive red flag for Linus :/

Maybe we should indeed just revert 559271146efc? As Michal noted, there
were no hard numbers to justify it, and in previous discussion it
seemed to surface that the costs of irq disable/enable are not as bad
on recent CPUs as assumed.
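For reference, the shape of the conversion the changelog describes is
roughly the following. This is only a minimal sketch, assuming the
patch adds a stock_lock member to memcg_stock_pcp; the helper name and
the elided fields are made up, not the actual patch:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct memcg_stock_pcp {
	local_lock_t stock_lock;
	/* ... cached memcg/objcg state, nr_pages, ... */
};

static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
};

static void refill_stock_sketch(unsigned int nr_pages)
{
	unsigned long flags;

	/*
	 * On !PREEMPT_RT this maps to local_irq_save(), i.e. the
	 * current behaviour; on PREEMPT_RT it takes a per-CPU
	 * spinlock_t instead and the section stays preemptible.
	 */
	local_lock_irqsave(&memcg_stock.stock_lock, flags);
	/* ... operate on this_cpu_ptr(&memcg_stock) ... */
	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}

A nice side effect is that the scope of the protection becomes visible
to lockdep on both preemption models.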
> ---
>  mm/memcontrol.c | 174 +++++++++++++++++++++++++++++++-----------------
>  1 file changed, 114 insertions(+), 60 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3d1b7cdd83db0..2d8be88c00888 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -260,8 +260,10 @@ bool mem_cgroup_kmem_disabled(void)
>  	return cgroup_memory_nokmem;
>  }
>
> +struct memcg_stock_pcp;

Seems this forward declaration is unused.

>  static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> -				      unsigned int nr_pages);
> +				      unsigned int nr_pages,
> +				      bool stock_lock_acquried);
>
>  static void obj_cgroup_release(struct percpu_ref *ref)
>  {
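And for the recursive-locking point in the changelog: the usual way to
avoid it without threading a "lock already held" flag through the call
chain is to detach the objcg under the lock and drop the reference
only after unlocking. A minimal sketch, reusing the memcg_stock sketch
above and assuming a hypothetical cached_objcg field:

static void drain_local_stock_sketch(struct work_struct *dummy)
{
	struct memcg_stock_pcp *stock;
	struct obj_cgroup *old;
	unsigned long flags;

	local_lock_irqsave(&memcg_stock.stock_lock, flags);

	stock = this_cpu_ptr(&memcg_stock);
	old = stock->cached_objcg;	/* hypothetical field name */
	stock->cached_objcg = NULL;
	/* ... flush the cached pages/bytes ... */

	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);

	/*
	 * If this was the last reference, obj_cgroup_release() runs
	 * and ends up taking stock_lock again (via refill_stock());
	 * doing the put here, outside the locked section, avoids the
	 * recursion the changelog describes.
	 */
	if (old)
		obj_cgroup_put(old);
}

Something like that would also make the stock_lock_acquried plumbing
unnecessary.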