Hi Mel,

On Fri, 2022-03-25 at 10:48 +0000, Mel Gorman wrote:
> > [1] It follows this pattern:
> >
> > 	struct per_cpu_pages *pcp;
> >
> > 	pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
> > 	// <- Migration here is OK: spin_lock protects vs. eventual pcplist
> > 	// access from the local CPU, as long as all list access happens
> > 	// through the pcp pointer.
> > 	spin_lock(&pcp->lock);
> > 	do_stuff_with_pcp_lists(pcp);
> > 	spin_unlock(&pcp->lock);
>
> And this is the part I am concerned with. We are accessing a PCP
> structure that is not necessarily the one belonging to the CPU we
> are currently running on. This type of pattern is warned about in
> Documentation/locking/locktypes.rst:
>
> ---8<---
> A typical scenario is protection of per-CPU variables in thread context::
>
>     struct foo *p = get_cpu_ptr(&var1);
>
>     spin_lock(&p->lock);
>     p->count += this_cpu_read(var2);
>
> This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
> this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
> not allow to acquire p->lock because get_cpu_ptr() implicitly disables
> preemption. The following substitution works on both kernels::
> ---8<---
>
> Now, we don't explicitly have this pattern, because there isn't an
> obvious this_cpu_read() for example, but it can accidentally happen for
> counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
> an example, although a harmless one.
>
> Any of the mod_page_state ones are more problematic though, because we
> lock one PCP but potentially update the per-CPU stats of another CPU's
> PCP that we have not locked, and those counters must be accurate.

But IIUC vmstats don't track pcplist usage (i.e. adding a page into the
local pcplist doesn't affect the counts at all); they are only updated
when interacting with the buddy allocator. It makes sense for the CPU
that adds/removes pages from the allocator to do the stat update,
regardless of the page's journey.
> It *might* still be safe, but it's subtle, it could easily be
> accidentally broken in the future, and it would be hard to detect
> because it would be very slow corruption of VM counters like
> NR_FREE_PAGES that must be accurate.

What does accurate mean here? vmstat consumers don't get accurate data,
only snapshots. And as I comment above, you can't infer information
about pcplist usage from these stats. So I see no real need for CPU
locality when updating them (which we're still retaining nonetheless,
as per my comment above); the only thing that is really needed is
atomicity, achieved by disabling IRQs (and preemption on RT). And even
with your solution, this is achieved through the struct zone's spin_lock
(plus a preempt_disable() on RT).

All in all, my point is that none of the stats are affected by the
change, nor do they have a dependency on the pcplist handling. And if we
ever need to pin vmstat updates to pcplist usage, they should share the
same pcp structure.

That said, I'm happy with either solution as long as we get remote
pcplist draining. So if you're still unconvinced, let me know how I can
help. I have access to all sorts of machines to validate perf results,
time to review, or even to move the series forward.

Thanks!

-- 
Nicolás Sáenz