On Mon, Mar 28, 2022 at 03:51:43PM +0200, Nicolas Saenz Julienne wrote:
> > Now we don't explicitly have this pattern because there isn't an
> > obvious this_cpu_read() for example but it can accidentally happen for
> > counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
> > an example although a harmless one.
> >
> > Any of the mod_page_state ones are more problematic though because we
> > lock one PCP but potentially update the per-cpu pcp stats of another CPU
> > of a different PCP that we have not locked and those counters must be
> > accurate.
>
> But IIUC vmstats don't track pcplist usage (i.e. adding a page into the local
> pcplist doesn't affect the count at all). It is only when interacting with the
> buddy allocator that they get updated. It makes sense for the CPU that
> adds/removes pages from the allocator to do the stat update, regardless of the
> page's journey.
>

It probably doesn't, I didn't audit it. As I said, it's subtle, which is
why I'm wary of relying on the accidental safety of getting a per-cpu
pointer that may not be stable. Even if it is ok *now*, I would worry
that it would break in the future. There have already been cases where
patches accidentally tried to move vmstat updates outside the
appropriate locking.

> > It *might* still be safe but it's subtle, it could be easily accidentally
> > broken in the future and it would be hard to detect because it would be
> > very slow corruption of VM counters like NR_FREE_PAGES that must be
> > accurate.
>
> What does accurate mean here? vmstat consumers don't get accurate data, only
> snapshots.

They are accurate in the sense that they are eventually consistent.
zone_page_state_snapshot exists to get a more accurate count; there is
always some drift, but the counters converge eventually. There is a
clear distinction between VM counters, which can be inaccurate because
they only exist to assist debugging, and vmstats like NR_FREE_PAGES that
the kernel uses to make decisions. It potentially gets very problematic
if a per-cpu pointer acquired for one zone is carried across a CPU
migration and the wrong vmstat is updated. It *might* still be ok, I
haven't audited it, but if it is possible for two CPUs to do a RMW on
the same per-cpu vmstat structure, it will corrupt and that will be
difficult to detect.

> And as I comment above you can't infer information about pcplist
> usage from these stats. So, I see no real need for CPU locality when updating
> them (which we're still retaining nonetheless, as per my comment above), the
> only thing that is really needed is atomicity, achieved by disabling IRQs (and
> preemption on RT). And this, even with your solution, is achieved through the
> struct zone's spin_lock (plus a preempt_disable() in RT).
>

Yes, but in the series I had, I was using local_lock to stabilise which
CPU is being used before acquiring the per-cpu pointer. Strictly
speaking, it does not need a local_lock, but the local_lock makes it
clearer what is being protected and it works with PROVE_LOCKING, which
already caught a problematic softirq interaction for me while developing
the series.
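
For reference, the shape of the pattern I mean is roughly the following.
This is a simplified sketch from memory rather than the exact code in the
series, but the names are close to what page_alloc.c already uses:

	unsigned long flags;
	struct per_cpu_pages *pcp;

	/*
	 * The local_lock pins the task to the current CPU (and disables
	 * IRQs on !PREEMPT_RT), so the per-cpu pointer looked up below
	 * cannot change under us for the whole critical section.
	 */
	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);

	/*
	 * Manipulate pcp->lists/pcp->count and apply any per-cpu vmstat
	 * deltas here. Because the CPU is stabilised, the deltas hit the
	 * structure belonging to the lock that is actually held.
	 */

	local_unlock_irqrestore(&pagesets.lock, flags);

Without the local_lock (or at least a migrate_disable()), the per-cpu
pointer could be fetched on one CPU and then used after migrating to
another, which is exactly the subtle breakage I'm worried about.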
> All in all, my point is that none of the stats are affected by the change, nor
> have a dependency with the pcplists handling. And if we ever have the need to
> pin vmstat updates to pcplist usage they should share the same pcp structure.
> That said, I'm happy with either solution as long as we get remote pcplist
> draining. So if still unconvinced, let me know how can I help. I have access to
> all sorts of machines to validate perf results, time to review, or even to move
> the series forward.
>

I also want remote draining for PREEMPT_RT to avoid interference on
isolated CPUs due to workqueue activity but, whatever the solution, I
would be happier if the per-cpu lock is acquired with the CPU stabilised
and covers the scope of any vmstat delta updates stored in the per-cpu
structure.

The earliest I will rebase my series on is 5.18-rc1, as I see limited
value in basing it on 5.17 when aiming for the 5.19 merge window.

--
Mel Gorman
SUSE Labs