On 8/4/21 11:54 AM, Mel Gorman wrote: > On Wed, Aug 04, 2021 at 01:54:47AM +0200, Thomas Gleixner wrote: >> >> <tglx> so in vmstat.c there is this magic comment: >> <tglx> * For use when we know that interrupts are disabled >> <tglx> * or when we know that preemption is disabled and that >> <tglx> * particular counter cannot be updated from interrupt context. >> <tglx> how can I know which counters need what? >> <mm_expert> I don't think there's a list, one would have to check on counter to counter basis :/ >> <tglx> and of course there is nothing which validates that, right? >> <mm_expert> exactly >> > > While I'm not "mm_expert", I agree with his/her statements. Phew, since you do, I can now disclose it was me. > Each counter > would need to be audited and two question are asked > > o If this counter is inaccurate, does anything break? > o If this counter is inaccurate, does it both increment and decrement > allowing the possibility it goes negative? > > The decision on that is looking at the counter and seeing if any > functional decision is made based on its value. So two examples; > > NR_VMSCAN_IMMEDIATE is a node-based counter that only every > increments and is reported to userspace. No kernel code makes > any decision based on its value. Therefore it's likely safe to > move to numa_stat_item instead. > > Action: move it > > WORKINGSET_ACTIVATE_FILE is a node-based counter that is used to > determine if a mem cgroup is potentially congested by looking at > the ratio of cgroup to node refault rates as well as deciding if > LRU file pages should be deactivate. If that value drifts, the > ratios are miscalculated and could lead to functional oddities > and therefore must be accurate. > > Action: leave it alone > > I guess it could be further split into state that must be accurate from > IRQ and non-IRQ contexts but that probably would be very fragile and > offer limited value. > >> Brilliant stuff which prevents you to do any validation on this. Over >> the years there have been several issues where callers had to be fixed >> by analysing bug reports instead of having a proper instrumentation in >> that code which would have told the developer that he got it wrong. >> > > I'm not sure it could be validated at build-time but I'm just back from > holiday and may be lacking imagination. The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or whatnot), i.e.: <sched_expert> what that code needs is switch(item) { case foo1: case foo2: lockdep_assert_irqs_disabled(); break; case bar1: case bar2: lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or something along those lines >> Of course on RT kernels the preempt_disable_rt() will serialize >> everything correctly, but as we have learned over the years just >> slapping _if_rt() or if_not_rt() variants of things around is most of >> the time papering over the underlying problem of badly defined >> protection scopes. Let's not proliferate that. As I said in the above >> IRC conversation: >> >> <tglx> I fundamentally hate this preempt_disable_rt() muck >> > > The issue is that even if this was properly audited and the inaccurate > and accurate counters were in the proper enums using the correct APIs, it > would still be necessary to protect the accurate counters from updates from > IRQ context. Hence, as I write this, I don't think preempt_[dis|en]able_rt > would go away and that is why I didn't continue with the series to break > out "accurate" stats >