On Sun, 28 Feb 2021, Roman Gushchin wrote:
> On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> > vmstat_refresh() can occasionally catch nr_zone_write_pending and
> > nr_writeback when they are transiently negative. The reason is partly
> > that the interrupt which decrements them in test_clear_page_writeback()
> > can come in before __test_set_page_writeback() got to increment them;
> > but transient negatives are still seen even when that is prevented, and
> > we have not yet resolved why (Roman believes that it is an unavoidable
> > consequence of the refresh scheduled on each cpu). But those stats are
> > not buggy, they have never been seen to drift away from 0 permanently:
> > so just avoid the annoyance of showing a warning on them.
> >
> > Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> > that one reported negative from /proc/sys/vm/stat_refresh too, but it
> > does drift away permanently: I believe that's because its incrementation
> > and decrementation are decided by page migratetype, but the migratetype
> > of a pageblock is not guaranteed to be constant.
> >
> > Use switch statements so we can most easily add or remove cases later.
>
> I'm OK with the code, but I can't fully agree with the commit log. I don't
> think there is any mystery around negative values. Let me copy-paste the
> explanation from my original patch:
>
> These warnings* are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero. However,
> on an SMP machine it's not quite right: due to per-cpu caching it can in
> theory be as low as -(zone threshold) * NR_CPUs.
>
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0. Then we've reclaimed a small number of cma pages on each CPU
> except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
> positive (the atomic counter is still 0). Then somebody on CPU0 consumes
> all these pages. The number of pages can easily exceed the threshold and
> a negative value will be committed to the atomic counter.
>
> * warnings about negative NR_FREE_CMA_PAGES

Hi Roman, thanks for your Acks on the others - and indeed this is the one
on which disagreement was more to be expected.

I certainly wanted (and included below) a Link to your original patch;
and even wondered whether to paste your description into mine. But I read
it again and still have issues with it.

Mainly, it does not convey at all that touching stat_refresh adds the
per-cpu counts into the global atomics, resetting the per-cpu counts
to 0. That does not invalidate your explanation: races might still manage
to underflow; but it does take the "easily" out of "can easily exceed".

Since I don't use CMA on any machine, I cannot be sure, but it looked
like a bad example to rely upon, because of its migratetype-based
accounting. If you use /proc/sys/vm/stat_refresh frequently enough,
without suppressing the warning, I guess that uncertainty could be
resolved by checking whether nr_free_cma is seen with a negative value
in consecutive refreshes - which would tend to support my migratetype
theory - or only singly - which would support your raciness theory.

> Actually, the same is almost true for ANY other counter. What
> distinguishes the CMA, dirty and write pending counters is that they can
> reach 0 under normal conditions. Other counters usually do not reach
> values small enough to show negatives on a reasonably sized machine.
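To make the per-cpu caching scenario above concrete: here is a minimal
user-space sketch of the batching, not kernel code - the threshold, CPU
count and fold policy are all invented, loosely modelled on
mod_zone_page_state() and on the fold that the refresh performs.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS   4
#define THRESHOLD 8	/* fold a delta into the atomic once |delta| > 8 */

static atomic_long nr_free_cma;		/* the shared atomic counter */
static long percpu_delta[NR_CPUS];	/* per-cpu cached deltas */

/* Loosely mod_zone_page_state(): batch locally, commit past the threshold. */
static void mod_counter(int cpu, long v)
{
	percpu_delta[cpu] += v;
	if (percpu_delta[cpu] > THRESHOLD || percpu_delta[cpu] < -THRESHOLD) {
		atomic_fetch_add(&nr_free_cma, percpu_delta[cpu]);
		percpu_delta[cpu] = 0;
	}
}

/* Loosely what the refresh does: fold every per-cpu delta, reset it to 0. */
static void refresh(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		atomic_fetch_add(&nr_free_cma, percpu_delta[cpu]);
		percpu_delta[cpu] = 0;
	}
}

int main(void)
{
	int cpu, i;

	/* CPUs 1..3 each free 5 cma pages: every delta stops at +5, under
	 * the threshold, so nothing is committed - the atomic still reads
	 * 0 although 15 pages are really free. */
	for (cpu = 1; cpu < NR_CPUS; cpu++)
		for (i = 0; i < 5; i++)
			mod_counter(cpu, +1);

	/* CPU0 then consumes all 15: its delta crosses the threshold at -9
	 * and commits, while the +5s sit uncommitted on the other CPUs. */
	for (i = 0; i < 15; i++)
		mod_counter(0, -1);

	printf("before fold: %ld\n", atomic_load(&nr_free_cma));	/* -9 */
	refresh();
	printf("after fold:  %ld\n", atomic_load(&nr_free_cma));	/*  0 */
	return 0;
}

Run single-threaded it prints -9 before the fold and the true 0 after it:
which is the point made above - by the time vmstat_refresh() checks for
negatives it has already done the fold, so an underflow has to race with
the fold itself to be seen. Possible, but no longer "easily".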
Looking through /proc/vmstat now, yes, I can see that there are fewer
counters which hover near 0 than I had imagined: more have a positive
bias, or are monotonically increasing. And I'd be lying if I said I'd
never seen any others than nr_writeback or nr_zone_write_pending caught
negative.

But what are you asking for? Should the patch be changed, to retry the
refresh_vm_stats() before warning, if it sees any negative? Depends on
how terrible one line in dmesg is considered! (A rough sketch of such a
retry is appended after the patch below.)

> Does it make sense?

I'm not sure: you were not asking for the patch to be changed, but its
commit log: and I'd better not say "Roman believes that it is an
unavoidable consequence of the refresh scheduled on each cpu" if that's
untrue (or unclear: now it reads to me as if we're accusing the refresh
of messing things up, whereas it's the non-atomic nature of the refresh
which leaves it vulnerable to races).

Hugh

> >
> > Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@xxxxxx/
> > Reported-by: Roman Gushchin <guro@xxxxxx>
> > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > ---
> >
> >  mm/vmstat.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > --- vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
> > +++ vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
> > @@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
> >  	if (err)
> >  		return err;
> >  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> > +		/*
> > +		 * Skip checking stats known to go negative occasionally.
> > +		 */
> > +		switch (i) {
> > +		case NR_ZONE_WRITE_PENDING:
> > +		case NR_FREE_CMA_PAGES:
> > +			continue;
> > +		}
> >  		val = atomic_long_read(&vm_zone_stat[i]);
> >  		if (val < 0) {
> >  			pr_warn("%s: %s %ld\n",
> > @@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
> >  	}
> >  #endif
> >  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> > +		/*
> > +		 * Skip checking stats known to go negative occasionally.
> > +		 */
> > +		switch (i) {
> > +		case NR_WRITEBACK:
> > +			continue;
> > +		}
> >  		val = atomic_long_read(&vm_node_stat[i]);
> >  		if (val < 0) {
> >  			pr_warn("%s: %s %ld\n",
>
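For what it's worth, the retry floated above might look something like
this - an untested sketch of the zone-stat loop only, reusing the names
vmstat_refresh() already uses (schedule_on_each_cpu(refresh_vm_stats) is
what it calls to fold the per-cpu counts); the single-retry policy is an
invented illustration, not a proposal:

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (val < 0) {
			/*
			 * Fold the per-cpu deltas once more before
			 * trusting a negative snapshot, so that a
			 * transient race does not reach dmesg.
			 */
			err = schedule_on_each_cpu(refresh_vm_stats);
			if (err)
				return err;
			val = atomic_long_read(&vm_zone_stat[i]);
		}
		if (val < 0)
			pr_warn("%s: %s %ld\n",
				__func__, zone_stat_name(i), val);
	}

The node-stat loop would want the same treatment; and of course a second
refresh can lose the same race again, so this narrows the window rather
than closing it.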