On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > On Thu, 30 Jul 2020, Roman Gushchin wrote: > > On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote: > > > > > > But a better idea is perhaps to redefine the behavior of > > > "echo >/proc/sys/vm/stat_refresh". What if > > > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to > > > disable or enable the warning (permanently? or just that time?): > > > disable would be more "back-compatible", but I think it's okay > > > if you prefer enable. Or "someparticularstring" could actually > > > specify the warning threshold you want to use - you might echo > > > 125 or 16000, I might echo 0. We can haggle over the default. > > > > May I ask you, what kind of problems you have in your in mind, > > which can be revealed by these warnings? Or maybe there is some > > history attached? > > Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE > accounting, but IIRC (though I might be making this up) there was > also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting. > > When one of the stats used for balancing or limiting in vmscan.c > trends increasingly negative, it becomes increasingly difficult > for those heuristics (adding on to others, comparing with others) > to do what they're intended to do: they behave increasingly weirdly. > > Now the same (or the opposite) is true if one of those stats trends > increasingly positive: but if it leaks positive, it's visible in > /proc/vmstat; whereas if it leaks negative, it's presented there as 0. > > And most of the time (when unsynchronized) showing 0 is much better > than showing a transient negative. But to help fix bugs, we do need > some way of seeing the negatives, and vm/stat_refresh provides an > opportunity to do so, when it synchronizes. > > I'd be glad not to show the transients if I knew them: set a flag > on any that go negative, and only show if negative twice or more > in a row? Perhaps, but I don't relish adding that, and think it > would be over-engineering. > > It does sound to me like echoing the warning threshold into > /proc/sys/vm/stat_refresh is the best way to satisfy us both. > > Though another alternative did occur to me overnight: we could > scrap the logged warning, and show "nr_whatever -53" as output > from /proc/sys/vm/stat_refresh: that too would be acceptable > to me, and you redirect to /dev/null. It sounds like a good idea to me. Do you want me to prepare a patch? > > (Why did I choose -53 in my example? An in-joke: when I looked > through our machines for these warnings, on old kernels with my > old shmem hugepage implementation, there were a striking number > with "nr_shmem_freeholes -53"; but I'm a few years too late to > investigate what was going on there.) :) > > > > > If it's all about some particular counters, which are known to be > > strictly positive, maybe we should do the opposite, and check only > > those counters? Because in general it's not an indication of a problem. > > Yet it's very curious how few stats ever generate such warnings: > you're convinced they're just transient noise, and you're probably right; > but I am a little suspicious of whether they are accounted correctly. Yeah, I was initially very suspicious too, but I didn't find any issues and still think it's not an indication of a problem. Thank you!