[alternative-merged] mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch removed from -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Wed, 03 Mar 2021 15:36:37 -0800

The patch titled
     Subject: mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
has been removed from the -mm tree.  Its filename was
     mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch

This patch was dropped because an alternative patch was merged

------------------------------------------------------
From: Roman Gushchin <guro@xxxxxx>
Subject: mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

I've noticed a number of warnings like "vmstat_refresh: nr_free_cma -5" or
"vmstat_refresh: nr_zone_write_pending -11" on our production hosts.  The
numbers of these warnings were relatively low and stable, so it didn't
look like we are systematically leaking the counters.  The corresponding
vmstat counters also looked sane.

These warnings are generated by the vmstat_refresh() function, which
assumes that atomic zone and numa counters can't go below zero.  However,
on a SMP machine it's not quite right: due to per-cpu caching it can in
theory be as low as -(zone threshold) * NR_CPUs.

For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
reached 0.  Then we've reclaimed a small number of cma pages on each CPU
except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
all these pages.  The number of pages can easily exceed the threshold and
a negative value will be committed to the atomic counter.

To fix the problem and avoid generating false warnings, let's just relax
the condition and warn only if the value is less than minus the maximum
theoretically possible drift value, which is 125 * number of online CPUs. 
It will still allow to catch systematic leaks, but will not generate bogus
warnings.

[guro@xxxxxx: fix warnings]
  Link: http://lkml.kernel.org/r/20200717174705.GA55916@xxxxxxxxxxxxxxxxxxxxxxxxxxx
[akpm@xxxxxxxxxxxxxxxxxxxx: comment MAX_THRESHOLD, per Michal]
Link: http://lkml.kernel.org/r/20200714173920.3319063-1-guro@xxxxxx
Signed-off-by: Roman Gushchin <guro@xxxxxx>
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/admin-guide/sysctl/vm.rst |    4 +-
 mm/vmstat.c                             |   32 ++++++++++++++--------
 2 files changed, 23 insertions(+), 13 deletions(-)

--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings
+++ a/Documentation/admin-guide/sysctl/vm.rst
@@ -824,8 +824,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc
 
 As a side-effect, it also checks for negative totals (elsewhere reported
 as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
-(At time of writing, a few stats are known sometimes to be found negative,
-with no ill effects: errors and warnings on these stats are suppressed.)
+(On a SMP machine some stats can temporarily become negative, with no ill
+effects: errors and warnings on these stats are suppressed.)
 
 
 numa_stat
--- a/mm/vmstat.c~mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings
+++ a/mm/vmstat.c
@@ -167,8 +167,14 @@ EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+/* Maximum sync threshold for per-cpu vmstat counters */
 #ifdef CONFIG_SMP
+#define MAX_THRESHOLD 125
+#else
+#define MAX_THRESHOLD 0
+#endif
 
+#ifdef CONFIG_SMP
 int calculate_pressure_threshold(struct zone *zone)
 {
 	int threshold;
@@ -186,11 +192,9 @@ int calculate_pressure_threshold(struct
 	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
 
 	/*
-	 * Maximum threshold is 125
+	 * Threshold is capped by MAX_THRESHOLD
 	 */
-	threshold = min(125, threshold);
-
-	return threshold;
+	return min(MAX_THRESHOLD, threshold);
 }
 
 int calculate_normal_threshold(struct zone *zone)
@@ -622,6 +626,7 @@ void dec_node_page_state(struct page *pa
 }
 EXPORT_SYMBOL(dec_node_page_state);
 #else
+
 /*
  * Use interrupt disable to serialize counter updates
  */
@@ -1838,7 +1843,7 @@ static void refresh_vm_stats(struct work
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
-	long val;
+	long val, max_drift;
 	int err;
 	int i;
 
@@ -1849,17 +1854,22 @@ int vmstat_refresh(struct ctl_table *tab
 	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
 	 * which can equally be echo'ed to or cat'ted from (by root),
 	 * can be used to update the stats just before reading them.
-	 *
-	 * Oh, and since global_zone_page_state() etc. are so careful to hide
-	 * transiently negative values, report an error here if any of
-	 * the stats is negative, so we know to go looking for imbalance.
 	 */
 	err = schedule_on_each_cpu(refresh_vm_stats);
 	if (err)
 		return err;
+
+	/*
+	 * Since global_zone_page_state() etc. are so careful to hide
+	 * transiently negative values, report an error here if any of
+	 * the stats is negative and are less than the maximum drift value,
+	 * so we know to go looking for imbalance.
+	 */
+	max_drift = num_online_cpus() * MAX_THRESHOLD;
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1868,7 +1878,7 @@ int vmstat_refresh(struct ctl_table *tab
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;
_

Patches currently in -mm which might be from guro@xxxxxx are