+ mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 22 Oct 2019 14:59:29 -0700

The patch titled
     Subject: mm/vmstat: reduce zone lock hold time when reading /proc/pagetypeinfo
has been added to the -mm tree.  Its filename is
     mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Waiman Long <longman@xxxxxxxxxx>
Subject: mm/vmstat: reduce zone lock hold time when reading /proc/pagetypeinfo

pagetypeinfo_showfree_print() prints out the number of free blocks for
each of the page orders and migrate types.  The current code just iterates
over each of the free lists to get the counts.  There are bug reports about
hard lockup panics when reading /proc/pagetypeinfo just because it took too
long to iterate all the free lists within a zone while holding the zone
lock with irq disabled.

Given that /proc/pagetypeinfo is readable by all, the possibility that any
user can crash a system by the simple act of reading /proc/pagetypeinfo is
a security problem that needs to be addressed.

There is a free_area structure associated with each page order.  There is
also a nr_free count within the free_area for all the different migration
types combined.  Tracking the number of free list entries for each
migration type will probably add some overhead to the fast paths like
moving pages from one migration type to another which may not be
desirable.

We can actually skip iterating the list of one of the migration types and
use nr_free to compute the missing count.  Since MIGRATE_MOVABLE is
usually the largest one on large memory systems, this is the one to be
skipped.  Since the printing order is migration-type => order, we will
have to store the counts in an internal 2D array before printing them out.
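
The subtraction described above can be illustrated with a small userspace
sketch.  It is not the kernel code: the sketch_* names, the sizes and the
singly linked list are hypothetical stand-ins for free_area, free_list and
the real constants.  It only shows the idea of walking every list except
one and deriving the skipped count from the per-order total.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for MAX_ORDER, MIGRATE_TYPES and MIGRATE_MOVABLE. */
enum { SKETCH_ORDERS = 3, SKETCH_TYPES = 4, SKETCH_MOVABLE = 1 };

struct sketch_node {
	struct sketch_node *next;
};

/* Simplified model of a free_area: one list per type plus a combined total. */
struct sketch_area {
	struct sketch_node *free_list[SKETCH_TYPES];	/* NULL-terminated lists */
	unsigned long nr_free;				/* entries in all lists combined */
};

static unsigned long sketch_list_len(const struct sketch_node *n)
{
	unsigned long len = 0;

	for (; n; n = n->next)
		len++;
	return len;
}

/*
 * Fill nfree[order][type].  Every list except SKETCH_MOVABLE is walked;
 * the skipped count is nr_free minus the sum of the walked lists.
 */
static void sketch_count(const struct sketch_area *areas,
			 unsigned long nfree[SKETCH_ORDERS][SKETCH_TYPES])
{
	for (int order = 0; order < SKETCH_ORDERS; order++) {
		unsigned long walked = 0;

		for (int type = 0; type < SKETCH_TYPES; type++) {
			if (type == SKETCH_MOVABLE)
				continue;
			nfree[order][type] =
				sketch_list_len(areas[order].free_list[type]);
			walked += nfree[order][type];
		}
		/* The total must cover everything that was walked. */
		assert(areas[order].nr_free >= walked);
		nfree[order][SKETCH_MOVABLE] = areas[order].nr_free - walked;
	}
}
```

Because the array is indexed [order][type] while the output is grouped by
type, the counts have to be collected first and printed in a second pass.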

Even when skipping the MIGRATE_MOVABLE pages, we may still hold the zone
lock for too long, preventing other zone lock waiters from running.  This
can be problematic for systems with large amounts of memory.  So a check
is added to temporarily release the lock and reschedule if more than 64k
list entries have been iterated for a given order.  With a MAX_ORDER of
11, the worst case will be iterating about 700k list entries before
releasing the lock.
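
The worst-case figure can be checked with a small userspace model.  This
is only a sketch under stated assumptions: sketch_longest_hold() is a
hypothetical helper, walked[] holds the number of list entries iterated
for each order, and the lock is assumed to be dropped only after an order
whose count exceeds the limit, as the per-order check does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the largest number of list entries walked in one uninterrupted
 * lock hold.  The lock is modelled as being released after any order
 * whose walked count is strictly greater than limit.
 */
static unsigned long sketch_longest_hold(const unsigned long *walked,
					 int norders, unsigned long limit)
{
	unsigned long run = 0, longest = 0;

	assert(walked != NULL);

	for (int order = 0; order < norders; order++) {
		run += walked[order];
		if (walked[order] > limit) {
			/* Lock dropped here: close the current hold. */
			if (run > longest)
				longest = run;
			run = 0;
		}
	}
	/* The final hold lasts until the function returns. */
	if (run > longest)
		longest = run;
	return longest;
}
```

With 11 orders that each walk exactly 65536 entries, the threshold is
never exceeded and the model returns 11 * 65536 = 720896, which matches
the "about 700k" estimate.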

Link: http://lkml.kernel.org/r/20191022162156.17316-1-longman@xxxxxxxxxx
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Roman Gushchin <guro@xxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Song Liu <songliubraving@xxxxxx>
Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Rafael Aquini <aquini@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/vmstat.c |   51 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 10 deletions(-)

--- a/mm/vmstat.c~mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo
+++ a/mm/vmstat.c
@@ -1374,23 +1374,54 @@ static void pagetypeinfo_showfree_print(
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
 
-	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
-		seq_printf(m, "Node %4d, zone %8s, type %12s ",
-					pgdat->node_id,
-					zone->name,
-					migratetype_names[mtype]);
-		for (order = 0; order < MAX_ORDER; ++order) {
+	lockdep_assert_held(&zone->lock);
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * MIGRATE_MOVABLE is usually the largest one in large memory
+	 * systems. We skip iterating that list. Instead, we compute it by
+	 * subtracting the total of the rest from free_area->nr_free.
+	 */
+	for (order = 0; order < MAX_ORDER; ++order) {
+		unsigned long nr_total = 0;
+		struct free_area *area = &(zone->free_area[order]);
+
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 			unsigned long freecount = 0;
-			struct free_area *area;
 			struct list_head *curr;
 
-			area = &(zone->free_area[order]);
-
+			if (mtype == MIGRATE_MOVABLE)
+				continue;
 			list_for_each(curr, &area->free_list[mtype])
 				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			nfree[order][mtype] = freecount;
+			nr_total += freecount;
 		}
+		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
+
+		/*
+		 * If we have already iterated more than 64k of list
+		 * entries, we might have held the zone lock for too long.
+		 * Temporarily release the lock and reschedule before
+		 * continuing so that other lock waiters have a chance
+		 * to run.
+		 */
+		if (nr_total > (1 << 16)) {
+			spin_unlock_irq(&zone->lock);
+			cond_resched();
+			spin_lock_irq(&zone->lock);
+		}
+	}
+
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+		seq_printf(m, "Node %4d, zone %8s, type %12s ",
+					pgdat->node_id,
+					zone->name,
+					migratetype_names[mtype]);
+		for (order = 0; order < MAX_ORDER; ++order)
+			seq_printf(m, "%6lu ", nfree[order][mtype]);
 		seq_putc(m, '\n');
 	}
 }
_

Patches currently in -mm which might be from longman@xxxxxxxxxx are

mm-vmstat-reduce-zone-lock-hold-time-when-reading-proc-pagetypeinfo.patch