+ mm-swap-piggyback-lru_add_drain_all-calls.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Sat, 05 Oct 2019 13:52:43 -0700

The patch titled
     Subject: mm/swap.c: piggyback lru_add_drain_all() calls
has been added to the -mm tree.  Its filename is
     mm-swap-piggyback-lru_add_drain_all-calls.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swap-piggyback-lru_add_drain_all-calls.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swap-piggyback-lru_add_drain_all-calls.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
Subject: mm/swap.c: piggyback lru_add_drain_all() calls

lru_add_drain_all() is a very slow operation.  Right now
POSIX_FADV_DONTNEED is the top user because it has to freeze page
references when removing pages from the cache.  invalidate_bdev() calls it
for the same reason.  Both are triggered from userspace, so it's easy to
generate a storm.
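
For illustration only (not part of this patch), a minimal userspace sketch
of the kind of call that may reach lru_add_drain_all() through the
POSIX_FADV_DONTNEED path.  The helper name drop_file_cache() is
hypothetical; whether the kernel actually drains depends on the state of
the cached pages.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to drop the cached pages of an
 * open file.  Returns 0 on success or an errno value on failure.
 */
int drop_file_cache(int fd)
{
	/* Dirty pages cannot be dropped, so write them back first. */
	if (fsync(fd) != 0)
		return errno;

	/*
	 * offset 0 with len 0 means "the whole file".  posix_fadvise()
	 * returns the error number directly and does not set errno.
	 */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

Calling such a helper in a tight loop from many tasks is the sort of
userspace-driven load described above.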

mlock/mlockall no longer call lru_add_drain_all(); I have seen serious
slowdowns there on older kernels.

There are some less obvious paths in memory migration/CMA/offlining which
shouldn't call it frequently.


The worst case requires a non-trivial workload because lru_add_drain_all()
skips cpus where vectors are empty.  Something must constantly generate a
flow of pages for each cpu.  The cpus must also be busy, so that the
per-cpu works are scheduled more slowly.  And the machine must be big
enough (64+ cpus in our case).

In our case that was a massive series of mlock calls in map-reduce while
other tasks wrote logs (and generated flows of new pages in per-cpu
vectors).  The mlock calls were serialized by the mutex and accumulated
latency of 10 seconds or more.

The kernel has not called lru_add_drain_all() on mlock paths since 4.15,
but the same scenario could be triggered by fadvise(POSIX_FADV_DONTNEED)
or any other remaining user.


There is no reason to do the drain again if somebody else already drained
all the per-cpu vectors while we waited for the lock.

Piggyback on a drain that started and finished while we waited for the
lock: all pages pending at the time of our entry were drained from the
vectors.
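
The idea can be sketched in userspace as follows.  This is only an
illustration under stated assumptions, not the kernel code: it uses a
pthread mutex and a C11 atomic generation counter in place of the
seqcount, and all names (drain_snapshot, drain_since, drain_all,
do_expensive_drain) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t drain_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint drain_gen;		/* bumped at the start of every real drain */
static unsigned int drains_done;	/* protected by drain_lock */

/* Stand-in for the expensive per-cpu work. */
static void do_expensive_drain(void)
{
	drains_done++;
}

/* Record the generation seen on entry, before waiting for the lock. */
unsigned int drain_snapshot(void)
{
	return atomic_load(&drain_gen);
}

/*
 * Drain unless a complete drain already ran since @gen was sampled.
 * Returns true if this call drained, false if it piggybacked.
 */
bool drain_since(unsigned int gen)
{
	bool drained = false;

	pthread_mutex_lock(&drain_lock);
	/*
	 * The counter only changes under the lock, and each drain finishes
	 * before unlocking.  So a changed value means a whole drain started
	 * and finished after our snapshot: nothing is left for us to do.
	 */
	if (atomic_load(&drain_gen) == gen) {
		atomic_fetch_add(&drain_gen, 1);
		do_expensive_drain();
		drained = true;
	}
	pthread_mutex_unlock(&drain_lock);
	return drained;
}

/* The usual entry point: snapshot, then drain or piggyback. */
bool drain_all(void)
{
	return drain_since(drain_snapshot());
}
```

Two callers that both take their snapshot before either gets the lock
result in a single drain: the first one does the work and the second one
returns immediately after acquiring the lock.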

Callers like POSIX_FADV_DONTNEED retry their operations once after
draining per-cpu vectors when pages have unexpected references.

Link: http://lkml.kernel.org/r/157019456205.3142.3369423180908482020.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
Reviewed-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/swap.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/mm/swap.c~mm-swap-piggyback-lru_add_drain_all-calls
+++ a/mm/swap.c
@@ -708,9 +708,10 @@ static void lru_add_drain_per_cpu(struct
  */
 void lru_add_drain_all(void)
 {
+	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu;
+	int cpu, seq;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -719,7 +720,19 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	seq = raw_read_seqcount_latch(&seqcount);
+
 	mutex_lock(&lock);
+
+	/*
+	 * Piggyback on drain started and finished while we waited for lock:
+	 * all pages pending at the time of our entry were drained from vectors.
+	 */
+	if (__read_seqcount_retry(&seqcount, seq))
+		goto done;
+
+	raw_write_seqcount_latch(&seqcount);
+
 	cpumask_clear(&has_work);
 
 	for_each_online_cpu(cpu) {
@@ -740,6 +753,7 @@ void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+done:
 	mutex_unlock(&lock);
 }
 #else
_

Patches currently in -mm which might be from khlebnikov@xxxxxxxxxxxxxx are

mm-swap-piggyback-lru_add_drain_all-calls.patch