Can people please check this patch out for cross-arch issues

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 9 Jul 2018 16:57:46 -0700

We have a funny new bugzilla entry for an issue that is 12 years old,
and not really all that critical, but one that *can* become a problem
for people who do very specific things.

What happens is that 'fork()' will cause a re-try (with
ERESTARTNOINTR) if a signal has come in between the point where the
fork() started, but before we add the process to the process table.
The reason is that the signal possibly *should* affect the new child
process too, but it was never queued to it because we were obviously
in the process of creating it.

That's normally entirely a non-issue, and I don't think anybody ever
imagined it could matter in practice, but apparently there are loads
where this becomes problematic.

See kernel bugzilla at

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which has a trial balloon patch for this issue already, to at least
limit that retry to only signals that actually might affect the child
(ie not any random signals sent explicitly and directly only to the
parent process).

HOWEVER.

The very first thing I noticed while looking at this was that one of
the more expensive parts of the fork() retry is actually marking all
the parent page tables read-only. Now, it's one of _many_ expensive
parts, and it's not nearly as expensive as all the reference counting
we do for each page, but it's actually very easy to avoid. When we
have repeated fork() calls, there's just no point in repeatedly
marking pages read-only.

This the attached one-liner patch.

The reason I'm sending it to the arch people is that while this is a
totally trivial patch on x86 ("pte_write()" literally tests exactly
the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()"
clears), the same is not necessarily always true on other
architectures.

Some other architectures make "ptep_set_wrprotect()" do more than just
clear the one bit we test with "pte_write()".

Honestly, I don't think it could possibly matter: if "pte_write()"
returns false, then whatever "ptep_set_writeprotect()" does can not
really matter (at least for a COW mapping). But I thought I'd send
this out for comments anyway, despite how trivial it looks.

So. Comments?

It doesn't make a huge difference, honestly, and the real fix for the
"retry too eagerly" is completely different, but at the same time this
one-liner trivial fix does feel like the RightThing(tm) regardless.

                 Linus
From 9c1b28c73cac29925511b6b5cf76e7f6860858f8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Mon, 9 Jul 2018 13:19:49 -0700
Subject: [PATCH] mm/cow: don't bother write protecting already write-protected
 pages

This is not normally noticeable, but repeated forks (either because the
process itself uses fork() a lot, or because of the kernel retry due to
signal handling) are unnecessarily expensive because they repeatedly
dirty the parent page tables during the page table copy operation.

It's trivial to just avoid write protecting the page table entry if it
was already not writable.

This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.  It doesn't _fix_ the issue, but it avoids
at least part of the unnecessary overhead when you do end up repeating a
fork().

Cc: Eric Biederman <ebiederm@xxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..57e61d50535f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1017,7 +1017,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
-	if (is_cow_mapping(vm_flags)) {
+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
-- 
2.18.0.132.g95eda3d86