We have a funny new bugzilla entry for an issue that is 12 years old, and not really all that critical, but one that *can* become a problem for people who do very specific things. What happens is that 'fork()' will cause a re-try (with ERESTARTNOINTR) if a signal has come in between the point where the fork() started, but before we add the process to the process table. The reason is that the signal possibly *should* affect the new child process too, but it was never queued to it because we were obviously in the process of creating it. That's normally entirely a non-issue, and I don't think anybody ever imagined it could matter in practice, but apparently there are loads where this becomes problematic. See kernel bugzilla at https://bugzilla.kernel.org/show_bug.cgi?id=200447 which has a trial balloon patch for this issue already, to at least limit that retry to only signals that actually might affect the child (ie not any random signals sent explicitly and directly only to the parent process). HOWEVER. The very first thing I noticed while looking at this was that one of the more expensive parts of the fork() retry is actually marking all the parent page tables read-only. Now, it's one of _many_ expensive parts, and it's not nearly as expensive as all the reference counting we do for each page, but it's actually very easy to avoid. When we have repeated fork() calls, there's just no point in repeatedly marking pages read-only. This the attached one-liner patch. The reason I'm sending it to the arch people is that while this is a totally trivial patch on x86 ("pte_write()" literally tests exactly the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()" clears), the same is not necessarily always true on other architectures. Some other architectures make "ptep_set_wrprotect()" do more than just clear the one bit we test with "pte_write()". Honestly, I don't think it could possibly matter: if "pte_write()" returns false, then whatever "ptep_set_writeprotect()" does can not really matter (at least for a COW mapping). But I thought I'd send this out for comments anyway, despite how trivial it looks. So. Comments? It doesn't make a huge difference, honestly, and the real fix for the "retry too eagerly" is completely different, but at the same time this one-liner trivial fix does feel like the RightThing(tm) regardless. Linus
From 9c1b28c73cac29925511b6b5cf76e7f6860858f8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Date: Mon, 9 Jul 2018 13:19:49 -0700 Subject: [PATCH] mm/cow: don't bother write protecting already write-protected pages This is not normally noticeable, but repeated forks (either because the process itself uses fork() a lot, or because of the kernel retry due to signal handling) are unnecessarily expensive because they repeatedly dirty the parent page tables during the page table copy operation. It's trivial to just avoid write protecting the page table entry if it was already not writable. This patch was inspired by https://bugzilla.kernel.org/show_bug.cgi?id=200447 which points to an ancient "waste time re-doing fork" issue in the presence of lots of signals. It doesn't _fix_ the issue, but it avoids at least part of the unnecessary overhead when you do end up repeating a fork(). Cc: Eric Biederman <ebiederm@xxxxxxxxxxxx> Cc: Oleg Nesterov <oleg@xxxxxxxxxx> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index 7206a634270b..57e61d50535f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1017,7 +1017,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, * If it's a COW mapping, write protect it both * in the parent and the child */ - if (is_cow_mapping(vm_flags)) { + if (is_cow_mapping(vm_flags) && pte_write(pte)) { ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } -- 2.18.0.132.g95eda3d86