On Mon, 9 Jul 2018 16:57:46 -0700 Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > We have a funny new bugzilla entry for an issue that is 12 years old, > and not really all that critical, but one that *can* become a problem > for people who do very specific things. > > What happens is that 'fork()' will cause a re-try (with > ERESTARTNOINTR) if a signal has come in between the point where the > fork() started, but before we add the process to the process table. > The reason is that the signal possibly *should* affect the new child > process too, but it was never queued to it because we were obviously > in the process of creating it. > > That's normally entirely a non-issue, and I don't think anybody ever > imagined it could matter in practice, but apparently there are loads > where this becomes problematic. > > See kernel bugzilla at > > https://bugzilla.kernel.org/show_bug.cgi?id=200447 > > which has a trial balloon patch for this issue already, to at least > limit that retry to only signals that actually might affect the child > (ie not any random signals sent explicitly and directly only to the > parent process). > > HOWEVER. > > The very first thing I noticed while looking at this was that one of > the more expensive parts of the fork() retry is actually marking all > the parent page tables read-only. Now, it's one of _many_ expensive > parts, and it's not nearly as expensive as all the reference counting > we do for each page, but it's actually very easy to avoid. When we > have repeated fork() calls, there's just no point in repeatedly > marking pages read-only. > > This the attached one-liner patch. > > The reason I'm sending it to the arch people is that while this is a > totally trivial patch on x86 ("pte_write()" literally tests exactly > the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()" > clears), the same is not necessarily always true on other > architectures. > > Some other architectures make "ptep_set_wrprotect()" do more than just > clear the one bit we test with "pte_write()". > > Honestly, I don't think it could possibly matter: if "pte_write()" > returns false, then whatever "ptep_set_writeprotect()" does can not > really matter (at least for a COW mapping). But I thought I'd send > this out for comments anyway, despite how trivial it looks. > > So. Comments? > > It doesn't make a huge difference, honestly, and the real fix for the > "retry too eagerly" is completely different, but at the same time this > one-liner trivial fix does feel like the RightThing(tm) regardless. s390 does software dirty-bit tracking which makes the logic for the page protection more bit more complicated. But !pte_write() always implies not writable. As a follow up patch we could remove the pte_write() check in the s390 version of ptep_set_wrprotect. diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 5ab636089c60..e8f9f3cc676a 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1069,8 +1069,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, { pte_t pte = *ptep; - if (pte_write(pte)) - ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte)); + ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte)); } #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.