On Wed, Mar 17, 2021 at 3:23 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
>
> Might take some time to get a system and run tests. We actually had
> difficulty recreating it before this patch too, so it's kind of
> hard to say _that_ was the exact case that previously ran badly and
> is now fixed. We thought just the statistical nature of collisions
> and page / lock contention made things occasionally line up and
> tank.

Yeah, it probably depends a lot on the exact page allocation patterns
and just bad luck.

Plus the actual spinlocks will end up with false sharing anyway, so you
might need to have both contention on multiple pages on the same hash
queue _and_ perhaps some action on other pages that just hash close by,
so that they make the cache coherency protocol work harder..

And then 99% of the time it all just works out fine, because we only
need to look at the hashed spinlocks when we have contention on the
page lock in the first place, and that is generally something we should
avoid for other reasons anyway.

But it is perhaps also a sign that while 256 entries is "ridiculously
small", it's not necessarily much of a real problem in the end, and I
get the feeling that growing the table by several orders of magnitude
is overkill.

In fact, the problems we have had with the page wait queue have often
been because we did too much page locking in the first place.

I actually have a couple of patches in my "still thinking about it"
tree (they have been there for six months by now, so maybe I'm not
thinking too _actively_ about them) which remove some overly stupid
page locking.

For example, look at "do_shared_fault()". Currently __do_fault() always
locks the result page. Then, if you have a page_mkwrite() function, we
immediately unlock it again. And then we lock it again in
do_page_mkwrite().

Does that particular case matter? I really have no idea.
But we basically lock it twice for what looks like absolutely zero
reason other than calling conventions.

Bah.

I'll just attach my three "not very actively thinking about it" patches
here in case anybody else wants to not actively think about them. I've
actually been running these on my own machine since October, since the
kernel I boot on my main desktop always includes my "thinking about it"
patches.

The first two patches should fix the double-lock thing I mentioned.

The third patch should be a no-op right now, but the thinking is
outlined in the commit message: why do we even lock pages in
filemap_fault? I'm not actually convinced we need to. I think we could
return an uptodate page unlocked, and instead do the checking at pte
install time (I had some discussions with Kirill about that approach:
it depends entirely on memory ordering constraints wrt page truncation,
which *does* take the page lock, and then does various other checks -
including the page mapcount and taking the ptl lock - under that lock).

Anyway, I don't mind the "make the hash table larger" patch regardless
of this, but I do want it to be documented a bit more.

                 Linus
From efdc3feadb493ae7f24c573c8b863d7a51117391 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 10:22:00 -0700
Subject: [PATCH 1/3] mm: move final page locking out of __do_fault() helper
 into callers

The old semantics of our __do_fault() helper was that it always locked
the page unless there was an error (or unless the fault handler had
already handled a COW event).

That turns out to be a mistake. Not all callers actually want the page
locked at all, and they might as well check the same VM_FAULT_LOCKED
bit that __do_fault() itself checked to see whether the page is already
locked or not.

This change only moves that final page locking out into the callers,
but intentionally does not actually change any of the locking
semantics: the callers will now just do that final page locking
themselves instead.

That means that future patches may then decide to not lock the page
after all, but this is just preparation for any such future change.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..1e2796d68e43 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3647,11 +3647,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -3940,6 +3935,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3971,6 +3971,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -3994,6 +3999,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
-- 
2.30.0.352.gf6a05684e6
From 1b2b061f622ea60924eba886975183306aa56972 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 10:43:05 -0700
Subject: [PATCH 2/3] mm: don't lock the page, only to immediately unlock it
 again for do_page_mkwrite()

Our page locking during fault handling is a bit messy, and the shared
fault code in particular was locking the page only to immediately
unlock it again because do_page_mkwrite() wanted it unlocked.

We keep the "did we lock it" state around in the VM_FAULT_LOCKED bit,
so let's just use that knowledge, and not first lock it if it wasn't
locked, only to then unlock it again.

It would be even better to transfer the "did we already lock this
page" information into do_page_mkwrite(), because that function will
actually want to lock it eventually anyway, but let's just clean up
one thing at a time.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1e2796d68e43..d7e30273bef1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3999,25 +3999,33 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
 	if (vma->vm_ops->page_mkwrite) {
-		unlock_page(vmf->page);
+		/* do_page_mkwrite() wants the page unlocked */
+		if (ret & VM_FAULT_LOCKED) {
+			unlock_page(vmf->page);
+			ret &= ~VM_FAULT_LOCKED;
+		}
+
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
+
+		/* Did do_page_mkwrite() lock the page again? */
+		ret |= tmp & VM_FAULT_LOCKED;
 	}
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 					VM_FAULT_RETRY))) {
-- 
2.30.0.352.gf6a05684e6
From 7cf5ad81641214d2c4bb6404b64436a787791749 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 11:00:33 -0700
Subject: [PATCH 3/3] mm: do_cow_fault() does not need the source page to be
 locked

This removes the "lock if it wasn't locked" logic from do_cow_fault(),
since we're not even going to install that page into the destination
address space (finish_fault() will use ->cow_page rather than ->page),
and copying the source page does not need the source to be locked.

So instead of doing "lock if it wasn't locked" followed by an
unconditional unlock of the page, just do "unlock if it was locked".

Of course, since all the normal file mapping ->fault() handlers
currently lock the page they return (see filemap_fault() for details),
all of this is pretty much theoretical. But it's the right thing to do:
holding the page lock when we really don't need to is just confusing
and wrong. And this prepares the way for any future changes to
filemap_fault() where we go "Oh, we actually _don't_ need to lock the
page for this case at all".

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d7e30273bef1..44653b5b2af6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3971,16 +3971,13 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
+	if (ret & VM_FAULT_LOCKED)
+		unlock_page(vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
-- 
2.30.0.352.gf6a05684e6