On Wed, Mar 17, 2021 at 3:23 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
>
> Might take some time to get a system and run tests. We actually had
> difficulty recreating it before this patch too, so it's kind of
> hard to say _that_ was the exact case that previously ran badly and
> is now fixed. We thought just the statistical nature of collisions
> and page / lock contention made things occasionally line up and
> tank.

Yeah, it probably depends a lot on the exact page allocation patterns
and just bad luck.

Plus the actual spinlocks will end up with false sharing anyway, so you
might need to have both contention on multiple pages on the same hash
queue _and_ perhaps some action on other pages that just hash close by,
so that they make the cache coherency protocol work harder..

And then 99% of the time it all just works out fine, because we only
need to look at the hashed spinlocks when we have contention on the
page lock in the first place, and that is generally something we should
avoid for other reasons anyway.

But it is perhaps also a sign that while 256 entries is "ridiculously
small", it's not necessarily much of a real problem in the end, and I
get the feeling that growing the table by several orders of magnitude
is overkill.

In fact, the problems we have had with the page wait queue have often
been because we did too much page locking in the first place.

I actually have a couple of patches in my "still thinking about it"
tree (they have been there for six months by now, so maybe I'm not
thinking too _actively_ about them) which remove some overly stupid
page locking.

For example, look at "do_shared_fault()". Currently __do_fault() always
locks the result page. Then, if you have a page_mkwrite() function, we
immediately unlock it again. And then we lock it again in
do_page_mkwrite().

Does that particular case matter? I really have no idea.
But we basically lock it twice for what looks like absolutely zero
reason other than calling conventions.

Bah.

I'll just attach my three "not very actively thinking about it" patches
here in case anybody else wants to not actively think about them. I've
actually been running these on my own machine since October, since the
kernel I boot on my main desktop always includes my "thinking about it"
patches.

The first two patches should fix the double-lock thing I mentioned.

The third patch should be a no-op right now, but the thinking is
outlined in the commit message: why do we even lock pages in
filemap_fault? I'm not actually convinced we need to. I think we could
return an uptodate page unlocked, and instead do the checking at pte
install time (I had some discussions with Kirill about that approach:
it depends entirely on memory ordering constraints wrt page truncation,
which *does* take the page lock, and then does various other checks -
including the page mapcount and taking the ptl lock - under that lock).

Anyway, I don't mind the "make the hash table larger" patch regardless
of this, but I do want it to be documented a bit more.

                 Linus
From efdc3feadb493ae7f24c573c8b863d7a51117391 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 10:22:00 -0700
Subject: [PATCH 1/3] mm: move final page locking out of __do_fault() helper
 into callers

The old semantics of our __do_fault() helper was that it always locked
the page unless there was an error (or unless the fault handler had
already handled a COW event).

That turns out to be a mistake. Not all callers actually want the page
locked at all, and they might as well check the same VM_FAULT_LOCKED
bit that __do_fault() itself checked to see whether the page is already
locked or not.

This change only moves that final page locking out into the callers,
but intentionally does not actually change any of the locking
semantics: the callers will now just do that final page locking
themselves instead.

That means that future patches may then decide to not lock the page
after all, but this is just preparation for any such future change.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..1e2796d68e43 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3647,11 +3647,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -3940,6 +3935,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3971,6 +3971,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -3994,6 +3999,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
-- 
2.30.0.352.gf6a05684e6
From 1b2b061f622ea60924eba886975183306aa56972 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 10:43:05 -0700
Subject: [PATCH 2/3] mm: don't lock the page, only to immediately unlock it
 again for do_page_mkwrite()

Our page locking during fault handling is a bit messy, and the shared
fault code in particular was locking the page only to immediately
unlock it again because do_page_mkwrite() wanted it unlocked.

We keep the "did we lock it" state around in the VM_FAULT_LOCKED bit,
so let's just use that knowledge, and not first lock it if it wasn't
locked, only to then unlock it again.

It would be even better to transfer the "did we already lock this
page" information into do_page_mkwrite(), because that function will
actually want to lock it eventually anyway, but let's just clean up
one thing at a time.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1e2796d68e43..d7e30273bef1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3999,25 +3999,33 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
 	if (vma->vm_ops->page_mkwrite) {
-		unlock_page(vmf->page);
+		/* do_page_mkwrite() wants the page unlocked */
+		if (ret & VM_FAULT_LOCKED) {
+			unlock_page(vmf->page);
+			ret &= ~VM_FAULT_LOCKED;
+		}
+
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
+
+		/* Did do_page_mkwrite() lock the page again? */
+		ret |= tmp & VM_FAULT_LOCKED;
 	}
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 					VM_FAULT_RETRY))) {
-- 
2.30.0.352.gf6a05684e6
From 7cf5ad81641214d2c4bb6404b64436a787791749 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 13 Oct 2020 11:00:33 -0700
Subject: [PATCH 3/3] mm: do_cow_fault() does not need the source page to be
 locked

This removes the "lock if it wasn't locked" logic from do_cow_fault(),
since we're not even going to install that page into the destination
address space (finish_fault() will use ->cow_page rather than ->page),
and copying the source page does not need the source to be locked.

So instead of doing "lock if it wasn't locked" followed by an
unconditional unlock of the page, just do "unlock if it was locked".

Of course, since all the normal file mapping ->fault() handlers
currently lock the page they return (see filemap_fault() for details),
all of this is pretty much theoretical. But it's the right thing to do:
holding the page lock when we really don't need to is just confusing
and wrong. And this prepares the way for any future changes to
filemap_fault() where we go "Oh, we actually _don't_ need to lock the
page for this case at all".

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d7e30273bef1..44653b5b2af6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3971,16 +3971,13 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
+	if (ret & VM_FAULT_LOCKED)
+		unlock_page(vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
-- 
2.30.0.352.gf6a05684e6