Re: [linux-next:master] [mm] 5df397dec7: will-it-scale.per_thread_ops -53.3% regression

On Tue, Dec 6, 2022 at 10:41 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Let me think about this a while, but I think I'll have a patch for you
> to test once I've dealt with a couple more pull requests.

So here's a trial balloon for you to try, to see if it mostly fixes
the regression.

It still limits batching (because unlike the full "gather pages until
you have to flush" case, this is all batched under the page table
lock), but it limits it a bit less: it will use a second active batch
if it has only used the initial on-stack one (which is called "local",
which is not a great name in this context, but whatever).

This _should_ mean that that benchmark will now batch ~512 pages
instead of just 8.
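
For reference, here's a tiny standalone sketch of where those numbers
come from - not the kernel code itself, just the arithmetic, assuming
the usual 4k PAGE_SIZE and a batch header that mirrors the real
struct mmu_gather_batch layout:

	/*
	 * Sketch only: the on-stack "local" batch holds 8 page
	 * pointers, while an allocated batch is a single page whose
	 * remaining space after the header is filled with pointers.
	 */
	#include <stdio.h>

	#define PAGE_SIZE		4096
	#define MMU_GATHER_BUNDLE	8	/* on-stack batch slots */

	struct batch_header {
		void		*next;
		unsigned int	nr;
		unsigned int	max;
	};

	int main(void)
	{
		unsigned long full_batch =
			(PAGE_SIZE - sizeof(struct batch_header)) / sizeof(void *);

		printf("on-stack batch:  %d pages\n", MMU_GATHER_BUNDLE);
		printf("allocated batch: %lu pages (~512)\n", full_batch);
		return 0;
	}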

Which should be pretty much what it effectively used to do before too,
because the dirty shared page case has always caused that
"force_flush" thing, so it would always have stopped to flush at every
page directory anyway.

(But we still have that extra rmap flushing limit, because there could
have been _previous_ buffered page pointers that weren't dirty shared
pages, and we don't want to have to deal with that pain, so we might
have to exit batching early in order to avoid it.)

I can imagine cleaner ways to do this, but they would involve having
to remember which batch we started having dirty pages in, which is
more bookkeeping pain than I really think it's worth.

Does this fix the regression?

             Linus
 mm/mmu_gather.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 8247553a69c2..2b93cf6ac9ae 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -19,8 +19,8 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
-	/* No more batching if we have delayed rmaps pending */
-	if (tlb->delayed_rmap)
+	/* Limit batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap && tlb->active != &tlb->local)
 		return false;
 
 	batch = tlb->active;
@@ -48,31 +48,35 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 }
 
 #ifdef CONFIG_SMP
+static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
+{
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+}
+
 /**
  * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
  * @tlb: the current mmu_gather
  *
  * Note that because of how tlb_next_batch() above works, we will
- * never start new batches with pending delayed rmaps, so we only
- * need to walk through the current active batch.
+ * never start multiple new batches with pending delayed rmaps, so
+ * we only need to walk through the current active batch and the
+ * original local one.
  */
 void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	struct mmu_gather_batch *batch;
-
 	if (!tlb->delayed_rmap)
 		return;
 
-	batch = tlb->active;
-	for (int i = 0; i < batch->nr; i++) {
-		struct encoded_page *enc = batch->encoded_pages[i];
-
-		if (encoded_page_flags(enc)) {
-			struct page *page = encoded_page_ptr(enc);
-			page_remove_rmap(page, vma, false);
-		}
-	}
-
+	tlb_flush_rmap_batch(&tlb->local, vma);
+	if (tlb->active != &tlb->local)
+		tlb_flush_rmap_batch(tlb->active, vma);
 	tlb->delayed_rmap = 0;
 }
 #endif
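
And to make the effect of the tlb_next_batch() change concrete, here's
a small standalone model (plain userspace C, not the kernel code - the
8-entry local batch and 510-entry allocated batch are the assumptions
from the sketch above) of how far buffering now gets once delayed_rmap
is set:

	/*
	 * Model only: "gather" stands in for struct mmu_gather.  With
	 * delayed rmaps pending, a new batch is only allowed while we
	 * are still on the initial local batch, so buffering stops
	 * after local + one allocated batch instead of after 8 pages.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define LOCAL_BATCH	8
	#define FULL_BATCH	510

	struct gather {
		int	batches_used;	/* 1 == only the local batch so far */
		int	active_nr;	/* entries in the active batch */
		int	active_max;	/* capacity of the active batch */
		bool	delayed_rmap;
	};

	static bool next_batch(struct gather *tlb)
	{
		/* Limit batching if we have delayed rmaps pending */
		if (tlb->delayed_rmap && tlb->batches_used > 1)
			return false;

		tlb->batches_used++;
		tlb->active_nr = 0;
		tlb->active_max = FULL_BATCH;
		return true;
	}

	/* Returns false when the caller has to stop and flush */
	static bool buffer_page(struct gather *tlb)
	{
		if (tlb->active_nr == tlb->active_max && !next_batch(tlb))
			return false;
		tlb->active_nr++;
		return true;
	}

	int main(void)
	{
		struct gather tlb = {
			.batches_used = 1,
			.active_max = LOCAL_BATCH,
			.delayed_rmap = true,
		};
		int buffered = 0;

		while (buffer_page(&tlb))
			buffered++;

		printf("buffered %d pages before forcing a flush\n", buffered);
		return 0;
	}

With the old "no more batching" check, the same model stops at 8.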
