Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

David Hildenbrand <david@xxxxxxxxxx> · Wed, 31 Jan 2024 11:16:01 +0100

On 31.01.24 03:20, Yin Fengwei wrote:
On 1/29/24 22:32, David Hildenbrand wrote:
This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
  https://lkml.kernel.org/r/20240129143221.263763-8-david@xxxxxxxxxx

Context: I'm adjusting MMU gather code to support batching of 
consecutive pages that belong to the same large folio, when 
unmapping/zapping PTEs.

For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 
PTEs: Instead of adding 512 individual encoded_page entries, we add a 
combined entry that expresses "page+nr_pages". That allows for "easily" 
adding various other per-folio batching (refcount, rmap, swap freeing).

The implication is that we can now effectively batch more pages with
large folios, exceeding the old 10000 limit. The number of involved
*folios* does not increase, though.
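
To illustrate the "page+nr_pages" idea, here is a minimal userspace sketch.
It is not the code from the patch: struct batch, batch_add(),
batch_total_pages() and the NR_PAGES_FOLLOWS flag are made-up names, and the
510-slot capacity assumes a 4k batch page on x86-64. A single page takes one
slot; a range of a large folio takes two slots (flagged page, then count).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SLOTS      510     /* assumed: slots in one 4k batch page */
#define NR_PAGES_FOLLOWS 0x1UL   /* hypothetical flag kept in a low pointer bit */

struct batch {
	size_t nr;                   /* slots in use */
	uintptr_t slots[BATCH_SLOTS];
};

/* Queue nr_pages consecutive pages starting at 'page'; false if no room. */
static bool batch_add(struct batch *b, uintptr_t page, unsigned long nr_pages)
{
	size_t need = nr_pages > 1 ? 2 : 1;

	assert(nr_pages >= 1 && (page & NR_PAGES_FOLLOWS) == 0);
	if (b->nr + need > BATCH_SLOTS)
		return false;
	if (nr_pages == 1) {
		b->slots[b->nr++] = page;
	} else {
		/* Two slots: the flagged page, then the page count. */
		b->slots[b->nr++] = page | NR_PAGES_FOLLOWS;
		b->slots[b->nr++] = nr_pages;
	}
	return true;
}

/* Sum the pages represented by all entries in the batch. */
static unsigned long batch_total_pages(const struct batch *b)
{
	unsigned long total = 0;

	for (size_t i = 0; i < b->nr; i++) {
		if (b->slots[i] & NR_PAGES_FOLLOWS)
			total += b->slots[++i];  /* consume the count slot */
		else
			total += 1;
	}
	return total;
}
```

With this layout, one small page plus one fully zapped 2M folio occupy three
slots but account for 513 pages.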

One possible scenario:
If all folios are 2M in size, then one full batch could hold 510M of memory.
Is that too much, given that one full batch could previously hold only
(2M - 4096 * 2) of memory?

Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k) 
per encoded_page entry in the batch. Now, we can have (with 2M folios), 
512 pages for every two encoded_page entries (page+nr_pages) in a batch 
page. So an average ~256 pages per encoded_page entry.

So one batch page can now store in the worst case ~256 times the number
of pages, but the number of folio fragments ("page+nr_pages") would not
increase.

The time it takes to perform the actual page freeing of a batch will not 
be 256 times higher -- the time is expected to be much closer to the old 
time (i.e., not freeing more folios).
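
The numbers in this part can be checked with a small calculation. This is
only a sketch under assumptions: 4k base pages, a 16-byte batch header and
8-byte entries (x86-64); the helper names are made up for illustration.

```c
#include <assert.h>

#define PAGE_SZ      4096UL  /* assumed: 4k base pages */
#define BATCH_HDR_SZ 16UL    /* assumed: batch header size on 64-bit */
#define SLOT_SZ      8UL     /* one encoded_page entry (a pointer) */
#define FOLIO_PAGES  512UL   /* 2M folio / 4k pages */

/* Entries that fit into one batch page. */
static unsigned long slots_per_batch(void)
{
	return (PAGE_SZ - BATCH_HDR_SZ) / SLOT_SZ;
}

/* Memory covered by one full batch page of single-page entries. */
static unsigned long bytes_small(void)
{
	return slots_per_batch() * PAGE_SZ;
}

/* Memory covered by one full batch page of two-slot 2M folio entries. */
static unsigned long bytes_large(void)
{
	return (slots_per_batch() / 2) * FOLIO_PAGES * PAGE_SZ;
}
```

Under these assumptions a batch page has 510 entries, which gives
(2M - 4096 * 2) with 4k folios and 510M with 2M folios.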

(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap
pending. The reason is that we don't want to check many entries for
whether they require delayed rmap handling while still holding the page
table lock (see tlb_flush_rmaps()), because we have to remove the rmap
before dropping the PTL.

Note that we perform the check whether we need delayed rmap handling per 
page+nr_pages entry, not per page. So we won't perform more such checks.

Once we set tlb->delayed_rmap (because we add one entry that requires 
it), we already force a flush before dropping the PT lock. So once we 
get a single delayed rmap entry in there, we will not batch more than we 
could have in the same page table: so not more than 512 entries (x86-64) 
in the worst case. So it will still be bounded, and not significantly 
more than what we had before.

So regarding delayed rmap handling I think this should be fine.
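
A rough userspace model of that early limit, purely for illustration: the
struct, its fields, gather_next_batch() and the cap of 19 batches are
assumptions made for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BATCHES 19   /* assumed cap on additional batch pages */

struct gather {
	bool delayed_rmap;    /* a queued entry needs rmap removal under the PTL */
	bool on_first_batch;  /* still filling the initial batch */
	int  batch_count;     /* additional batch pages handed out so far */
};

/* Returns true if another batch may be used; false forces a flush first. */
static bool gather_next_batch(struct gather *g)
{
	/* With delayed rmap pending, allow at most one step past the first batch. */
	if (g->delayed_rmap && !g->on_first_batch)
		return false;
	if (g->batch_count == MAX_BATCHES)
		return false;
	g->batch_count++;
	g->on_first_batch = false;
	return true;
}
```

In this model, a gather with delayed rmap pending gets one more batch and is
then refused, while one without delayed rmap keeps growing up to the cap.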

(3) Total batched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we 
allocate (full batches), and thereby limits the number of pages we were 
able to batch.

The old limit was ~10000 pages, now we could batch ~5000 folio fragments
(page+nr_pages), resulting in the "times 256" increase in the worst
case on x86-64 as you point out.
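
As a back-of-the-envelope check of these totals, assuming 510 entries per
batch page (x86-64) and ignoring the small on-stack batch; the names below
are illustrative only:

```c
#include <assert.h>

#define BATCH_SLOTS 510UL                    /* assumed entries per batch page */
#define BATCH_LIMIT (10000UL / BATCH_SLOTS)  /* batch pages under the ~10000 cap */
#define FOLIO_PAGES 512UL                    /* 2M folio / 4k pages */

/* Entries across all full batches: the old upper bound on pages. */
static unsigned long max_entries(void)
{
	return BATCH_LIMIT * BATCH_SLOTS;
}

/* Two entries per folio fragment (page+nr_pages). */
static unsigned long max_fragments(void)
{
	return max_entries() / 2;
}

/* Worst case: every fragment is a fully zapped 2M folio. */
static unsigned long max_pages_large(void)
{
	return max_fragments() * FOLIO_PAGES;
}
```

That gives 9690 entries and 4845 fragments, so up to 2480640 pages in the
worst case, which is exactly 256 times the old bound.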

This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit 
mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") where we 
wanted to handle soft-lockups.

As the number of effective folios we are freeing does not increase, I 
*think* this should be fine.

If any of that is a problem, we would have to keep track of the total
number of pages in our batch, and stop as soon as we hit our 10000 limit
-- independent of page vs. folio fragment. Something I would like to
avoid if possible.
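
For reference, such tracking could look roughly like the following sketch.
struct page_budget, budget_add() and budget_flush() are hypothetical names;
nothing like this is part of the series.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BATCHED_PAGES 10000UL  /* the page limit discussed above */

struct page_budget {
	unsigned long total_pages;     /* pages queued since the last flush */
};

/* Account one entry; returns true once the limit is hit and a flush is due. */
static bool budget_add(struct page_budget *b, unsigned long nr_pages)
{
	assert(nr_pages >= 1);
	b->total_pages += nr_pages;
	return b->total_pages >= MAX_BATCHED_PAGES;
}

/* Reset the accounting after the batch has been flushed and freed. */
static void budget_flush(struct page_budget *b)
{
	b->total_pages = 0;
}
```

With 2M folios, the 20th fully zapped folio would trigger the flush in this
model, whereas order-0 pages would still be flushed after 10000 entries.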

--
Cheers,

David / dhildenb