On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
>
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
>
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio-pages batches as well, and makes use of our new rmap batching
> function when removing the rmap.
>
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.

One possible scenario:
If every folio is a 2M folio, then one full batch could hold 510M of
memory. Is that too much, considering that one full batch could
previously hold only (2M - 4096 * 2) of memory?


Regards
Yin, Fengwei
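
For reference, the arithmetic behind the two capacities in the question above can be sketched as follows. This is only a rough check, not code from the series: it assumes 4 KiB base pages, 8-byte batch slots, a 16-byte batch header, and that a large-folio entry occupies two slots (one for the page, one for the page count). The function names are hypothetical, and "one full batch" is taken to mean a single batch page.

```python
# Back-of-the-envelope check of the two batch capacities in the question.
# Every constant below is an assumption (typical 64-bit, 4 KiB pages),
# not a value taken from the patch series itself.
PAGE_SIZE = 4096               # assumed base page size in bytes
SLOT_SIZE = 8                  # assumed size of one batch slot (a pointer)
HEADER_SIZE = 16               # assumed size of the per-batch header
FOLIO_SIZE = 2 * 1024 * 1024   # 2M folio, as in the scenario
SLOTS_PER_LARGE_FOLIO = 2      # assumption: page slot + page-count slot


def slots_per_batch():
    """Number of slots that fit in one batch page after the header."""
    return (PAGE_SIZE - HEADER_SIZE) // SLOT_SIZE


def old_capacity():
    """Bytes covered when every slot refers to one base page."""
    return slots_per_batch() * PAGE_SIZE


def new_capacity():
    """Bytes covered when every pair of slots refers to one 2M folio."""
    return (slots_per_batch() // SLOTS_PER_LARGE_FOLIO) * FOLIO_SIZE


if __name__ == "__main__":
    print("slots per batch:", slots_per_batch())
    print("old capacity (bytes):", old_capacity())
    print("new capacity (bytes):", new_capacity())
    print("growth factor:", new_capacity() // old_capacity())
```

Under these assumptions the old figure comes out as 2M - 4096 * 2 and the new one as 510M, matching the numbers in the question.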