On 10/08/2023 15:44, Zi Yan wrote:
> On 10 Aug 2023, at 6:33, Ryan Roberts wrote:
>
>> mmu_gather accumulates a set of pages into a buffer for later rmap
>> removal and freeing. Page pointers were previously stored in a "linked
>> list of arrays", then at flush time, each page in the buffer was removed
>> from the rmap, removed from the swapcache and its refcount was
>> decremented; if the refcount reached 0, then it was freed.
>>
>> With increasing numbers of large folios (or at least contiguous parts of
>> large folios) mapped into userspace processes (pagecache pages for
>> supporting filesystems currently, but in future also large anonymous
>> folios), we can measurably improve performance of process teardown:
>>
>> - For rmap removal, we can batch-remove a range of pages belonging to
>>   the same folio with folio_remove_rmap_range(), which is more efficient
>>   because atomics can be manipulated just once per range. In the common
>>   case, it also allows us to elide adding the (anon) folio to the
>>   deferred split queue, only to remove it a bit later, once all pages of
>>   the folio have been removed from the rmap.
>>
>> - For swapcache removal, we only need to check and remove the folio from
>>   the swap cache once, rather than trying for each individual page.
>>
>> - For page release, we can batch-decrement the refcount for each page in
>>   the folio and free it if it hits zero.
>>
>> Change the page pointer storage format within the mmu_gather batch
>> structure to store "folio_range"s; a [start, end) page pointer pair.
>> This allows us to run length encode a contiguous range of pages that all
>> belong to the same folio. This likely allows us to improve cache
>> locality a bit. But it also gives us a convenient format for
>> implementing the above 3 optimizations.
>>
>> Of course if running on a system that does not extensively use large
>> pte-mapped folios, then the RLE approach uses twice as much memory,
>> because each range is 1 page long and uses 2 pointers. But performance
>> measurements show no impact in practice.
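(As an aside, to make that worst-case space overhead concrete; these numbers
assume 4K pages, 8-byte pointers and a 16-byte struct mmu_gather_batch header,
so treat them as approximate:

    before: (4096 - 16) / 8  = 510 page pointers per batch page
    after:  (4096 - 16) / 16 = 255 ranges per batch page

So a workload where every range degenerates to a single page fits half as many
entries per batch page, while a workload dominated by large pte-mapped folios
needs far fewer entries overall.)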
>>
>> Macro Performance Results
>> -------------------------
>>
>> Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
>> Configs: Comparing with and without large anon folios
>>
>> Without large anon folios:
>> | kernel           |   real-time |   kern-time |   user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-off |        0.0% |        0.0% |        0.0% |
>> | mmugather-range  |       -0.3% |       -0.3% |        0.1% |
>>
>> With large anon folios (order-3):
>> | kernel           |   real-time |   kern-time |   user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-on  |        0.0% |        0.0% |        0.0% |
>> | mmugather-range  |       -0.7% |       -3.9% |       -0.1% |
>>
>> Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
>> Configs: Comparing with and without large anon folios
>>
>> Without large anon folios:
>> | kernel           |   real-time |   kern-time |   user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-off |        0.0% |        0.0% |        0.0% |
>> | mmugather-range  |       -0.9% |       -2.9% |       -0.6% |
>>
>> With large anon folios (order-3):
>> | kernel           |   real-time |   kern-time |   user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-on  |        0.0% |        0.0% |        0.0% |
>> | mmugather-range  |       -0.4% |       -3.7% |       -0.2% |
>>
>> Micro Performance Results
>> -------------------------
>>
>> Flame graphs for kernel compilation on Ampere Altra show reduction in
>> cycles consumed by __arm64_sys_exit_group syscall:
>>
>> Without large anon folios:  -2%
>> With large anon folios:    -26%
>>
>> For the large anon folios case, it also shows a big difference in cost
>> of rmap removal:
>>
>> baseline:        cycles in page_remove_rmap():        24.7B
>> mmugather-range: cycles in folio_remove_rmap_range():  5.5B
>>
>> Furthermore, the baseline shows 5.2B cycles used by
>> deferred_split_folio() which has completely disappeared after
>> applying this series.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>  include/asm-generic/tlb.h |  7 +--
>>  include/linux/mm.h        |  7 +++
>>  include/linux/swap.h      |  6 +--
>>  mm/mmu_gather.c           | 56 ++++++++++++++++++++----
>>  mm/swap.c                 | 91 +++++++++++++++++++++++++++++++++++++++
>>  mm/swap_state.c           | 11 ++---
>>  6 files changed, 158 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index d874415aaa33..fe300a64e59d 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -246,11 +246,11 @@ struct mmu_gather_batch {
>>          struct mmu_gather_batch *next;
>>          unsigned int nr;
>>          unsigned int max;
>> -        struct page *pages[];
>> +        struct folio_range ranges[];
>>  };
>>
>>  #define MAX_GATHER_BATCH \
>> -        ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
>> +        ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))
>>
>>  /*
>>   * Limit the maximum number of mmu_gather batches to reduce a risk of soft
>> @@ -342,7 +342,8 @@ struct mmu_gather {
>>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>          struct mmu_gather_batch *active;
>>          struct mmu_gather_batch local;
>> -        struct page *__pages[MMU_GATHER_BUNDLE];
>> +        struct folio_range __ranges[MMU_GATHER_BUNDLE];
>> +        struct page *range_limit;
>>          struct mmu_gather_batch *rmap_pend;
>>          unsigned int rmap_pend_first;
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 914e08185272..f86c905a065d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
>>                  __folio_put(folio);
>>  }
>>
>> +struct folio_range {
>> +        struct page *start;
>> +        struct page *end;
>> +};
>
> I see end is used for calculating nr_pages multiple times below. Maybe just
> use nr_pages instead of end here.

But then I'd need to calculate end (= start + nr_pages) every time
__tlb_remove_page() is called, to figure out if the page being removed is the
next contiguous page in the current range. __tlb_remove_page() gets called for
every page, but the way I currently do it, I only calculate nr_pages once per
range. So I think my way is more efficient?

>
> Also, struct page (memmap) might not always be contiguous, so using struct
> page pointers to represent a folio range might not give the result you want.
> See nth_page() and folio_page_idx() in include/linux/mm.h.

Is that true for pages within the same folio too? Or are all pages in a folio
guaranteed to be contiguous? Perhaps I'm better off using pfns? I've put a
rough, untested sketch of what I mean at the bottom of this mail.

>
>> +
>> +void folios_put_refs(struct folio_range *folios, int nr);
>> +
>>  /*
>>   * union release_pages_arg - an array of pages or folios
>>   *
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f199df803b33..06a7cf3ad6c9 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -453,7 +453,7 @@ static inline unsigned long total_swapcache_pages(void)
>>
>>  extern void free_swap_cache(struct page *page);
>>  extern void free_page_and_swap_cache(struct page *);
>> -extern void free_pages_and_swap_cache(struct page **, int);
>> +extern void free_folios_and_swap_cache(struct folio_range *, int);
>>  /* linux/mm/swapfile.c */
>>  extern atomic_long_t nr_swap_pages;
>>  extern long total_swap_pages;
>> @@ -530,8 +530,8 @@ static inline void put_swap_device(struct swap_info_struct *si)
>>   * so leave put_page and release_pages undeclared...
>>   */
>>  #define free_page_and_swap_cache(page) \
>>          put_page(page)
>> -#define free_pages_and_swap_cache(pages, nr) \
>> -        release_pages((pages), (nr));
>> +#define free_folios_and_swap_cache(folios, nr) \
>> +        folios_put_refs((folios), (nr))
>>
>>  /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
>>  #define free_swap_and_cache(e) is_pfn_swap_entry(e)
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 5d100ac85e21..fd2ea7577817 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -22,6 +22,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
>>          batch = tlb->active;
>>          if (batch->next) {
>>                  tlb->active = batch->next;
>> +                tlb->range_limit = NULL;
>>                  return true;
>>          }
>>
>> @@ -39,6 +40,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
>>
>>          tlb->active->next = batch;
>>          tlb->active = batch;
>> +        tlb->range_limit = NULL;
>>
>>          return true;
>>  }
>> @@ -49,9 +51,11 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch,
>>                                   struct vm_area_struct *vma)
>>  {
>>          for (int i = first; i < batch->nr; i++) {
>> -                struct page *page = batch->pages[i];
>> +                struct folio_range *range = &batch->ranges[i];
>> +                int nr = range->end - range->start;
>> +                struct folio *folio = page_folio(range->start);
>>
>> -                page_remove_rmap(page, vma, false);
>> +                folio_remove_rmap_range(folio, range->start, nr, vma);
>>          }
>>  }
>>
>> @@ -75,6 +79,11 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
>>          for (batch = batch->next; batch && batch->nr; batch = batch->next)
>>                  tlb_flush_rmap_batch(batch, 0, vma);
>>
>> +        /*
>> +         * Move to the next range on next page insertion to prevent any future
>> +         * pages from being accumulated into the range we just did the rmap for.
>> +         */
>> +        tlb->range_limit = NULL;
>>          tlb_discard_rmaps(tlb);
>>  }
>>
>> @@ -94,7 +103,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>          struct mmu_gather_batch *batch;
>>
>>          for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>> -                struct page **pages = batch->pages;
>> +                struct folio_range *ranges = batch->ranges;
>>
>>                  do {
>>                          /*
>> @@ -102,14 +111,15 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>                           */
>>                          unsigned int nr = min(512U, batch->nr);
>>
>> -                        free_pages_and_swap_cache(pages, nr);
>> -                        pages += nr;
>> +                        free_folios_and_swap_cache(ranges, nr);
>> +                        ranges += nr;
>>                          batch->nr -= nr;
>>
>>                          cond_resched();
>>                  } while (batch->nr);
>>          }
>>          tlb->active = &tlb->local;
>> +        tlb->range_limit = NULL;
>>          tlb_discard_rmaps(tlb);
>>  }
>>
>> @@ -127,6 +137,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>>  bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
>>  {
>>          struct mmu_gather_batch *batch;
>> +        struct folio_range *range;
>>
>>          VM_BUG_ON(!tlb->end);
>>
>> @@ -135,11 +146,37 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>>  #endif
>>
>>          batch = tlb->active;
>> +        range = &batch->ranges[batch->nr - 1];
>> +
>> +        /*
>> +         * If there is a range being accumulated, add the page to the range if
>> +         * it's contiguous, else start the next range. range_limit is always NULL
>> +         * when nr is 0, which protects the batch->ranges[-1] case.
>> +         */
>> +        if (tlb->range_limit && page == range->end) {
>> +                range->end++;
>> +        } else {
>> +                struct folio *folio = page_folio(page);
>> +
>> +                range = &batch->ranges[batch->nr++];
>> +                range->start = page;
>> +                range->end = page + 1;
>> +
>> +                tlb->range_limit = &folio->page + folio_nr_pages(folio);
>> +        }
>> +
>> +        /*
>> +         * If we have reached the end of the folio, move to the next range when
>> +         * we add the next page; never span multiple folios in the same range.
>> +         */
>> +        if (range->end == tlb->range_limit)
>> +                tlb->range_limit = NULL;
>> +
>>          /*
>> -         * Add the page and check if we are full. If so
>> -         * force a flush.
>> +         * Check if we are full. If so force a flush. In order to ensure we
>> +         * always have a free range for the next added page, the last range in a
>> +         * batch always only has a single page.
>>           */
>> -        batch->pages[batch->nr++] = page;
>>          if (batch->nr == batch->max) {
>>                  if (!tlb_next_batch(tlb))
>>                          return true;
>> @@ -318,8 +355,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>>          tlb->need_flush_all = 0;
>>          tlb->local.next = NULL;
>>          tlb->local.nr = 0;
>> -        tlb->local.max = ARRAY_SIZE(tlb->__pages);
>> +        tlb->local.max = ARRAY_SIZE(tlb->__ranges);
>>          tlb->active = &tlb->local;
>> +        tlb->range_limit = NULL;
>>          tlb->batch_count = 0;
>>          tlb->rmap_pend = &tlb->local;
>>          tlb->rmap_pend_first = 0;
>> diff --git a/mm/swap.c b/mm/swap.c
>> index b05cce475202..e238d3623fcb 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -1041,6 +1041,97 @@ void release_pages(release_pages_arg arg, int nr)
>>  }
>>  EXPORT_SYMBOL(release_pages);
>>
>> +/**
>> + * folios_put_refs - batched folio_put_refs()
>> + * @folios: array of `struct folio_range`s to release
>> + * @nr: number of folio ranges
>> + *
>> + * Each `struct folio_range` describes the start and end page of a range within
>> + * a folio. The folio reference count is decremented once for each page in the
>> + * range. If it falls to zero, remove the folio from the LRU and free it.
>> + */
>> +void folios_put_refs(struct folio_range *folios, int nr)
>> +{
>> +        int i;
>> +        LIST_HEAD(pages_to_free);
>> +        struct lruvec *lruvec = NULL;
>> +        unsigned long flags = 0;
>> +        unsigned int lock_batch;
>> +
>> +        for (i = 0; i < nr; i++) {
>> +                struct folio *folio = page_folio(folios[i].start);
>> +                int refs = folios[i].end - folios[i].start;
>> +
>> +                /*
>> +                 * Make sure the IRQ-safe lock-holding time does not get
>> +                 * excessive with a continuous string of pages from the
>> +                 * same lruvec. The lock is held only if lruvec != NULL.
>> +                 */
>> +                if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
>> +                        unlock_page_lruvec_irqrestore(lruvec, flags);
>> +                        lruvec = NULL;
>> +                }
>> +
>> +                if (is_huge_zero_page(&folio->page))
>> +                        continue;
>> +
>> +                if (folio_is_zone_device(folio)) {
>> +                        if (lruvec) {
>> +                                unlock_page_lruvec_irqrestore(lruvec, flags);
>> +                                lruvec = NULL;
>> +                        }
>> +                        if (put_devmap_managed_page(&folio->page))
>> +                                continue;
>> +                        if (folio_put_testzero(folio))
>> +                                free_zone_device_page(&folio->page);
>> +                        continue;
>> +                }
>> +
>> +                if (!folio_ref_sub_and_test(folio, refs))
>> +                        continue;
>> +
>> +                if (folio_test_large(folio)) {
>> +                        if (lruvec) {
>> +                                unlock_page_lruvec_irqrestore(lruvec, flags);
>> +                                lruvec = NULL;
>> +                        }
>> +                        __folio_put_large(folio);
>> +                        continue;
>> +                }
>> +
>> +                if (folio_test_lru(folio)) {
>> +                        struct lruvec *prev_lruvec = lruvec;
>> +
>> +                        lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
>> +                                                             &flags);
>> +                        if (prev_lruvec != lruvec)
>> +                                lock_batch = 0;
>> +
>> +                        lruvec_del_folio(lruvec, folio);
>> +                        __folio_clear_lru_flags(folio);
>> +                }
>> +
>> +                /*
>> +                 * In rare cases, when truncation or holepunching raced with
>> +                 * munlock after VM_LOCKED was cleared, Mlocked may still be
>> +                 * found set here. This does not indicate a problem, unless
>> +                 * "unevictable_pgs_cleared" appears worryingly large.
>> +                 */
>> +                if (unlikely(folio_test_mlocked(folio))) {
>> +                        __folio_clear_mlocked(folio);
>> +                        zone_stat_sub_folio(folio, NR_MLOCK);
>> +                        count_vm_event(UNEVICTABLE_PGCLEARED);
>> +                }
>> +
>> +                list_add(&folio->lru, &pages_to_free);
>> +        }
>> +        if (lruvec)
>> +                unlock_page_lruvec_irqrestore(lruvec, flags);
>> +
>> +        mem_cgroup_uncharge_list(&pages_to_free);
>> +        free_unref_page_list(&pages_to_free);
>> +}
>> +
>>  /*
>>   * The folios which we're about to release may be in the deferred lru-addition
>>   * queues. That would prevent them from really being freed right now. That's
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 73b16795b0ff..526bbd5a2ce1 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -304,15 +304,16 @@ void free_page_and_swap_cache(struct page *page)
>>  }
>>
>>  /*
>> - * Passed an array of pages, drop them all from swapcache and then release
>> - * them. They are removed from the LRU and freed if this is their last use.
>> + * Passed an array of folio ranges, drop all folios from swapcache and then put
>> + * a folio reference for each page in the range. They are removed from the LRU
>> + * and freed if this is their last use.
>>   */
>> -void free_pages_and_swap_cache(struct page **pages, int nr)
>> +void free_folios_and_swap_cache(struct folio_range *folios, int nr)
>>  {
>>          lru_add_drain();
>>          for (int i = 0; i < nr; i++)
>> -                free_swap_cache(pages[i]);
>> -        release_pages(pages, nr);
>> +                free_swap_cache(folios[i].start);
>> +        folios_put_refs(folios, nr);
>>  }
>>
>>  static inline bool swap_use_vma_readahead(void)
>> --
>> 2.25.1
>
>
> --
> Best Regards,
> Yan, Zi
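
To make the pfn idea above concrete, here is roughly what I have in mind. This
is just an untested sketch to illustrate the shape of it, not something I'm
proposing verbatim; the struct and function names (folio_pfn_range,
folio_pfn_range_try_extend, folio_pfn_range_nr) are made up for the example:

/*
 * Sketch only: store pfns instead of struct page pointers in the range, so
 * that the "is this page the next one in the range?" check does not depend on
 * the memmap being virtually contiguous across the whole folio.
 */
struct folio_pfn_range {
	unsigned long pfn_start;
	unsigned long pfn_end;		/* exclusive */
};

/*
 * Called for every page removed by the mmu_gather: extend the current range
 * if this page is physically contiguous with it, otherwise tell the caller to
 * start a new range. Because the (exclusive) end is stored, the per-page hot
 * path stays a single compare; nr_pages is only derived once per range.
 */
static inline bool folio_pfn_range_try_extend(struct folio_pfn_range *range,
					      struct page *page)
{
	if (page_to_pfn(page) != range->pfn_end)
		return false;

	range->pfn_end++;
	return true;
}

/*
 * At flush time, recover the folio and the number of pages from the pfns;
 * e.g. this is where folio_remove_rmap_range() would be called.
 */
static inline unsigned int folio_pfn_range_nr(struct folio_pfn_range *range,
					      struct folio **foliop)
{
	*foliop = page_folio(pfn_to_page(range->pfn_start));

	return range->pfn_end - range->pfn_start;
}

This would also keep the property I mentioned above about carrying the end
rather than nr_pages: __tlb_remove_page() still only does one compare per
page, and the subtraction to get nr_pages happens once per range at flush
time.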