On 2/17/23 13:05, Alexander Halbuer wrote:
> As proposed by Vlastimil Babka [1] this is an extension to the previous patch
> "mm: reduce lock contention of pcp buffer refill". This version also moves the
> is_migrate_cma() check from the critical section to the loop below.

Hi, thanks for trying it out!

> The optimization has several advantages:
> - Less time in the critical section
> - Batching the update of NR_FREE_CMA_PAGES into a single call to
>   mod_zone_page_state()
> - Utilizing cache locality of the related struct page when doing the cma
>   check is_migrate_cma() and the sanity check check_pcp_refill() in the
>   same loop

The last point probably doesn't apply, as we are reading/modifying page->lru
with __rmqueue()/list_add_tail() anyway, so that already brings the struct
page into the cache.

> However, this change only affects performance with CONFIG_CMA=true, which
> may not be the common case. Another possibly negative effect is that the
> actual update of NR_FREE_CMA_PAGES is delayed beyond the release of the
> zone lock, resulting in a short time span of inaccuracy between the
> counter and the actual number of cma pages in the zone.

Hmm, I didn't realize that; it might perhaps be a problem.

> The tables below compare this patch with the initial one using a parallel
> allocation benchmark. The kernel config used is based on the default Debian
> config, but with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=FALSE and CONFIG_CMA=TRUE.
> The benchmarks have been performed with the default sanity checks enabled,
> as the patch "mm, page_alloc: reduce page alloc/free sanity checks" [2] was
> not applied on my test branch.
> The given times are average allocation times. The improvement is not
> significant, but the general trend is visible.

Yeah, there's some improvement, but if [2] is accepted, then keeping two loops
there just for the cma update (as there will be no more checking in the second
loop) will almost certainly stop being a win; a rough sketch of what I mean is
below, after the quoted diff. And that's on top of the risk of inaccuracy you
pointed out.

Incidentally, did you observe any improvement from [2] with your test,
especially as the batch freeing side also no longer does checking under the
zone lock?

Thanks!

> Hopefully, without sanity checks and with cma disabled, a compiler will be
> able to optimize away the second loop entirely.
>
> [1] https://lore.kernel.org/lkml/1d468148-936f-8816-eb71-1662f2d4945b@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20230216095131.17336-1-vbabka@xxxxxxx/
>
> Normal Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> |       |    (ns) |    (ns) |    Diff |
> +-------+---------+---------+---------+
> |     1 |     132 |     129 | (-2.3%) |
> |     2 |     147 |     145 | (-1.4%) |
> |     3 |     148 |     147 | (-0.7%) |
> |     4 |     175 |     173 | (-1.1%) |
> |     6 |     263 |     255 | (-3.0%) |
> |     8 |     347 |     337 | (-2.9%) |
> |    10 |     432 |     421 | (-2.5%) |
> |    12 |     516 |     505 | (-2.1%) |
> |    14 |     604 |     590 | (-2.3%) |
> |    16 |     695 |     680 | (-2.2%) |
> |    20 |     869 |     844 | (-2.9%) |
> |    24 |    1043 |    1015 | (-2.7%) |
> +-------+---------+---------+---------+
>
> Huge Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> |       |    (ns) |    (ns) |    Diff |
> +-------+---------+---------+---------+
> |     1 |    3177 |    3133 | (-1.4%) |
> |     2 |    3486 |    3471 | (-0.4%) |
> |     3 |    3644 |    3634 | (-0.3%) |
> |     4 |    3669 |    3643 | (-0.7%) |
> |     6 |    3587 |    3578 | (-0.3%) |
> |     8 |    3635 |    3621 | (-0.4%) |
> |    10 |    4015 |    3960 | (-1.4%) |
> |    12 |    4681 |    4510 | (-3.7%) |
> |    14 |    5398 |    5180 | (-4.0%) |
> |    16 |    6239 |    5891 | (-5.6%) |
> |    20 |    7864 |    7435 | (-5.5%) |
> |    24 |    9011 |    8971 | (-0.4%) |
> +-------+---------+---------+---------+
>
> Reported-by: Vlastimil Babka <vbabka@xxxxxxx>
> Signed-off-by: Alexander Halbuer <halbuer@xxxxxxxxxxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 36 insertions(+), 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..f82a59eeb4fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3119,17 +3119,17 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  {
>  	unsigned long flags;
>  	int i, allocated = 0;
> +	int cma_pages = 0;
> +	struct list_head *prev_tail = list->prev;
> +	struct page *pos, *n;
>  
>  	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
> -		struct page *page = __rmqueue(zone, order, migratetype,
> -								alloc_flags);
> +		struct page *page =
> +			__rmqueue(zone, order, migratetype, alloc_flags);
>  		if (unlikely(page == NULL))
>  			break;
>  
> -		if (unlikely(check_pcp_refill(page, order)))
> -			continue;
> -
>  		/*
>  		 * Split buddy pages returned by expand() are received here in
>  		 * physical page order. The page is added to the tail of
> @@ -3141,20 +3141,44 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -					      -(1 << order));
>  	}
>  
>  	/*
> -	 * i pages were removed from the buddy list even if some leak due
> -	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
> -	 * on i. Do not confuse with 'allocated' which is the number of
> +	 * i pages were removed from the buddy list so adjust NR_FREE_PAGES
> +	 * based on i. Do not confuse with 'allocated' which is the number of
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> +
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		/*
> +		 * Count number of cma pages to batch update of
> +		 * NR_FREE_CMA_PAGES into a single function call.
> +		 */
> +		if (is_migrate_cma(get_pcppage_migratetype(pos)))
> +			cma_pages++;
> +
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +
> +	if (cma_pages > 0) {
> +		mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> +				    -(cma_pages << order));
> +	}
> +
>  	return allocated;
>  }
>
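
To illustrate what I mean above, here is a rough, untested sketch of my own
(not part of this patch, and assuming [2] lands so check_pcp_refill() is gone
from this path): the refill could stay a single pass and still batch the CMA
counter, but apply it before the zone lock is dropped, so there is no window
where NR_FREE_CMA_PAGES disagrees with the freelists:

static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, unsigned int alloc_flags)
{
	unsigned long flags;
	int i, allocated = 0;
	int cma_pages = 0;

	spin_lock_irqsave(&zone->lock, flags);
	for (i = 0; i < count; ++i) {
		struct page *page =
			__rmqueue(zone, order, migratetype, alloc_flags);
		if (unlikely(page == NULL))
			break;

		/*
		 * No check_pcp_refill() here, assuming [2] has moved the
		 * checking out of this path entirely.
		 */
		list_add_tail(&page->pcp_list, list);
		allocated++;

		/* Still batch the CMA accounting into a single update. */
		if (is_migrate_cma(get_pcppage_migratetype(page)))
			cma_pages++;
	}

	/*
	 * i pages were removed from the buddy list, so adjust NR_FREE_PAGES
	 * and NR_FREE_CMA_PAGES while still holding the zone lock, keeping
	 * the counters consistent with the freelists.
	 */
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
	if (cma_pages)
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
				      -(cma_pages << order));
	spin_unlock_irqrestore(&zone->lock, flags);

	return allocated;
}

No measurements behind this; it's only meant to show that once the checking is
gone, the second loop isn't needed for the cma batching either.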