On 01/12/2017 11:43 AM, Mel Gorman wrote:
> Many workloads that allocate pages are not handling an interrupt at a
> time. As allocation requests may be from IRQ context, it's necessary
> to disable/enable IRQs for every page allocation. This cost is the
> bulk of the free path but also a significant percentage of the
> allocation path.
>
> This patch alters the locking and checks such that only irq-safe
> allocation requests use the per-cpu allocator. All others acquire the
> irq-safe zone->lock and allocate from the buddy allocator. It relies
> on disabling preemption to safely access the per-cpu structures. It
> could be slightly modified to avoid soft IRQs using it but it's not
> clear it's worthwhile.
>
> This modification may slow allocations from IRQ context slightly but
> the main gain from the per-cpu allocator is that it scales better for
> allocations from multiple contexts. There is an implicit assumption
> that intensive allocations from IRQ contexts on multiple CPUs from a
> single NUMA node are rare and that the vast majority of scaling issues
> are encountered in !IRQ contexts such as page faulting.
>
> It's worth noting that this patch is not required for a bulk page
> allocator but it significantly reduces the overhead.
>
> The following are results from a page allocator micro-benchmark. Only
> order-0 is interesting as higher orders do not use the per-cpu
> allocator.
<snip nice results>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Acked-by: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
> Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
Very promising! But I have some worries.

Should we put something like VM_BUG_ON(in_interrupt()) into
free_hot_cold_page() and rmqueue_pcplist() to catch future potential
misuses, and also document this requirement?

Also, free_hot_cold_page() has other call sites besides __free_pages(),
and I'm not sure all of those are guaranteed to be !IRQ. E.g.
free_hot_cold_page_list(), which is called by release_pages(), which
uses irq-safe lock operations...
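Something like this, perhaps (just a sketch on top of the patch,
untested; ditto for rmqueue_pcplist()):

@@ void free_hot_cold_page(struct page *page, bool cold)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
+	/*
+	 * The pcp lists are only protected by preempt_disable() now, so
+	 * calling this from IRQ context is no longer safe.
+	 */
+	VM_BUG_ON(in_interrupt());
 	preempt_disable();
 	count_vm_event(PGFREE);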
Smaller nit below:
> @@ -2453,8 +2450,8 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	migratetype = get_pfnblock_migratetype(page, pfn);
>  	set_pcppage_migratetype(page, migratetype);
> -	local_irq_save(flags);
> -	__count_vm_event(PGFREE);
> +	preempt_disable();
> +	count_vm_event(PGFREE);
AFAICS preempt_disable() is enough for using __count_vm_event(), no?
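For reference, IIRC the two helpers boil down to this in
include/linux/vmstat.h:

	/* caller must prevent concurrent updates, e.g. via preemption */
	static inline void __count_vm_event(enum vm_event_item item)
	{
		raw_cpu_inc(vm_event_states.event[item]);
	}

	/* irq-safe variant */
	static inline void count_vm_event(enum vm_event_item item)
	{
		this_cpu_inc(vm_event_states.event[item]);
	}

so with preemption disabled, keeping the cheaper __count_vm_event()
here should already be safe.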
> @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	struct list_head *list;
>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>  	struct page *page;
> -	unsigned long flags;
>
> -	local_irq_save(flags);
> +	preempt_disable();
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	list = &pcp->lists[migratetype];
>  	page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
> @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
But if I'm wrong above, then this __count should be converted too?
>  		zone_statistics(preferred_zone, zone, gfp_flags);
>  	}
> -	local_irq_restore(flags);
> +	preempt_enable_no_resched();
>  	return page;
>  }
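And just to spell out (mostly for my own understanding) what the
preempt_disable()/preempt_enable_no_resched() pair protects against
here - without it, something like this could happen (hypothetical
sequence, not from the patch):

	pcp = &this_cpu_ptr(zone->pageset)->pcp;  /* task runs on CPU0 */
	/* preempted here, migrated to and resumed on CPU1 */
	list = &pcp->lists[migratetype];  /* still CPU0's pcp, racing
					   * with CPU0's own accesses */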