Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests

Jesper Dangaard Brouer <brouer@xxxxxxxxxx> · Wed, 4 Jan 2017 15:20:24 +0100

On Wed,  4 Jan 2017 11:10:48 +0000 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:

> Many workloads that allocate pages are not handling an interrupt at the
> time. As allocation requests may be from IRQ context, it's necessary to
> disable/enable IRQs for every page allocation. This cost is the bulk
> of the free path but also a significant percentage of the allocation
> path.
> 
> This patch alters the locking and checks such that only irq-safe allocation
> requests use the per-cpu allocator. All others acquire the irq-safe
> zone->lock and allocate from the buddy allocator. It relies on disabling
> preemption to safely access the per-cpu structures. 

I love this idea and patch :-)

> It could be slightly
> modified to avoid soft IRQs using it but it's not clear it's worthwhile.

NICs usually refill their RX-ring from SoftIRQ context (NAPI).
Thus, we do want this optimization to work in softirq.
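
For readers following the thread, the path selection discussed above can be
modelled as a toy in a few lines of Python. This is only a sketch of the
description as quoted, not code from the patch: the names Ctx and pick_path,
and the softirq_uses_pcp switch, are hypothetical and exist only to show the
"avoid soft IRQs" variant next to the behaviour wanted for NAPI refill.

```python
from enum import Enum


class Ctx(Enum):
    """Execution context of the allocation request (hypothetical model)."""
    TASK = "task"
    SOFTIRQ = "softirq"
    HARDIRQ = "hardirq"


def pick_path(order: int, ctx: Ctx, softirq_uses_pcp: bool = True) -> str:
    """Return "pcp" (per-cpu lists, preemption disabled) or
    "buddy" (irq-safe zone->lock plus buddy allocator)."""
    if order != 0:
        # Higher orders do not use the per-cpu allocator at all.
        return "buddy"
    if ctx is Ctx.HARDIRQ:
        # Not irq-safe: fall back to zone->lock and the buddy allocator.
        return "buddy"
    if ctx is Ctx.SOFTIRQ and not softirq_uses_pcp:
        # The "slightly modified" variant that keeps softirq off the pcp lists.
        return "buddy"
    return "pcp"


for ctx in Ctx:
    print(f"{ctx.value:8s} order-0 -> {pick_path(0, ctx)}")
```

With softirq_uses_pcp=False, a NAPI RX-ring refill would take the buddy path
and lose the benefit, which is the case argued against above.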

> This modification may slow allocations from IRQ context slightly but the main
> gain from the per-cpu allocator is that it scales better for allocations
> from multiple contexts. There is an implicit assumption that intensive
> allocations from IRQ contexts on multiple CPUs from a single NUMA node are
> rare and that the vast majority of scaling issues are encountered in !IRQ
> contexts such as page faulting. 

I agree with this implicit assumption.

> It's worth noting that this patch is not
> required for a bulk page allocator but it significantly reduces the overhead.
> 
> The following are results from a page allocator micro-benchmark. Only
> order-0 is interesting as higher orders do not use the per-cpu allocator.

I'm seeing an approx. 34% reduction in an order-0 micro-benchmark [1]! Amazing! :-)
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/

>                                           4.10.0-rc2                 4.10.0-rc2
>                                              vanilla               irqsafe-v1r5
> Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
> Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
> Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
> Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
> Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
> Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
> Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
> Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
> Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
> Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
> Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
> Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
> Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
> Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
> Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
> Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
> Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
> Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
> Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
> Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
> Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
> Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
> Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
> Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
> Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
> Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
> Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
> Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
> Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
> Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
> Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
> Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
> Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
> Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
> Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
> Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
> Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
> Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
> Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
> Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
> Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
> Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
> Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
> Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
> Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
> 
> This is the alloc, free and total overhead of allocating order-0 pages in
> batches of 1 page up to 16384 pages. Avoiding the cost of disabling/enabling
> IRQs massively reduces overhead. Alloc overhead is roughly reduced by 14-20%
> in most cases. The free path is reduced by 26-46% and the total reduction
> is significant.
> 
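
The percentage column in the table is the relative reduction against the
vanilla kernel. A minimal check of that arithmetic, using three rows copied
from the table above (the helper name reduction_pct is only illustrative):

```python
def reduction_pct(vanilla: float, patched: float) -> float:
    """Relative reduction of `patched` versus `vanilla`, in percent."""
    return (vanilla - patched) / vanilla * 100.0


# (vanilla, irqsafe-v1r5) values taken from the quoted table.
rows = {
    "alloc-odr0-1": (287.15, 219.00),
    "free-odr0-1": (165.00, 89.00),
    "total-odr0-1": (452.15, 308.00),
}

for name, (vanilla, patched) in rows.items():
    print(f"{name:14s} {reduction_pct(vanilla, patched):6.2f}%")
```

This reproduces the 23.73%, 46.06% and 31.88% shown for those rows.
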
[...]
> 
> Similarly, little benefit was seen on networking benchmarks both localhost
> and between physical server/clients where other costs dominate. It's
> possible that this will only be noticeable on very high speed networks.

The networking results depend heavily on the NIC driver.  As you mention in
the cover letter, (1) some drivers (e.g. mlx4) allocate high-order pages to
work around order-0 page allocation and DMA mapping being too slow (for their
HW use case), and (2) drivers that do use order-0 pages have driver-specific
page-recycling tricks (e.g. mlx5 and ixgbe).  The page_pool work aims to
provide a more generic recycling mechanism for drivers to use.

I'm very excited to see improvements in this area! :-)))
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
