On Mon, 27 Mar 2017 03:32:47 -0400 (EDT) Pankaj Gupta <pagupta@xxxxxxxxxx> wrote:

> Hello,
>
> It looks like a race with softirq and normal process context.
>
> Just thinking if we really want allocations from 'softirqs' to be
> done using per cpu list?

Yes, softirqs need fast page allocs. The softirq use-case is refilling
the DMA RX rings, which is time critical, especially for NIC drivers.
For this reason most drivers implement different page recycling tricks.

> Or we can have some check in 'free_hot_cold_page' for softirqs
> to check if we are on a path of returning from hard interrupt don't
> allocate from per cpu list.

A possible solution would be to use local_bh_{disable,enable} instead
of the preempt_{disable,enable} calls (untested sketch below). But it
is slower: using the numbers from [1] (19 vs 11 cycles), the expected
saving over local_irq_{save,restore} shrinks from 38-11=27 cycles to
38-19=19 cycles. The problematic part of using local_bh_enable is that
it adds a softirq/bottom-halves rescheduling point (as it checks for
pending BHs), which might affect real workloads.

I'm unsure what the best option is. I'm leaning towards partly
reverting [1] and going back to the slower local_irq_save +
local_irq_restore as before. Afterwards we can add a bulk page
alloc+free call (strawman below) that amortizes the 38-cycle cost of
local_irq_{save,restore} over a whole batch of pages. Or we add a
function call that MUST only be called from contexts with IRQs
enabled, which allows using local_irq_{disable,enable}
unconditionally, as that only costs 7 cycles.

[1] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator
    for irq-safe requests")
    https://git.kernel.org/torvalds/c/374ad05ab64d
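To spell out the race I believe Tariq's crash (below) shows: in_irq()
is only true in hardirq context, while in_interrupt() also covers
softirq (and NMI). So with the in_irq() patch, softirq is allowed to
use the pcp lists too, but process context is only protected by
preempt_disable(), which does not prevent softirqs from running on
irq_exit(). My reading of the interleaving (educated guess from the
trace, not verified):

  process context (iperf)         softirq (run from irq_exit)
  -----------------------         ---------------------------
  free_hot_cold_page()
    in_irq() -> false, use pcp
    preempt_disable()
    pcp list manipulation starts
      <- hard IRQ arrives
                                  net_rx_action()
                                   mlx5e_post_rx_wqes()
                                    __alloc_pages_nodemask()
                                     in_irq() -> false, use pcp
                                     list_del() on the same,
                                     half-updated pcp list -> boom
      <- softirq done, resume
    continues on corrupted list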
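To make the local_bh idea concrete, I mean something like the
following. Completely untested sketch, written against how I remember
free_hot_cold_page() looking after [1]; rmqueue_pcplist() would need
the same treatment:

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ void free_hot_cold_page(struct page *page, bool cold)
-	preempt_disable();
+	local_bh_disable();	/* also excludes softirq pcp users */
 	[...]
-	preempt_enable();
+	local_bh_enable();	/* NB: resched point, runs pending BHs */

This closes the softirq vs process context race, but it turns every
page free into a potential softirq entry point via local_bh_enable(),
which is the part that worries me for real workloads.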
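And a strawman of the bulk alloc side, just to show where the
amortization comes from. The function name, signature and body are
invented here -- nothing like this exists in the tree -- and page
prep/checks plus the buddy fallback are omitted:

/* Strawman: pay the ~38 cycle local_irq_{save,restore} cost once per
 * batch of order-0 pages, e.g. when a NIC driver refills its RX ring,
 * instead of once per page.
 */
static unsigned int alloc_pages_bulk(struct zone *zone, int migratetype,
				     unsigned int nr_pages,
				     struct list_head *page_list)
{
	struct per_cpu_pages *pcp;
	unsigned int allocated = 0;
	unsigned long flags;

	local_irq_save(flags);
	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	while (allocated < nr_pages) {
		struct page *page;

		if (list_empty(&pcp->lists[migratetype]))
			break;	/* real version: refill from buddy here */
		page = list_first_entry(&pcp->lists[migratetype],
					struct page, lru);
		list_del(&page->lru);
		pcp->count--;
		list_add_tail(&page->lru, page_list);
		allocated++;
	}
	local_irq_restore(flags);
	return allocated;
}

A driver could then refill an entire RX ring with a single
local_irq_{save,restore} round-trip, and a matching bulk free call
would do the same on the free side.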
> > On 26/03/2017 11:21 AM, Tariq Toukan wrote:
> > >
> > >
> > > On 23/03/2017 4:51 PM, Mel Gorman wrote:
> > >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
> > >>> On Wed, 22 Mar 2017 23:40:04 +0000
> > >>> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > >>>>>>>> This modification may slow allocations from IRQ context slightly
> > >>>>>>>> but the main gain from the per-cpu allocator is that it scales
> > >>>>>>>> better for allocations from multiple contexts. There is an
> > >>>>>>>> implicit assumption that intensive allocations from IRQ contexts
> > >>>>>>>> on multiple CPUs from a single NUMA node are rare
> > >>>>> Hi Mel, Jesper, and all.
> > >>>>>
> > >>>>> This assumption contradicts regular multi-stream traffic that is
> > >>>>> naturally handled over close numa cores. I compared iperf TCP
> > >>>>> multistream (8 streams) over CX4 (mlx5 driver) with kernels v4.10
> > >>>>> (before this series) vs kernel v4.11-rc1 (with this series).
> > >>>>> I disabled the page-cache (recycle) mechanism to stress the page
> > >>>>> allocator, and see a drastic degradation in BW, from 47.5 G in
> > >>>>> v4.10 to 31.4 G in v4.11-rc1 (34% drop).
> > >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.
> > >>>>
> > >>>> Can you get the stack trace for the spin lock slowpath to confirm it's
> > >>>> from IRQ context?
> > >>>
> > >>> AFAIK allocations happen in softirq. Argh and during review I missed
> > >>> that in_interrupt() also covers softirq. To Mel, can we use a in_irq()
> > >>> check instead?
> > >>>
> > >>> (p.s. just landed and got home)
> > >
> > > Glad to hear. Thanks for your suggestion.
> > >
> > >>
> > >> Not built or even boot tested. I'm unable to run tests at the moment
> > >
> > > Thanks Mel, I will test it soon.
> > >
> > Crashed in iperf single stream test:
> >
> > [ 3974.123386] ------------[ cut here ]------------
> > [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53 __list_del_entry_valid+0xa3/0xd0
> > [ 3974.138751] list_del corruption. prev->next should be ffffea0040369c60, but was dead000000000100
> > [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror
> > dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal
> > coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul
> > iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support
> > sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss
> > nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom
> > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> > fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp
> > megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
> > [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
> > [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
> > [ 3974.228974] Call Trace:
> > [ 3974.231925]  <IRQ>
> > [ 3974.234405]  dump_stack+0x63/0x8c
> > [ 3974.238355]  __warn+0xd1/0xf0
> > [ 3974.241891]  warn_slowpath_fmt+0x4f/0x60
> > [ 3974.246494]  __list_del_entry_valid+0xa3/0xd0
> > [ 3974.251583]  get_page_from_freelist+0x84c/0xb40
> > [ 3974.256868]  ? napi_gro_receive+0x38/0x140
> > [ 3974.261666]  __alloc_pages_nodemask+0xca/0x200
> > [ 3974.266866]  mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
> > [ 3974.272862]  mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
> > [ 3974.278725]  mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
> > [ 3974.284409]  net_rx_action+0x23d/0x3a0
> > [ 3974.288819]  __do_softirq+0xd1/0x2a2
> > [ 3974.293054]  irq_exit+0xb5/0xc0
> > [ 3974.296783]  do_IRQ+0x51/0xd0
> > [ 3974.300353]  common_interrupt+0x89/0x89
> > [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
> > [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffa8
> > [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX: ffffea0040369c60
> > [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI: ffffea0040369ca0
> > [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09: 00000000000005a8
> > [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12: ffffea0040369c80
> > [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15: ffffea0040369ca0
> > [ 3974.360577]  </IRQ>
> > [ 3974.363145]  __put_page+0x34/0x40
> > [ 3974.367068]  skb_release_data+0xca/0xe0
> > [ 3974.371575]  skb_release_all+0x24/0x30
> > [ 3974.375984]  __kfree_skb+0x12/0x20
> > [ 3974.380003]  tcp_recvmsg+0x6ac/0xaf0
> > [ 3974.384251]  inet_recvmsg+0x3c/0xa0
> > [ 3974.388394]  sock_recvmsg+0x3d/0x50
> > [ 3974.392511]  SYSC_recvfrom+0xd3/0x140
> > [ 3974.396826]  ? handle_mm_fault+0xce/0x240
> > [ 3974.401535]  ? SyS_futex+0x71/0x150
> > [ 3974.405653]  SyS_recvfrom+0xe/0x10
> > [ 3974.409673]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [ 3974.415056] RIP: 0033:0x7f04ca9315bb
> > [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
> > [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f04ca9315bb
> > [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI: 0000000000000004
> > [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f04bc0008f0
> > [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15: 000000000001c480
> > [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---
> >
> > >>
> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >> index 6cbde310abed..f82225725bc1 100644
> > >> --- a/mm/page_alloc.c
> > >> +++ b/mm/page_alloc.c
> > >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
> > >>  	unsigned long pfn = page_to_pfn(page);
> > >>  	int migratetype;
> > >>
> > >> -	if (in_interrupt()) {
> > >> +	if (in_irq()) {
> > >>  		__free_pages_ok(page, 0);
> > >>  		return;
> > >>  	}
> > >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
> > >>  {
> > >>  	struct page *page;
> > >>
> > >> -	VM_BUG_ON(in_interrupt());
> > >> +	VM_BUG_ON(in_irq());
> > >>
> > >>  	do {
> > >>  		if (list_empty(list)) {
> > >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >>  	unsigned long flags;
> > >>  	struct page *page;
> > >>
> > >> -	if (likely(order == 0) && !in_interrupt()) {
> > >> +	if (likely(order == 0) && !in_irq()) {
> > >>  		page = rmqueue_pcplist(preferred_zone, zone, order,
> > >>  					gfp_flags, migratetype);
> > >>  		goto out;
> >

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer