Hello, It looks like a race with softirq and normal process context. Just thinking if we really want allocations from 'softirqs' to be done using per cpu list? Or we can have some check in 'free_hot_cold_page' for softirqs to check if we are on a path of returning from hard interrupt don't allocate from per cpu list. Thanks, Pankaj > > > > On 26/03/2017 11:21 AM, Tariq Toukan wrote: > > > > > > On 23/03/2017 4:51 PM, Mel Gorman wrote: > >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote: > >>> On Wed, 22 Mar 2017 23:40:04 +0000 > >>> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote: > >>> > >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote: > >>>>>>>> This modification may slow allocations from IRQ context slightly > >>>>>>>> but the > >>>>>>>> main gain from the per-cpu allocator is that it scales better for > >>>>>>>> allocations from multiple contexts. There is an implicit > >>>>>>>> assumption that > >>>>>>>> intensive allocations from IRQ contexts on multiple CPUs from a > >>>>>>>> single > >>>>>>>> NUMA node are rare > >>>>> Hi Mel, Jesper, and all. > >>>>> > >>>>> This assumption contradicts regular multi-stream traffic that is > >>>>> naturally > >>>>> handled > >>>>> over close numa cores. I compared iperf TCP multistream (8 streams) > >>>>> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs > >>>>> kernel v4.11-rc1 (with this series). > >>>>> I disabled the page-cache (recycle) mechanism to stress the page > >>>>> allocator, > >>>>> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in > >>>>> v4.11-rc1 (34% drop). > >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time. > >>>> > >>>> Can you get the stack trace for the spin lock slowpath to confirm it's > >>>> from IRQ context? > >>> > >>> AFAIK allocations happen in softirq. Argh and during review I missed > >>> that in_interrupt() also covers softirq. To Mel, can we use a in_irq() > >>> check instead? > >>> > >>> (p.s. just landed and got home) > > > > Glad to hear. Thanks for your suggestion. > > > >> > >> Not built or even boot tested. I'm unable to run tests at the moment > > > > Thanks Mel, I will test it soon. > > > Crashed in iperf single stream test: > > [ 3974.123386] ------------[ cut here ]------------ > [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53 > __list_del_entry_valid+0xa3/0xd0 > [ 3974.138751] list_del corruption. prev->next should be > ffffea0040369c60, but was dead000000000100 > [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror > dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal > coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul > iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support > sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss > nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt > fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp > megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole] > [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30 > [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS > 1.5.4 10/002/2015 > [ 3974.228974] Call Trace: > [ 3974.231925] <IRQ> > [ 3974.234405] dump_stack+0x63/0x8c > [ 3974.238355] __warn+0xd1/0xf0 > [ 3974.241891] warn_slowpath_fmt+0x4f/0x60 > [ 3974.246494] __list_del_entry_valid+0xa3/0xd0 > [ 3974.251583] get_page_from_freelist+0x84c/0xb40 > [ 3974.256868] ? napi_gro_receive+0x38/0x140 > [ 3974.261666] __alloc_pages_nodemask+0xca/0x200 > [ 3974.266866] mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core] > [ 3974.272862] mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core] > [ 3974.278725] mlx5e_napi_poll+0xc7/0x450 [mlx5_core] > [ 3974.284409] net_rx_action+0x23d/0x3a0 > [ 3974.288819] __do_softirq+0xd1/0x2a2 > [ 3974.293054] irq_exit+0xb5/0xc0 > [ 3974.296783] do_IRQ+0x51/0xd0 > [ 3974.300353] common_interrupt+0x89/0x89 > [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280 > [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX: > ffffffffffffffa8 > [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX: > ffffea0040369c60 > [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI: > ffffea0040369ca0 > [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09: > 00000000000005a8 > [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12: > ffffea0040369c80 > [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15: > ffffea0040369ca0 > [ 3974.360577] </IRQ> > [ 3974.363145] __put_page+0x34/0x40 > [ 3974.367068] skb_release_data+0xca/0xe0 > [ 3974.371575] skb_release_all+0x24/0x30 > [ 3974.375984] __kfree_skb+0x12/0x20 > [ 3974.380003] tcp_recvmsg+0x6ac/0xaf0 > [ 3974.384251] inet_recvmsg+0x3c/0xa0 > [ 3974.388394] sock_recvmsg+0x3d/0x50 > [ 3974.392511] SYSC_recvfrom+0xd3/0x140 > [ 3974.396826] ? handle_mm_fault+0xce/0x240 > [ 3974.401535] ? SyS_futex+0x71/0x150 > [ 3974.405653] SyS_recvfrom+0xe/0x10 > [ 3974.409673] entry_SYSCALL_64_fastpath+0x1a/0xa9 > [ 3974.415056] RIP: 0033:0x7f04ca9315bb > [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX: > 000000000000002d > [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: > 00007f04ca9315bb > [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI: > 0000000000000004 > [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09: > 0000000000000000 > [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12: > 00007f04bc0008f0 > [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15: > 000000000001c480 > [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]--- > > > >> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >> index 6cbde310abed..f82225725bc1 100644 > >> --- a/mm/page_alloc.c > >> +++ b/mm/page_alloc.c > >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool > >> cold) > >> unsigned long pfn = page_to_pfn(page); > >> int migratetype; > >> > >> - if (in_interrupt()) { > >> + if (in_irq()) { > >> __free_pages_ok(page, 0); > >> return; > >> } > >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct > >> zone *zone, int migratetype, > >> { > >> struct page *page; > >> > >> - VM_BUG_ON(in_interrupt()); > >> + VM_BUG_ON(in_irq()); > >> > >> do { > >> if (list_empty(list)) { > >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone, > >> unsigned long flags; > >> struct page *page; > >> > >> - if (likely(order == 0) && !in_interrupt()) { > >> + if (likely(order == 0) && !in_irq()) { > >> page = rmqueue_pcplist(preferred_zone, zone, order, > >> gfp_flags, migratetype); > >> goto out; > >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>