Hi Dexuan, On Thu, Aug 28, 2014 at 03:21:59AM +0000, Dexuan Cui wrote: > > > > First let me thank you guys for looking into this issue. Looking at Feel free to add Reported-by: Sitsofe Wheeler <sitsofe@xxxxxxxxx> to your patches (it's useful too me because it makes it easier for me to show what I've been doing to others :-) > > > The only issue seen on boot now is similar to > > > https://lkml.org/lkml/2014/8/19/227 ... > I don't see this issue. Do you still see the issue for EVERY boot > after you applied KY's always-use-page-allocation patch? I doubt that > because in the log of the above link: I think it depends on if I do a UP or SMP boot. With f1bd473f95e02bc382d4dae94d7f82e2a455e05d (post v3.17-rc2) with the V2 BUG_ON patch set coupled with the allocation change patch set a UP boot was able to run a small bunch of CPU and network stress tests without any issue. However, when doing an SMP boot the following happened: [ OK ] Started Getty on tty1. [ OK ] Reached target Login Prompts. [ OK ] Started Login Service. [ 36.023957] hv_netvsc vmbus_0_15: net device safe to remove [ 36.078386] hv_netvsc: hv_netvsc channel opened successfully Fedora release 20 ([ 36.964636] hv_netvsc vmbus_0_15: Send section size: 6144, Section count:2560 [ 37.014857] hv_netvsc vmbus_0_15: Device MAC 00:15:5d:6f:02:a5 link state up Heisenbug)[ 37.180170] BUG: unable to handle kernel pKageing reqruestn at ffff8801f2ec4068 [ 37.181156] IP: el[ 3.1<7.0-ffrc2.x8ffffff814e77dc>] netvsc_start_xmit+0x6ac/0x7c0 [ 6_ 64- 00044-g0bb037.181156] PGD 2db0067 PUa5Dd on 2 an x86_6407dc0067 PM (ttyDS0) 207c280ara67refedora PTE 8 0000001f2eclo4gin0: 60 [ 37.181156] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 37.181156] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.17.0-rc2.x86_64-00044-g0bb0a5d #132 [ 37.181156] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 37.181156] task: ffff8801fb27b9f0 ti: ffff8801fb310000 task.ti: ffff8801fb310000 [ 37.181156] RIP: 0010:[<ffffffff814e77dc>] [<ffffffff814e77dc>] netvsc_start_xmit+0x6ac/0x7c0 [ 37.181156] RSP: 0018:ffff880206c43960 EFLAGS: 00010246 [ 37.181156] RAX: 0000000000000000 RBX: ffff8801f2e9f2d8 RCX: 000000000007f000 [ 37.181156] RDX: 000000000007da20 RSI: 0000000000034f70 RDI: 000000000007efc8 [ 37.181156] RBP: ffff880206c439a8 R08: 0000000000000000 R09: 0000000000000000 [ 37.181156] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 [ 37.181156] R13: ffff8801f2ec4000 R14: ffff8800f1335680 R15: ffff8801f2e9c4ea [ 37.181156] FS: 0000000000000000(0000) GS:ffff880206c40000(0000) knlGS:0000000000000000 [ 37.181156] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 37.181156] CR2: ffff8801f2ec4068 CR3: 0000000001c0e000 CR4: 00000000000406e0 [ 37.181156] Stack: [ 37.181156] ffff8801f2e9f344 ffffffff00000062 ffff8800f1335680 0000000081d23300 [ 37.181156] 0000000000000062 ffff8800f1335680 0000000400014883 ffff8800ecc56fa0 [ 37.181156] ffff8801f2ec4000 ffff880206c43a00 ffffffff815ce458 ffff8800f13311c8 [ 37.181156] Call Trace: [ 37.181156] <IRQ> [ 37.181156] [<ffffffff815ce458>] dev_hard_start_xmit+0x348/0x630 [ 37.181156] [<ffffffff815ef5aa>] sch_direct_xmit+0x7a/0x290 That's all I got - all my consoles froze and not even sysrq worked any more. Here's another splat that also just happened: [ OK ] Started NTP client/server. [ OK ] Started Login Service. [ 32.558951] hv_netvsc vmbus_0_15: net device safe to remove [ 32.599919] hv_netvsc: hv_netvsc channel opened successfully [ 33.728294] hv_netvsc vmbus_0_15: Send section size: 6144, Section count:2560 [ 33.782371] hv_netvsc vmbus_0_15: Device MAC 00:15:5d:6f:02:a5 link state up Fedora release 20 (Heisenbug) Kernel 3.17.0-rc2.x86_64-00044-g0bb0a5d on an x86_64 (ttyS0) ararefedora login: [ 35.301491] BUG: unable to handle kernel paging request at ffff8800ef0df973 [ 35.302392] IP: [<ffffffff814e701d>] netvsc_select_queue+0x3d/0x150 [ 35.302392] PGD 2db0067 PUD 2075be067 PMD 207445067 PTE 80000000ef0df060 [ 35.302392] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 35.302392] CPU: 1 PID: 723 Comm: arping Not tainted 3.17.0-rc2.x86_64-00044-g0bb0a5d #132 [ 35.302392] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 35.302392] task: ffff8800eea68000 ti: ffff8801f5af0000 task.ti: ffff8801f5af0000 [ 35.302392] RIP: 0010:[<ffffffff814e701d>] [<ffffffff814e701d>] netvsc_select_queue+0x3d/0x150 [ 35.302392] RSP: 0018:ffff8801f5af3c60 EFLAGS: 00010206 [ 35.302392] RAX: 0000000000000000 RBX: ffff8800f0c49160 RCX: 000000000000ffff [ 35.302392] RDX: ffff8800ef0cf968 RSI: ffff8801fa42a1c0 RDI: ffff8800f0c49160 [ 35.302392] RBP: ffff8801f5af3c88 R08: 000000000000002a R09: 0000000000000000 [ 35.302392] R10: ffff8801f372b3d8 R11: 000000000000000a R12: ffff8801fa42a1c0 [ 35.302392] R13: 0000000000000000 R14: ffff8801f3832e68 R15: ffff8801fa42a1c0 [ 35.302392] FS: 00007f3359e96740(0000) GS:ffff880206c20000(0000) knlGS:0000000000000000 [ 36.090561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.090561] CR2: ffff8800ef0df973 CR3: 00000000ef1fe000 CR4: 00000000000406e0 [ 36.090561] Stack: [ 36.090561] ffffffff8167f591 ffff8800f0c49160 000000000000001c 0000000000000000 [ 36.090561] ffff8801f3832e68 ffff8801f5af3d48 ffffffff816832fc ffff8800eea68740 [ 36.090561] 0000000000000000 ffff8801f5af3d68 0000000000000046 000000000000001c [ 36.090561] Call Trace: [ 36.090561] [<ffffffff8167f591>] ? packet_pick_tx_queue+0x31/0xa0 [ 36.090561] [<ffffffff816832fc>] packet_sendmsg+0xc1c/0xdd0 [ 36.090561] [<ffffffff810bd106>] ? lock_release_non_nested+0xc6/0x330 [ 36.090561] [<ffffffff815b42a8>] sock_sendmsg+0x88/0xb0 [ 36.090561] [<ffffffff81185443>] ? might_fault+0xa3/0xb0 [ 36.090561] [<ffffffff811853fa>] ? might_fault+0x5a/0xb0 [ 36.090561] [<ffffffff815b43de>] SYSC_sendto+0x10e/0x150 [ 36.090561] [<ffffffff811853fa>] ? might_fault+0x5a/0xb0 [ 36.090561] [<ffffffff816a3215>] ? sysret_check+0x22/0x5d [ 36.090561] [<ffffffff810b97fd>] ? trace_hardirqs_on_caller+0x17d/0x210 [ 36.090561] [<ffffffff8139c09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 36.090561] [<ffffffff815b53be>] SyS_sendto+0xe/0x10 [ 36.090561] [<ffffffff816a31e9>] system_call_fastpath+0x16/0x1b [ 36.090561] Code: 00 4d 85 d2 0f 84 1c 01 00 00 44 8b 9f 8c 03 00 00 31 c0 41 83 fb 01 0f 86 1b 01 00 00 0f b7 8e b4 00 00 00 48 8b 96 c0 00 00 00 <66> 83 7c 0a 0c 08 0f 85 01 01 00 00 55 48 89 e5 41 55 41 54 53 [ 36.090561] RIP [<ffffffff814e701d>] netvsc_select_queue+0x3d/0x150 [ 36.090561] RSP <ffff8801f5af3c60> [ 36.090561] CR2: ffff8800ef0df973 [ 36.090561] ---[ end trace 7e294eb7f2c54206 ]--- [ 36.090561] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41 [ 36.090561] in_atomic(): 0, irqs_disabled(): 1, pid: 723, name: arping [ 36.090561] INFO: lockdep is turned off. [ 36.090561] irq event stamp: 4692 [ 36.090561] hardirqs last enabled at (4691): [<ffffffff81698bf4>] __slab_alloc+0x50b/0x576 [ 36.090561] hardirqs last disabled at (4692): [<ffffffff816a5266>] error_sti+0x5/0x6 [ 36.090561] softirqs last enabled at (4542): [<ffffffff815cecf0>] __dev_queue_xmit+0x5b0/0x690 [ 36.090561] softirqs last disabled at (4508): [<ffffffff815ce798>] __dev_queue_xmit+0x58/0x690 [ 36.090561] CPU: 1 PID: 723 Comm: arping Tainted: G D 3.17.0-rc2.x86_64-00044-g0bb0a5d #132 [ 36.090561] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 36.090561] 0000000000000046 ffff8801f5af38e0 ffffffff8169a64b ffff8800eea68000 [ 36.090561] ffff8801f5af38f8 ffffffff8109ec65 ffff8801f3bc5c18 ffff8801f5af3918 [ 36.090561] ffffffff816a0c84 ffffffff81090f38 ffff8800eea68000 ffff8801f5af3938 [ 36.090561] Call Trace: [ 36.090561] [<ffffffff8169a64b>] dump_stack+0x4d/0x66 [ 36.090561] [<ffffffff8109ec65>] __might_sleep+0x115/0x120 [ 36.090561] [<ffffffff816a0c84>] down_read+0x24/0x70 [ 36.090561] [<ffffffff81090f38>] ? __validate_process_creds+0xd8/0xf0 [ 36.090561] [<ffffffff8107f9d4>] exit_signals+0x24/0x140 [ 36.090561] [<ffffffff810737d9>] do_exit+0x129/0xa20 [ 36.090561] [<ffffffff810c4bcc>] ? kmsg_dump+0xfc/0x110 [ 36.090561] [<ffffffff810c4af5>] ? kmsg_dump+0x25/0x110 [ 36.090561] [<ffffffff81006348>] oops_end+0xa8/0xc0 [ 36.090561] [<ffffffff816951c8>] no_context+0x322/0x36b [ 36.090561] [<ffffffff810b97fd>] ? trace_hardirqs_on_caller+0x17d/0x210 [ 36.090561] [<ffffffff816953dc>] __bad_area_nosemaphore+0x1cb/0x1e8 [ 36.090561] [<ffffffff810b97fd>] ? trace_hardirqs_on_caller+0x17d/0x210 [ 36.090561] [<ffffffff8169540c>] bad_area_nosemaphore+0x13/0x15 [ 36.090561] [<ffffffff8104040e>] __do_page_fault+0x1ee/0x4f0 [ 36.090561] [<ffffffff815bccae>] ? __alloc_skb+0x4e/0x240 [ 36.090561] [<ffffffff810bd106>] ? lock_release_non_nested+0xc6/0x330 [ 36.090561] [<ffffffff8139c0dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 36.090561] [<ffffffff81040762>] do_page_fault+0x22/0x30 [ 36.090561] [<ffffffff816a5048>] page_fault+0x28/0x30 [ 36.090561] [<ffffffff814e701d>] ? netvsc_select_queue+0x3d/0x150 [ 36.090561] [<ffffffff8167f591>] ? packet_pick_tx_queue+0x31/0xa0 [ 36.090561] [<ffffffff816832fc>] packet_sendmsg+0xc1c/0xdd0 [ 36.090561] [<ffffffff810bd106>] ? lock_release_non_nested+0xc6/0x330 [ 36.090561] [<ffffffff815b42a8>] sock_sendmsg+0x88/0xb0 [ 36.090561] [<ffffffff81185443>] ? might_fault+0xa3/0xb0 [ 36.090561] [<ffffffff811853fa>] ? might_fault+0x5a/0xb0 [ 36.090561] [<ffffffff815b43de>] SYSC_sendto+0x10e/0x150 [ 36.090561] [<ffffffff811853fa>] ? might_fault+0x5a/0xb0 [ 36.090561] [<ffffffff816a3215>] ? sysret_check+0x22/0x5d [ 36.090561] [<ffffffff810b97fd>] ? trace_hardirqs_on_caller+0x17d/0x210 [ 36.090561] [<ffffffff8139c09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 36.090561] [<ffffffff815b53be>] SyS_sendto+0xe/0x10 [ 36.090561] [<ffffffff816a31e9>] system_call_fastpath+0x16/0x1b We can spin these off into a different thread if that would be helpful. > [ 34.628072] hv_netvsc vmbus_0_15: net device safe to remove > [ 34.676573] hv_netvsc: hv_netvsc channel opened successfully > [ 34.860292] hv_netvsc vmbus_0_15 eth1: unable to establish send buffer's gpadl > [ 34.948983] hv_netvsc vmbus_0_15 eth1: unable to connect to NetVSP - 4 > > Here the 4 is just HV_STATUS_INVALID_ALIGNMENT -- it should be fixed > by the patch. I've double checked and I don't see that message any more so I must have been mistaken. > > That is good to hear. I was under the impression that this issue would be > > resolved with all the cleanup we have done. The last patch-set I posted > > earlier today has the fix for vmbus_open bug that Dexuan had identified. > > > > If you could try with the BUG_ON elimination patch-set I sent out earlier > > today with the fix in hv.c that I had sent that would be great. I've switched to it (see above) and a number of issues have been resolved. > > > How come previous alignment efforts weren't working out? > I'm not sure. > If we trust the hypervisor, I would guess in hv_post_message() > 1) We'd better add "aligned_msg->reserved = 0;" > 2) Should we make sure "aligned_msg->payload_size % 8 == 0"? IMO > aligned_msg->payload is an array of 8-byte. In that case why would payload_size not be a multiple of 8 - can it change due to debug padding? If so wouldn't its start have had to be misaligned? -- Sitsofe | http://sucs.org/~sits/ _______________________________________________ devel mailing list devel@xxxxxxxxxxxxxxxxxxxxxx http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel