Nathan,

Thanks for the report. Would you mind re-posting this to the xen-users
mailing list? You're much more likely to find someone there who's seen such
a bug before.

 -George

On Tue, Nov 7, 2017 at 11:12 PM, Nathan March <nathan@xxxxxx> wrote:
> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
> stability issues on our hypervisors. I'm not clear whether there's a single
> root cause here or whether I'm dealing with multiple bugs.
>
> One of the more common ones I've seen: on shutdown, a VM will remain in
> the (null) state and a kernel BUG is thrown:
>
> xen001 log # xl list
> Name                  ID   Mem VCPUs      State   Time(s)
> Domain-0               0  6144    24     r-----    6639.7
> (null)                 3     0     1     --pscd      36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
> [89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.839933] PGD 2008067
> [89920.840022] PUD 17f43f067
> [89920.840390] PMD 1e0976067
> [89920.840469] PTE 0
> [89920.840833]
> [89920.841123] Oops: 0000 [#1] SMP
> [89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables
> arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss
> nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding
> xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn
> xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler
> joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb
> dca ptp pps_core uas usb_storage wmi ttm
> [89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
> [89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
> [89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
> [89920.848192] RIP: e030:[<ffffffff81430922>] [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.848783] RSP: e02b:ffffc900460e3b20 EFLAGS: 00010246
> [89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
> [89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
> [89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
> [89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
> [89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
> [89920.850605] FS:  00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
> [89920.851118] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
> [89920.851720] Stack:
> [89920.852009]  ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
> [89920.852821]  ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
> [89920.853633]  ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
> [89920.854445] Call Trace:
> [89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
> [89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
> [89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
> [89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
> [89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
> [89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
> [89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
> [89920.856899]  [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
> [89920.857202]  [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
> [89920.857505]  [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
> [89920.857808]  [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
> [89920.858108]  [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
> [89920.858411]  [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
> [89920.858713]  [<ffffffff810c63f5>] kthread+0xe5/0x100
> [89920.859014]  [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
> [89920.859317]  [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
> [89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
> [89920.864410] RIP  [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.864749]  RSP <ffffc900460e3b20>
> [89920.865021] CR2: ffff88020ee9a000
> [89920.865294] ---[ end trace b77d2ce5646284d1 ]---
>
> Wondering if anyone has advice on how to troubleshoot the above, or might
> have some insight into what the issue could be? This hypervisor was only up
> for a day and had almost no VMs running on it since boot; I booted a single
> Windows test VM, which BSODed, and then this happened.
>
> This is on xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
> issues across a wide number of systems from both Dell and Supermicro,
> although we run the same Intel X540 10GbE NICs in each system with the same
> NetApp NFS backend storage.
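[Editor's note: as a starting point for troubleshooting an oops like this, the faulting "RIP __memcpy+0x12/0x20" can be resolved back to a source line against a debuginfo kernel. A minimal sketch; the debuginfo path is an assumption, adjust for however kernel-debuginfo is packaged on your systems:]

```shell
# Sketch: recover the symbol's start address from the oops line
# "RIP [<ffffffff81430922>] __memcpy+0x12/0x20" (RIP minus the offset).
rip=0xffffffff81430922   # faulting RIP from the oops
off=0x12                 # offset into __memcpy
printf 'symbol start: %#x\n' $((rip - off))

# With a matching debuginfo vmlinux installed, something like:
#   addr2line -e /usr/lib/debug/lib/modules/4.9.58-29.el6.x86_64/vmlinux $rip
#   gdb -batch -ex 'list *(__memcpy+0x12)' /usr/lib/debug/.../vmlinux
```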
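[Editor's note: on the stuck domain itself, a small sketch (not from the original thread) for picking the "(null)" entries out of `xl list` output so their domids can be handed to `xl destroy`; the column layout is assumed from the listing above, and destroy may still fail if the domain is wedged holding references in the hypervisor:]

```shell
# Print the domid (column 2) of every domain xl lists as "(null)".
list_null_domids() {
    awk '$1 == "(null)" { print $2 }'
}

# Demo against captured output; in practice: xl list | list_null_domids
sample='Domain-0                 0  6144    24  r-----   6639.7
(null)                   3     0     1  --pscd     36.3'
printf '%s\n' "$sample" | list_null_domids    # prints: 3

# Then, per domid (may fail if the domain is truly wedged):
#   xl destroy 3
```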
>
> Cheers,
> Nathan
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt@xxxxxxxxxx
> https://lists.centos.org/mailman/listinfo/centos-virt