Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

"Nathan March" <nathan@xxxxxx> · Tue, 7 Nov 2017 15:12:10 -0800

Since moving from Xen 4.4 to 4.6, I’ve been seeing an increasing number of stability issues on our hypervisors. I’m not sure whether there is a single root cause here or whether I’m dealing with multiple bugs.

One of the more common ones I’ve seen is that a VM remains in the (null) state after shutdown and a kernel bug is thrown:

xen001 log # xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3

[89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
[89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.839933] PGD 2008067 
[89920.840022] PUD 17f43f067 
[89920.840390] PMD 1e0976067 
[89920.840469] PTE 0
[89920.840833] 
[89920.841123] Oops: 0000 [#1] SMP
[89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
[89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
[89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
[89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
[89920.848192] RIP: e030:[<ffffffff81430922>]  [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.848783] RSP: e02b:ffffc900460e3b20  EFLAGS: 00010246
[89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
[89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
[89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
[89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
[89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
[89920.850605] FS:  00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
[89920.851118] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
[89920.851720] Stack:
[89920.852009]  ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
[89920.852821]  ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
[89920.853633]  ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
[89920.854445] Call Trace:
[89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.856899]  [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
[89920.857202]  [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
[89920.857505]  [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
[89920.857808]  [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
[89920.858108]  [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[89920.858411]  [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
[89920.858713]  [<ffffffff810c63f5>] kthread+0xe5/0x100
[89920.859014]  [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
[89920.859317]  [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
[89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 
[89920.864410] RIP  [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.864749]  RSP <ffffc900460e3b20>
[89920.865021] CR2: ffff88020ee9a000
[89920.865294] ---[ end trace b77d2ce5646284d1 ]---
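
To spot this condition across several hosts, the `xl list` output above can be scanned for domains whose name shows as (null). The following is only a minimal sketch: the function name `find_null_domains` is hypothetical, it assumes the six-column layout shown above, and it would misparse domain names that contain spaces.

```python
def find_null_domains(xl_list_output):
    """Return (domain_id, state) pairs for domains listed as "(null)"."""
    stuck = []
    for line in xl_list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Assumed columns: Name ID Mem VCPUs State Time(s)
        if len(fields) == 6 and fields[0] == "(null)":
            stuck.append((int(fields[1]), fields[4]))
    return stuck


# Sample text copied from the xl list output in this report.
sample = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3
"""

print(find_null_domains(sample))  # [(3, '--pscd')]
```

In practice the input would come from running `xl list` on each hypervisor; here it is fed the sample text so the sketch stays self-contained.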

Does anyone have advice on how to troubleshoot the above, or some insight into what the issue could be? This hypervisor had only been up for a day and had run almost no VMs since boot. I booted a single Windows test VM, which BSOD’ed, and then this happened.
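
For comparing traces between hosts, it may help to reduce each oops to its reliable call-trace frames. Entries prefixed with "?" are addresses found on the stack that the kernel could not confirm as part of the call chain, so this sketch drops them. The function name `reliable_frames` and the regular expression are assumptions made for illustration, matched only against the 4.9-style trace format shown above.

```python
import re

# Matches e.g. "[<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0"
# Group 1 is the optional "? " marker, group 2 is the symbol name.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(oops_text):
    """Return symbol names from the Call Trace section, skipping '?' entries."""
    frames = []
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match is None:
            break  # first non-frame line ends the trace section
        if match.group(1) is None:
            frames.append(match.group(2))
    return frames


# Excerpt copied from the trace in this report.
excerpt = """\
[89920.854445] Call Trace:
[89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.859615] Code: 81 f3 00 00
"""

print(" <- ".join(reliable_frames(excerpt)))
```

On the excerpt this leaves the path from `lo_write_bvec` through `nfs_file_write` down to `iov_iter_copy_from_user_atomic`, that is, a loop device writing to a file on NFS.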

This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these issues across a wide range of systems from both Dell and Supermicro, although we run the same Intel X540 10Gb NICs in each system with the same NetApp NFS backend storage.

Cheers,
Nathan
_______________________________________________
CentOS-virt mailing list
CentOS-virt@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos-virt