Hello,
I have a server with onboard Intel 10G ports (82599). When I load the kernel
module driver for these ports, everything is fine: I can see the newly
created ethX devices using "ip addr show". However, after I assign an IP
address and then issue the command to bring up the port, I get a kernel
panic related to DMAR (DMA remapping) in the VFIO (Virtual Function I/O)
module. I am not even sure why I am getting this panic, since this Intel
kernel module does not use VFIO. I do know the immediate cause: NULL is
passed as the argument to vfio_group_get(), which dereferences it. I know
NULL is passed because register RDI, which carries the first argument to a
function on x86-64, contains 0.
Kernel version: Linux 3.6.11
The following is the stack trace of the panic:
[11036.855410] BUG: unable to handle kernel NULL pointer dereference at (null)
[11036.887249] ixgbe 0000:84:00.0: eth6: detected SFP+: 3
[11037.010224] IP: [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11037.085047] PGD 1fd6b5b067 PUD 20404b1067 PMD 0
[11037.140181] Oops: 0000 [#1] SMP
[11037.178676] Modules linked in: ixgbe(O) nfsv3 autofs4 nfsd nfs_acl nfs lockd sunrpc vfio_pci vfio_iommu_type1 vfio i2c_mux i2c_smbus i2c_dev container ide_pci_generic ide_core uhci_hcd isci ata_generic
[11037.393137] CPU 0
[11037.414974] Pid: 14045, comm: kworker/0:0 Tainted: G O 3.6.11
[11037.539628] RIP: 0010:[<ffffffffa006615a>] [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11037.643521] RSP: 0018:ffff881f52453d00 EFLAGS: 00010282
[11037.706886] RAX: ffff881fd6740680 RBX: 0000000000000000 RCX: ffff88204157ec00
[11037.792053] RDX: 0000000000000084 RSI: 0000000001f5327a RDI: 0000000000000000
[11037.877221] RBP: ffff881f52453d10 R08: ffff881f5327abe0 R09: 0000000000000000
[11037.962394] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88204157f800
[11038.024995] ixgbe 0000:84:00.0: eth6: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[11038.025144] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[11038.211671] R13: 0000000000000084 R14: 0000000000000000 R15: 0000000000000000
[11038.296842] FS: 0000000000000000(0000) GS:ffff88204f000000(0000) knlGS:0000000000000000
[11038.393430] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11038.461988] CR2: 0000000000000000 CR3: 0000001fd686d000 CR4: 00000000001407f0
[11038.547156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11038.632326] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11038.717496] Process kworker/0:0 (pid: 14045, threadinfo ffff881f52452000, task ffff882034d61950)
[11038.822392] Stack:
[11038.846298] 0000000000000084 ffff881fd6740680 ffff881f52453d30 ffffffffa006618a
[11038.934688] 0000000001f5327a ffff882035e23e00 ffff881f52453d50 ffffffffa0066442
[11039.023078] ffff881f52453d70 ffff881fd6740680 ffff881f52453d70 ffffffffa0072072
[11039.111465] Call Trace:
[11039.140571] [<ffffffffa006618a>] vfio_device_get+0x12/0x30 [vfio]
[11039.214324] [<ffffffffa0066442>] vfio_device_get_from_dev+0x19/0x1f [vfio]
[11039.297425] [<ffffffffa0072072>] vfio_pci_dmar_error_handler+0x13/0x4a [vfio_pci]
[11039.387796] [<ffffffff81420cc6>] dmar_fault_do_one+0xd4/0xf1
[11039.456366] [<ffffffff8104175d>] process_one_work+0x1c2/0x311
[11039.525968] [<ffffffff81041568>] ? manage_workers+0x23a/0x24c
[11039.595566] [<ffffffff81420bf2>] ? dmar_get_fault_reason+0x52/0x52
[11039.670354] [<ffffffff81041b42>] worker_thread+0x26c/0x34a
[11039.736840] [<ffffffff810418d6>] ? process_scheduled_works+0x2a/0x2a
[11039.813710] [<ffffffff8104583a>] kthread+0x86/0x8e
[11039.871891] [<ffffffff81604bf4>] kernel_thread_helper+0x4/0x10
[11039.942524] [<ffffffff810457b4>] ? kthread_freezable_should_stop+0x4d/0x4d
[11040.025618] [<ffffffff81604bf0>] ? gs_change+0xb/0xb
[11040.085865] Code: 48 8b 00 48 8b 40 20 48 85 c0 74 0c 55 48 8b 7f 40 48 89 e5 ff d0 eb 08 48 c7 c0 ea ff ff ff c3 5d c3 55 48 89 e5 53 48 89 fb 52 <8b> 07 85 c0 75 11 be 2a 00 00 00 48 c7 c7 38 76 06 a0 e8 32 84
[11040.312869] RIP [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11040.388722] RSP <ffff881f52453d00>
[11040.430282] CR2: 0000000000000000
- Can someone please help me understand the DMAR/VFIO-related function calls in the backtrace, and why they are being invoked? I know what causes a DMAR error, but I am not sure how this could be happening, since none of the devices is managed by VFIO.
- Looking at the source code, it seems dmar_fault_do_one() is called from the interrupt handler dmar_fault(). I am curious: why is dmar_fault() not part of the stack trace?
- What is the significance of the "?" in front of some of the functions in the backtrace (e.g. dmar_get_fault_reason())?
Thank you,
Ahmed.
_______________________________________________ Kernelnewbies mailing list Kernelnewbies@xxxxxxxxxxxxxxxxx http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies