Re: Decoding a Linux kernel oops panic due to DMAR error

Denis Kirjanov <kirjanov@xxxxxxxxx> · Mon, 3 Feb 2014 14:29:37 +0400

On 2/3/14, Ahmed A <ahmedcali@xxxxxxxxx> wrote:
> Hello,
>
> I have a server with onboard Intel 10G ports (82599). When I load the kernel
> module driver for these ports, everything is fine: I can see the newly
> created ethX devices using "ip addr show".  However, after I assign an IP
> address, and right after I issue the command to bring up the port, I get a
> kernel panic related to DMAR (DMA remapping) in the VFIO (Virtual Function
> I/O) module.  I am not even sure why I am getting this panic, since this
> Intel kernel module does not use VFIO.  I know the immediate cause of the
> panic: NULL is passed as a parameter to vfio_group_get(), which
> dereferences it.  I know NULL is passed because register RDI, which holds
> the first argument to a function, contains 0.
>
> Linux kernel 3.6.11
>
> Following is the stack trace of panic:

You should post this message to kvm@xxxxxxxxxxxxxxx and CC Alexey
Kardashevskiy <aik@xxxxxxxxx>, who added that function.

>
> # [11036.855410] BUG: unable to handle kernel [11036.887249] ixgbe
> 0000:84:00.0: eth6: detected SFP+: 3
> NULL pointer dereference at           (null)
> [11037.010224] IP: [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
> [11037.085047] PGD 1fd6b5b067 PUD 20404b1067 PMD 0
> [11037.140181] Oops: 0000 [#1] SMP
> [11037.178676] Modules linked in: ixgbe(O) nfsv3 autofs4 nfsd nfs_acl nfs
> lockd sunrpc vfio_pci vfio_iommu_type1 vfio i2c_mux i2c_smbus i2c_dev
> container ide_pci_generic ide_core uhci_hcd isci ata_generic
> [11037.393137] CPU 0
> [11037.414974] Pid: 14045, comm: kworker/0:0 Tainted: G           O 3.6.11
> [11037.539628] RIP: 0010:[<ffffffffa006615a>]  [<ffffffffa006615a>]
> vfio_group_get+0x9/0x27 [vfio]
> [11037.643521] RSP: 0018:ffff881f52453d00  EFLAGS: 00010282
> [11037.706886] RAX: ffff881fd6740680 RBX: 0000000000000000 RCX:
> ffff88204157ec00
> [11037.792053] RDX: 0000000000000084 RSI: 0000000001f5327a RDI:
> 0000000000000000
> [11037.877221] RBP: ffff881f52453d10 R08: ffff881f5327abe0 R09:
> 0000000000000000
> [11037.962394] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffff88204157f800
> [11038.024995] ixgbe 0000:84:00.0: eth6: NIC Link is Up 10 Gbps, Flow
> Control: RX/TX
> [11038.025144] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
> [11038.211671] R13: 0000000000000084 R14: 0000000000000000 R15:
> 0000000000000000
> [11038.296842] FS:  0000000000000000(0000) GS:ffff88204f000000(0000)
> knlGS:0000000000000000
> [11038.393430] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [11038.461988] CR2: 0000000000000000 CR3: 0000001fd686d000 CR4:
> 00000000001407f0
> [11038.547156] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [11038.632326] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [11038.717496] Process kworker/0:0 (pid: 14045, threadinfo ffff881f52452000,
> task ffff882034d61950)
> [11038.822392] Stack:
> [11038.846298]  0000000000000084 ffff881fd6740680 ffff881f52453d30
> ffffffffa006618a
> [11038.934688]  0000000001f5327a ffff882035e23e00 ffff881f52453d50
> ffffffffa0066442
> [11039.023078]  ffff881f52453d70 ffff881fd6740680 ffff881f52453d70
> ffffffffa0072072
> [11039.111465] Call Trace:
> [11039.140571]  [<ffffffffa006618a>] vfio_device_get+0x12/0x30 [vfio]
> [11039.214324]  [<ffffffffa0066442>] vfio_device_get_from_dev+0x19/0x1f
> [vfio]
> [11039.297425]  [<ffffffffa0072072>] vfio_pci_dmar_error_handler+0x13/0x4a
> [vfio_pci]
> [11039.387796]  [<ffffffff81420cc6>] dmar_fault_do_one+0xd4/0xf1
> [11039.456366]  [<ffffffff8104175d>] process_one_work+0x1c2/0x311
> [11039.525968]  [<ffffffff81041568>] ? manage_workers+0x23a/0x24c
> [11039.595566]  [<ffffffff81420bf2>] ? dmar_get_fault_reason+0x52/0x52
> [11039.670354]  [<ffffffff81041b42>] worker_thread+0x26c/0x34a
> [11039.736840]  [<ffffffff810418d6>] ? process_scheduled_works+0x2a/0x2a
> [11039.813710]  [<ffffffff8104583a>] kthread+0x86/0x8e
> [11039.871891]  [<ffffffff81604bf4>] kernel_thread_helper+0x4/0x10
> [11039.942524]  [<ffffffff810457b4>] ?
> kthread_freezable_should_stop+0x4d/0x4d
> [11040.025618]  [<ffffffff81604bf0>] ? gs_change+0xb/0xb
> [11040.085865] Code: 48 8b 00 48 8b 40 20 48 85 c0 74 0c 55 48 8b 7f 40 48
> 89 e5 ff d0 eb 08 48 c7 c0 ea ff ff ff c3 5d c3 55 48 89 e5 53 48 89 fb 52
> <8b> 07 85 c0 75 11 be 2a 00 00 00 48 c7 c7 38 76 06 a0 e8 32 84
> [11040.312869] RIP  [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
> [11040.388722]  RSP <ffff881f52453d00>
> [11040.430282] CR2: 0000000000000000
>
>
>
> - Can someone please help me understand the dmar/vfio related function calls
> in the backtrace, and why they are getting invoked?  I know what causes a
> DMAR error, but I am not sure how this could be happening, since none of
> the devices is managed by VFIO.
>
> - Looking at the source code, it seems dmar_fault_do_one() is called from
> the interrupt handler dmar_fault().  Why is dmar_fault() not part of the
> stack trace?
>
> - What is the significance of the "?" in front of some of the functions in
> the backtrace (e.g. dmar_get_fault_reason())?
>
> Thank you,
> Ahmed.
>
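
The reasoning in the quoted message (RDI holds the first argument and is 0,
and some backtrace entries carry a "?") can be checked mechanically. Below is
a minimal Python sketch, not part of any kernel tooling, that pulls those
pieces out of oops text. The helper names parse_registers(), parse_frames()
and faulting_bytes() are hypothetical, and the sample is a short excerpt of
the log above with the wrapped lines rejoined. It assumes the x86-64 oops
layout shown in this thread; other architectures or kernel versions may
format things differently.

```python
import re

# Excerpt of the oops quoted above (wrapped lines rejoined, Code: line shortened).
OOPS = """\
[11037.792053] RDX: 0000000000000084 RSI: 0000000001f5327a RDI: 0000000000000000
[11039.140571]  [<ffffffffa006618a>] vfio_device_get+0x12/0x30 [vfio]
[11039.387796]  [<ffffffff81420cc6>] dmar_fault_do_one+0xd4/0xf1
[11039.595566]  [<ffffffff81420bf2>] ? dmar_get_fault_reason+0x52/0x52
[11040.085865] Code: 55 48 89 e5 53 48 89 fb 52 <8b> 07 85 c0 75 11
"""

# "RDI: 0000000000000000" style register dumps (16 hex digits per value).
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})\b")
# "[<addr>] [? ]symbol+0xoff/0xsize" style backtrace entries.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\] (\? )?(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)"
)
CODE_RE = re.compile(r"Code: (.*)")


def parse_registers(text):
    """Return a dict mapping register name to its integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def parse_frames(text):
    """Return backtrace entries; 'reliable' is False for entries marked '?'."""
    frames = []
    for _addr, unsure, symbol, offset, size in FRAME_RE.findall(text):
        frames.append({
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
            # A "?" means the address was found on the stack, but the
            # kernel could not confirm it belongs to the live call chain.
            "reliable": not unsure,
        })
    return frames


def faulting_bytes(text):
    """Return (bytes_before, marked_byte) from the Code: line, or None."""
    match = CODE_RE.search(text)
    if match is None:
        return None
    tokens = match.group(1).split()
    for index, token in enumerate(tokens):
        if token.startswith("<"):  # the kernel brackets the byte at RIP
            return index, token.strip("<>")
    return None


if __name__ == "__main__":
    regs = parse_registers(OOPS)
    print("RDI (first argument on x86-64):", hex(regs["RDI"]))
    for frame in parse_frames(OOPS):
        mark = "" if frame["reliable"] else " (unconfirmed, marked '?')"
        print(f"{frame['symbol']}+{frame['offset']:#x}{mark}")
    print("faulting byte:", faulting_bytes(OOPS))
```

In the excerpt, the bytes "8b 07" at the marked position encode a load
through RDI (mov (%rdi),%eax), which agrees with the NULL first argument. The
excerpt starts at what appears to be the function prologue, so the nine bytes
before the marker line up with the +0x9 offset reported for vfio_group_get.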

-- 
Regards,
Denis
