Re: Kernel panic in __pci_enable_msix_range on Xen PV with PCI passthrough

Jan Beulich <jbeulich@xxxxxxxx> · Wed, 25 Aug 2021 17:55:09 +0200

On 25.08.2021 17:47, Marek Marczykowski-Górecki wrote:
> On Wed, Aug 25, 2021 at 05:33:54PM +0200, Jan Beulich wrote:
>> On 25.08.2021 17:24, Marek Marczykowski-Górecki wrote:
>>> On recent kernel I get kernel panic when starting a Xen PV domain with
>>> PCI devices assigned. This happens on 5.10.60 (worked on .54) and
>>> 5.4.142 (worked on .136): 
>>>
>>> [   13.683009] pcifront pci-0: claiming resource 0000:00:00.0/0
>>> [   13.683042] pcifront pci-0: claiming resource 0000:00:00.0/1
>>> [   13.683049] pcifront pci-0: claiming resource 0000:00:00.0/2
>>> [   13.683055] pcifront pci-0: claiming resource 0000:00:00.0/3
>>> [   13.683061] pcifront pci-0: claiming resource 0000:00:00.0/6
>>> [   14.036142] e1000e: Intel(R) PRO/1000 Network Driver
>>> [   14.036179] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
>>> [   14.036982] e1000e 0000:00:00.0: Xen PCI mapped GSI11 to IRQ13
>>> [   14.044561] e1000e 0000:00:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>> [   14.045188] BUG: unable to handle page fault for address: ffffc9004069100c
>>> [   14.045197] #PF: supervisor write access in kernel mode
>>> [   14.045202] #PF: error_code(0x0003) - permissions violation
>>> [   14.045211] PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 80100000febd4075
>>
>> I'm curious what lives at physical address FEBD4000. 
> 
> This is a third BAR of this device, related to MSI-X:
> 
> 00:04.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
>         Subsystem: Intel Corporation Device 0000
>         Physical Slot: 4
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 11
>         Region 0: Memory at feb80000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at feba0000 (32-bit, non-prefetchable) [size=128K]
>         Region 2: I/O ports at c080 [size=32]
>         Region 3: Memory at febd4000 (32-bit, non-prefetchable) [size=16K]
>         Expansion ROM at feb40000 [disabled] [size=256K]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>                 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 128 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
>                         TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
>         Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Kernel driver in use: pciback
>         Kernel modules: e1000e
> 
>> The maximum verbosity
>> hypervisor log may also have a hint as to why this is a read-only PTE.
> 
> I'll try, if that still makes sense.

I think the above data clarifies it already.

>>> [   14.045227] Oops: 0003 [#1] SMP NOPTI
>>> [   14.045234] CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: G        W         5.14.0-rc7-1.fc32.qubes.x86_64 #15
>>> [   14.045245] Workqueue: events work_for_cpu_fn
>>> [   14.045259] RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
>>> [   14.045271] Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
>>> [   14.045284] RSP: e02b:ffffc9004018bd50 EFLAGS: 00010212
>>> [   14.045290] RAX: ffffc9004069100c RBX: ffff88800ed412f8 RCX: ffffc9004069105c
>>> [   14.045296] RDX: 0000000000000001 RSI: 00000000000febd4 RDI: ffffc90040691000
>>> [   14.045302] RBP: 0000000000000003 R08: 0000000000000000 R09: 00000000febd404f
>>> [   14.045308] R10: 0000000000007ff0 R11: ffff88800ee8ae40 R12: ffff88800ed41000
>>> [   14.045313] R13: 0000000000000000 R14: 0000000000000040 R15: 00000000feba0000
>>> [   14.045393] FS:  0000000000000000(0000) GS:ffff888018400000(0000) knlGS:0000000000000000
>>> [   14.045401] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   14.045407] CR2: ffff8000007f5ea0 CR3: 0000000012f6a000 CR4: 0000000000000660
>>> [   14.045420] Call Trace:
>>> [   14.045431]  e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
>>> [   14.045479]  e1000_probe+0x41f/0xdb0 [e1000e]
>>
>> Otoh, from this it's pretty clear it's not a device Xen may have found
>> a need to access for its own purposes. If aforementioned address covers
>> (or is adjacent to) the MSI-X table of a device drive by this driver,
>> then it would also be helpful to know how many MSI-X entries the device
>> reports its table can have.
> 
> See above.
> 
> Does PCI passthrough for on PV support MSI-X at all?

It is supposed to work. The treatment by generic code shouldn't be overly
different from how MSI-X works for Dom0 (Xen specific code of course
differs).

> If so, I guess the issue is the kernel trying to write directly, instead
> of via some hypercall, right?

Indeed. Or to be precise - the kernel isn't supposed to be "writing" this
at all. It is supposed to make hypercalls which may result in such writes.
Such "mask everything" functionality imo is the job of the hypervisor
anyway when talking about PV environments; HVM is a different thing here.

Jan