Hi Leo,
Sorry for the delay - I'm on holiday this week, but since I've made the
mistake of glancing at my inbox I should probably save you from wasting
any more time...
On 2019-03-15 11:03 am, Auger Eric wrote:
Hi Leo,
+ Jean-Philippe
On 3/15/19 10:37 AM, Leo Yan wrote:
Hi Eric, Robin,
On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
[...]
If the NIC supports MSIs they are normally used. This can easily be
checked on the host by issuing "cat /proc/interrupts | grep vfio". Can you
check whether the guest receives any interrupts? I remember Robin
saying in the past that on Juno the MSI doorbell was in the PCI host
bridge window, and transactions towards the doorbell possibly could not
reach it since they were considered peer-to-peer.
I found Robin's explanation again. It was not related to the MSI IOVA
being within the PCI host bridge window, but rather to the RAM GPA
colliding with the host PCI config space:
"MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x40000000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them."
Below is some follow-up investigation on my side:
Firstly, I must admit I don't fully understand the paragraph above; based
on the description I am wondering whether I can use INTx mode instead and
whether that is enough to avoid this hardware pitfall.
The problem above is that during the assignment process the virtualizer
maps the whole of guest RAM through the IOMMU (plus the MSI doorbell on
ARM) so that the device, which is programmed with GPAs, can access the
whole of guest RAM. Unfortunately, if the device emits a DMA request with
an IOVA of 0x40000000, that IOVA is interpreted by the Juno RC as a
transaction towards the PCIe config space. So the DMA request never goes
beyond the RC, never reaches the IOMMU and never reaches guest RAM. The
net effect is that the device cannot reach part of the guest RAM.
That's how I interpret the statement above. I don't know the details of
the collision, though, as I don't have access to this hardware, and I
don't know whether the problem still exists on the r2 hardware either.
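For illustration, a minimal sketch of the kind of 1:1 GPA-to-IOVA mapping a
virtualizer sets up through VFIO for an assigned device. This is not the
actual QEMU/kvmtool code; container_fd, guest_ram and the GUEST_RAM_*
constants are placeholders. The point is just that the IOVA programmed into
the SMMU is the guest physical address itself, so with guest RAM based at
0x40000000 the device ends up emitting DMA to IOVA 0x40000000:

/*
 * Sketch only: assumes an open VFIO type1 container fd and a host mmap of
 * guest RAM ("guest_ram"); names and sizes are placeholders, not taken
 * from any real VMM.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define GUEST_RAM_BASE 0x40000000UL     /* QEMU's default ARM guest RAM base */
#define GUEST_RAM_SIZE (1UL << 30)

static int map_guest_ram(int container_fd, void *guest_ram)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)guest_ram; /* host userspace view of guest RAM */
        map.iova  = GUEST_RAM_BASE;       /* IOVA == GPA: device DMAs use guest addresses */
        map.size  = GUEST_RAM_SIZE;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

On Juno those DMAs abort at the RC, so the SMMU mapping above is never even
consulted.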
The short answer is that if you want PCI passthrough to work on Juno,
the guest memory map has to look like a Juno.
The PCIe root complex uses an internal lookup table to generate
appropriate AXI attributes for outgoing PCIe transactions; unfortunately
this has no notion of 'default' attributes, so addresses *must* match
one of the programmed windows in order to be valid. From memory, EDK2
sets up a 2GB window covering the lower DRAM bank, an 8GB window
covering the upper DRAM bank, and a 1MB (or thereabouts) window covering
the GICv2m region with Device attributes. Any PCIe transactions to
addresses not within one of those windows will be aborted by the RC
without ever going out to the AXI side where the SMMU lies (and I think
anything matching the config space or I/O space windows or a region
claimed by a BAR will be aborted even earlier as a peer-to-peer attempt
regardless of the AXI Translation Table setup).
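As a toy model of that lookup (the window bases, sizes and attribute names
below are assumptions reconstructed from this description, not values read
back from real Juno firmware), the decision the RC effectively makes for each
outgoing address looks something like:

#include <stdint.h>

enum axi_attrs { AXI_ABORT, AXI_NORMAL_MEM, AXI_DEVICE };

static const struct {
        uint64_t base, size;
        enum axi_attrs attrs;
} juno_rc_windows[] = {
        { 0x080000000ULL, 2ULL << 30, AXI_NORMAL_MEM }, /* lower DRAM bank (assumed base) */
        { 0x880000000ULL, 8ULL << 30, AXI_NORMAL_MEM }, /* upper DRAM bank (assumed base) */
        { 0x02c1c0000ULL, 1ULL << 20, AXI_DEVICE     }, /* GICv2m doorbell (assumed base) */
};

static enum axi_attrs rc_lookup(uint64_t addr)
{
        for (unsigned int i = 0;
             i < sizeof(juno_rc_windows) / sizeof(juno_rc_windows[0]); i++)
                if (addr - juno_rc_windows[i].base < juno_rc_windows[i].size)
                        return juno_rc_windows[i].attrs;
        return AXI_ABORT;       /* no 'default' attributes to fall back on */
}

With QEMU's default guest memory map, a DMA towards IOVA 0x40000000 ends up
in the AXI_ABORT case, consistent with the aborts described above (in reality
it may be claimed even earlier as a config space / peer-to-peer access, as
noted).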
You could potentially modify the firmware to change the window
configuration, but the alignment restrictions make it awkward. I've only
ever tested passthrough on Juno using kvmtool, which IIRC already has
guest RAM in an appropriate place (and is trivially easy to hack if not)
- I don't remember if I ever actually tried guest MSI with that.
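To make that concrete, here is a minimal sketch of the placement constraint
from the VMM's point of view, assuming the window layout described above (the
constant names are illustrative, not kvmtool's actual macros): the chosen
guest RAM base and size just have to land entirely inside one of the host
DRAM windows.

#include <stdbool.h>
#include <stdint.h>

/* Host DRAM windows as described above; bases/sizes are assumptions. */
#define JUNO_LO_DRAM_BASE  0x080000000ULL
#define JUNO_LO_DRAM_SIZE  (2ULL << 30)
#define JUNO_HI_DRAM_BASE  0x880000000ULL
#define JUNO_HI_DRAM_SIZE  (8ULL << 30)

static bool ram_within(uint64_t base, uint64_t size,
                       uint64_t win_base, uint64_t win_size)
{
        return base >= win_base && size <= win_size &&
               base - win_base <= win_size - size;
}

/* A guest RAM placement is DMA-safe on Juno only if it sits wholly inside
 * one of the DRAM windows the RC knows about. */
static bool guest_ram_placement_ok(uint64_t base, uint64_t size)
{
        return ram_within(base, size, JUNO_LO_DRAM_BASE, JUNO_LO_DRAM_SIZE) ||
               ram_within(base, size, JUNO_HI_DRAM_BASE, JUNO_HI_DRAM_SIZE);
}

/*
 * guest_ram_placement_ok(0x40000000, 1ULL << 30) -> false (QEMU's default base)
 * guest_ram_placement_ok(0x80000000, 1ULL << 30) -> true  (lower DRAM bank)
 */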
Robin.
But when I wanted to fall back to INTx mode I found that kvmtool has an
issue supporting it, which is why I wrote the patch [1] to fix it.
Alternatively, we can also set the NIC driver module parameter
'sky2.disable_msi=1' to disable MSI completely and use only INTx mode.
Anyway, I finally got INTx mode enabled and I can see the interrupt is
registered successfully on both the host and the guest:
Host side:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
41: 0 0 0 0 0 0 GICv2 54 Level arm-pmu
42: 0 0 0 0 0 0 GICv2 58 Level arm-pmu
43: 0 0 0 0 0 0 GICv2 62 Level arm-pmu
45: 772 0 0 0 0 0 GICv2 171 Level vfio-intx(0000:08:00.0)
Guest side:
# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
12: 0 0 0 0 0 0 GIC-0 96 Level eth1
So you can see the host receives the interrupts, but these are mainly
triggered before the vfio-pci driver is bound. After launching KVM, only a
very small number of interrupts are triggered on the host, and the guest
kernel also receives the corresponding virtual interrupts. For example, if
I run 'dhclient eth1' in the guest OS, the command stalls for a long time
(> 1 minute) before returning, and during that time both the host OS and
the guest OS receive 5~6 interrupts.
Based on this, I guess the interrupt forwarding path has been enabled. But
it seems the data packets are never actually sent out: I used wireshark to
capture packets, but could not find any packet leaving the NIC.
Another test I did was to shrink the memory space/IO/bus regions to less
than 0x40000000, to avoid placing guest memory IPAs at 0x40000000. But
this doesn't work either.
What would be worth trying is to move the base address of the guest RAM.
I think there was some recent work on this in kvmtool. Adding
Jean-Philippe to the loop.
Thanks
Eric
@Robin, could you help explain the hardware issue and review whether my
approaches are feasible on the Juno board? Thanks a lot for any suggestions.
I will dig further into the memory mapping and post my findings here.
Thanks,
Leo Yan
[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035055.html