Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected

Ethan Zhao <haifeng.zhao@xxxxxxxxxxxxxxx> · Thu, 14 Dec 2023 10:40:20 +0800

On 12/13/2023 7:54 PM, Robin Murphy wrote:
On 13/12/2023 10:44 am, Lukas Wunner wrote:
On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
For endpoint devices connected to the system via hotplug-capable ports,
users could request a warm reset of the device by flapping the device's
link through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that they could fiddle with registers in config space?

as the pciehp_ist() DLLSC interrupt sequence response, pciehp will
unload the device driver and then power the device off, thus causing an
IOMMU devTLB flush request to be sent to the device and a long wait for
completion/timeout in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does
it cause a hard lockup?  The dmesg excerpt you've provided shows a 12
*second* delay between hot removal and watchdog reaction.

The PCIe spec only requires an endpoint to respond to an ATS 
invalidate within a rather hilarious 90 seconds, so it's primarily a 
question of how patient the root complex and bridges in between are 
prepared to be.

The reported issue only reproduces with an endpoint device connected to
the system via a PCIe switch (one that only has the read tracking
feature). Those switches do not seem to be aware of ATS transactions,
while the root port is aware of ATS when the ATS transaction is broken
(invalidation sent, but link down, device removed, etc.). But I didn't
find any public spec about that.

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that is set to
pci_channel_io_perm_failure and then powered off in

This doesn't seem to be a proper fix.  It will work most of the time
but not always.  A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards.  In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.

Yeah, I think we have a subtle but fundamental issue here in that the 
iommu_release_device() callback is hooked to 
BUS_NOTIFY_REMOVED_DEVICE, so in general probably shouldn't be 
assuming it's safe to do anything with the device itself *after* it's 
already been removed from its bus - this step is primarily about 
cleaning up any of the IOMMU's own state relating to the given device.

I think if we want to ensure ATCs are invalidated on hot-unplug we 
need an additional pre-removal notifier to take care of that, and that 
step would then want to distinguish between an orderly removal where 
cleaning up is somewhat meaningful, and a surprise removal where it 
definitely isn't.
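
To make the split described above concrete, here is a minimal userspace
sketch, not kernel code. Every name in it (mock_iommu_dev,
mock_pre_removal(), mock_release_device(), the removal-kind enum) is
hypothetical and made up for illustration; no such pre-removal hook is
claimed to exist in the tree. It only models the ordering: ATC
invalidation is attempted in a pre-removal step and only for an orderly
removal, while the release step touches IOMMU-side state only.

```c
#include <stdbool.h>

/* Hypothetical: how the device is going away. */
enum mock_removal_kind {
	MOCK_REMOVAL_ORDERLY,	/* device is still reachable */
	MOCK_REMOVAL_SURPRISE,	/* device is already gone */
};

/* Hypothetical per-device bookkeeping, standing in for IOMMU state. */
struct mock_iommu_dev {
	bool atc_invalidated;		/* invalidation was sent to the device */
	bool iommu_state_released;	/* IOMMU-side state was torn down */
};

/*
 * Hypothetical pre-removal step: runs while the device is still on its
 * bus. Talking to the device only makes sense for an orderly removal.
 */
static void mock_pre_removal(struct mock_iommu_dev *d,
			     enum mock_removal_kind kind)
{
	if (kind == MOCK_REMOVAL_ORDERLY)
		d->atc_invalidated = true;
	/* Surprise removal: nothing to flush, the device cannot answer. */
}

/*
 * Stand-in for the release step: cleans up the IOMMU's own state and
 * never touches the device itself.
 */
static void mock_release_device(struct mock_iommu_dev *d)
{
	d->iommu_state_released = true;
}
```

In this model both removal kinds end with the IOMMU state released, but
only the orderly path ever sends anything to the device.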

So, at least, we should check the device state before issuing a devTLB
invalidation.
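
A rough userspace model of that check, for illustration only. The mock_*
types and functions are assumptions invented here, not the patch itself;
the enum values mirror the kernel's enum pci_channel_state, and
mock_dev_is_disconnected() imitates what pci_dev_is_disconnected() tests
(error_state equal to pci_channel_io_perm_failure). As noted earlier in
the thread, this only narrows the window and does not close it.

```c
#include <stdbool.h>

/* Stand-in for enum pci_channel_state (same numeric values). */
enum mock_channel_state {
	MOCK_IO_NORMAL = 1,
	MOCK_IO_FROZEN = 2,
	MOCK_IO_PERM_FAILURE = 3,
};

/* Hypothetical minimal device: error state plus a flush counter. */
struct mock_pci_dev {
	enum mock_channel_state error_state;
	int flushes_sent;
};

/* Imitates pci_dev_is_disconnected(): true once marked perm failure. */
static bool mock_dev_is_disconnected(const struct mock_pci_dev *pdev)
{
	return pdev->error_state == MOCK_IO_PERM_FAILURE;
}

/*
 * Hypothetical flush wrapper: skip the request when the device is
 * already marked disconnected. Returns true if a flush was issued.
 */
static bool mock_flush_dev_tlb(struct mock_pci_dev *pdev)
{
	if (mock_dev_is_disconnected(pdev))
		return false;	/* nobody would answer the invalidation */

	pdev->flushes_sent++;
	return true;
}
```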

Thanks,

Ethan

Thanks,
Robin.