On 12/24/2023 6:47 PM, Lukas Wunner wrote:
On Sun, Dec 24, 2023 at 12:06:57AM -0500, Ethan Zhao wrote:
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1423,6 +1423,13 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
while (qi->desc_status[wait_index] != QI_DONE) {
+ /*
+ * if the devTLB invalidation target device is gone, don't wait
+ * anymore, it might take up to 1min+50%, causes system hang.
+ */
+ if (type == QI_DIOTLB_TYPE && iommu->flush_target_dev)
+ if (!pci_device_is_present(to_pci_dev(iommu->flush_target_dev)))
+ break;
As a general approach, this is much better now.
Please combine the nested if-clauses into one.
That would be harder to read ?
Please amend the code comment with a spec reference, i.e.
"(see Implementation Note in PCIe r6.1 sec 10.3.1)"
so that readers of the code know where the magic number "1min+50%"
is coming from.
Yup.
Is flush_target_dev guaranteed to always be a pci_dev?
yes, as Baolu said, only PCI and ATS capable device supports
devTLB invalidation operation, this is checked by its caller path.
I'll let iommu maintainers comment on whether storing a flush_target_dev
pointer is the right approach. (May store a back pointer from
struct intel_iommu to struct device_domain_info?)
One of them, wonder which one is better, but device_domain_info
is still per device...seems no good to back it there.
Maybe move the "to_pci_dev(iommu->flush_target_dev)" lookup outside the
loop to avoid doing this over and over again?
hmm. that is a macro renam of container_of(), exactly, doesn't matter.
right ?
I think we still have a problem here if the device is not removed
but simply takes a long time to respond to Invalidate Requests
(as it is permitted to do per the Implementation Note). We'll
busy-wait for the completion and potentially run into the watchdog's
time limit again. So I think you or someone else in your org should
add OKRs to refactor the code so that it sleeps in-between polling
refactor code would be long story, so far still a quick fix for the issue.
and I think developers have other justifiction or conern about the
non-sync version, once again, thanks for your comment.
regards,
Ethan
for Invalidate Completions (instead of busy-waiting with interrupts
disabled).
Thanks,
Lukas