> From: Dexuan Cui
> Sent: Wednesday, August 18, 2021 2:08 PM
>
> > From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Sent: Wednesday, July 21, 2021 2:17 PM
> > To: Dexuan Cui <decui@xxxxxxxxxxxxx>; Saeed Mahameed
> >
> > On Mon, Jul 19 2021 at 20:33, Dexuan Cui wrote:
> > > This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
> > > issue is in the IOMMU Interrupt Remapping mechanism rather in the
> > > NIC driver. I just don't understand why bringing the CPUs online and
> > > offline can work around the issue. I'm trying to dump the IOMMU IR
> > > table entries to look for any error.
> >
> > can you please enable GENERIC_IRQ_DEBUGFS and provide the output of
> >
> > cat /sys/kernel/debug/irq/irqs/$THENICIRQS
> >
> > Thanks,
> >
> > tglx
>
> Sorry for the late response! I checked the below sys file, and the output is
> exactly the same in the good/bad cases -- in both cases, I use maxcpus=8;
> the only difference in the good case is that I online and then offline CPU 8~31:
> for i in `seq 8 31`; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
> for i in `seq 8 31`; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
>
> # cat /sys/kernel/debug/irq/irqs/209
> ...

I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost" but
it didn't help. Only "intremap=off" works around the no-interrupt issue.

When the no-interrupt issue happens, irq 209's effective_affinity_list is 5.
I modified modify_irte() to print irte->low and irte->high, and I also
printed the irte_index for irq 209. They all looked normal to me, and they
were exactly the same in the bad case and the good case. It looks like, with
"intremap=on maxcpus=8", MSI-X on CPU5 can't work for the NIC device (MSI-X
on CPU5 works for other devices like an NVMe controller), and somehow
"onlining and then offlining CPU 8~31" can "fix" the issue, which is really
weird.

Thanks,
Dexuan
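
P.S. For anyone who wants to repeat the good/bad comparison described above, here is a minimal sketch of one way to capture the per-IRQ state into a file so the two cases can be diffed. It is only an illustration, not the exact procedure used in this thread. The function name snapshot_irq and the DEBUGFS_IRQ_DIR / PROC_IRQ_DIR overrides are hypothetical; the debugfs path is the one quoted above, and the sketch assumes /proc/irq/<N>/effective_affinity_list is present on the running kernel.

```shell
# Sketch: save the debugfs dump and the effective affinity of one IRQ.
# snapshot_irq, DEBUGFS_IRQ_DIR and PROC_IRQ_DIR are hypothetical names
# introduced for this example; the defaults point at the usual locations.
snapshot_irq() {
    irq=$1      # IRQ number, e.g. 209 in this thread
    out=$2      # file that receives the snapshot
    debugfs_dir=${DEBUGFS_IRQ_DIR:-/sys/kernel/debug/irq/irqs}
    proc_dir=${PROC_IRQ_DIR:-/proc/irq}
    {
        echo "== debugfs irq $irq =="
        # Requires CONFIG_GENERIC_IRQ_DEBUGFS and a mounted debugfs.
        cat "$debugfs_dir/$irq"
        echo "== effective_affinity_list =="
        cat "$proc_dir/$irq/effective_affinity_list"
    } > "$out"
}
```

Run as root, one would call "snapshot_irq 209 /tmp/irq209.bad" right after boot, then online and offline CPU 8~31, call "snapshot_irq 209 /tmp/irq209.good", and compare the two files with "diff -u". The output file names are only examples.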