Re: arm64: Getting continuous PCIe "CmpltTO" AER from network card in kdump kernel

On 2020-03-26 1:36 pm, Prabhakar Kushwaha wrote:
On Mon, Mar 23, 2020 at 10:28 PM Robin Murphy <robin.murphy@xxxxxxx> wrote:

On 2020-03-23 3:21 pm, Prabhakar Kushwaha wrote:
Hi All,

I am facing an issue on Marvell's ARM64 ThunderX2 with the kdump kernel.
Here the network card continuously reports the following AER error:
[  100.839168] igb 0000:09:00.1: AER: aer_status: 0x00004000, aer_mask: 0x00000000
[  100.846463] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
[  100.861491] igb 0000:09:00.1: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[  100.869400] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
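
For readers decoding the log above, here is a minimal sketch of how the two register values can be read. It assumes the standard bit layout of the PCIe AER Uncorrectable Error Status and Severity registers; the struct and helper names (aer_uncor_bit, aer_uncor_name, aer_bit_set) are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one named bit of the AER uncorrectable registers. */
struct aer_uncor_bit {
	unsigned int bit;
	const char *name;
};

/* Commonly reported bits of the Uncorrectable Error Status/Severity registers. */
static const struct aer_uncor_bit aer_uncor_bits[] = {
	{  4, "Data Link Protocol Error" },
	{  5, "Surprise Down Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },	/* logged as "CmpltTO" above */
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/* Return a descriptive name for a bit position, or a fallback string. */
static inline const char *aer_uncor_name(unsigned int bit)
{
	size_t i;

	for (i = 0; i < sizeof(aer_uncor_bits) / sizeof(aer_uncor_bits[0]); i++)
		if (aer_uncor_bits[i].bit == bit)
			return aer_uncor_bits[i].name;
	return "Unknown or reserved";
}

/* Test a single bit of a 32-bit register value. */
static inline bool aer_bit_set(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);
	return (reg >> bit) & 1u;
}
```

With the values from the log, aer_bit_set(0x00004000, 14) is true and aer_uncor_name(14) gives "Completion Timeout", while aer_bit_set(0x00062011, 14) is false. A severity bit of 0 means the error is classed as non-fatal, so the timeout reported here is an uncorrectable but non-fatal error.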

This error is not 100% reproducible. It happens in about 1 out of 4 tries.

This error goes away in the following two scenarios:
A) Set the IOMMU to bypass mode via bootargs iommu.passthrough=1
B) Wait for ~100ms in arm_smmu_device_reset() in drivers/iommu/arm-smmu-v3.c
          if (reg & CR0_SMMUEN) {
                  dev_warn(smmu->dev, "SMMU currently enabled! Resetting...\n");
                  WARN_ON(is_kdump_kernel() && !disable_bypass);
                  mdelay(100);  /* added delay */
                  arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
          }

  From A), it is clear that the problem is related to the IOMMU.
  From B), it looks like the network card is still active during boot of
the kdump kernel and has sent some requests over PCIe.
As the GBPA_ABORT bit is set, no response/completion comes back to the
PCIe controller, hence the "CmpltTO" error.

Ideally, before setting the GBPA_ABORT bit, there should be some check for
active transactions. If that is not possible, a wait should be done to
ensure that no pending transactions are left.

In general there is no way to check for active transactions, and even if
there were, waiting for them to finish could mean waiting forever (if,
say, a device is continuously streaming to/from a ring buffer).

Why has no such delay been considered?

The main aim here is to block any DMA left over from the crashed kernel
as quickly as possible, to minimise any further potential corruption of
memory (consider if a device was left writing to an IOMMU virtual
address that happened to have the same value as some physical address in
the crash kernel's reserved memory). The fact that an arbitrary delay
happens to give a 'nicer' result in one particular situation on one
particular platform is neither here nor there in general.
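
The aliasing concern above can be sketched as a simple range check. This is only an illustration, assuming that traffic passing through a bypassed SMMU is not translated, so a stale IOMMU virtual address is treated directly as a physical address. The type and function names (addr_range, ranges_overlap, stale_dma_hits_reservation) are hypothetical and not taken from the SMMU driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, start + size); assumed not to wrap. */
struct addr_range {
	uint64_t start;
	uint64_t size;
};

/* True if the two ranges share at least one address. */
static inline bool ranges_overlap(struct addr_range a, struct addr_range b)
{
	assert(a.size > 0 && b.size > 0);
	return a.start < b.start + b.size && b.start < a.start + a.size;
}

/*
 * Hypothetical model: a device left running by the crashed kernel keeps
 * writing to stale_iova. If the SMMU lets the traffic through untranslated
 * (bypass), the write lands at that same numeric physical address, which
 * may lie inside the crash kernel's reserved memory. If the SMMU aborts
 * the traffic instead, the write never reaches memory.
 */
static inline bool stale_dma_hits_reservation(struct addr_range stale_iova,
					      struct addr_range crash_reserved,
					      bool smmu_bypass)
{
	return smmu_bypass && ranges_overlap(stale_iova, crash_reserved);
}
```

For example, with a made-up reservation of 512 MiB at 0x80000000 and a stale 64 KiB buffer at IOVA 0x8f000000, the function returns true in bypass and false when traffic is aborted, which is the reason for aborting as early as possible rather than waiting.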


I agree.
But we are depending on the kdump boot time and expecting devices to
reach an idle state before the GBPA_ABORT bit is set.

So (ideally) stop depending on that, because like I said it's fragile and doesn't generalise.

Adding a delay would be fair and would make this independent of kdump boot time.

And what delay value is "fair" and appropriate for any device on any system in any circumstance?

Besides, this is a *crash* kernel, so yeah, expect errors - something's
already gone badly wrong to get us here, and everything from then on is
merely a best-effort attempt to salvage what we can. Does it even make
sense to have AER enabled at this point?


I tried disabling AER in the kdump kernel, but it did not help: the
network device gets out of sync with respect to its TX unit, which causes
it to hang, and it never recovers from there. The same can happen with
other devices such as SATA.

Any devices that the kdump kernel wants to use need to be fully reset to get them into a sane state anyway, don't they? I mean, what if the crash was *caused* by one of those devices going wrong in the first place? Any devices that kdump *doesn't* care about shouldn't matter, since nothing should be unmasking their interrupts regardless of what state they're in.

Assume some descriptor or pagetable entry got corrupted that caused your network device to access an invalid physical address downstream of the SMMU and get an abort from that *before* the kdump kernel starts - is waiting an extra 100ms at any point after that going to help?

Robin.

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec


