On Tue, Jun 18, 2024 at 8:47 AM Pranjal Shrivastava <praan@xxxxxxxxxx> wrote: > > On Tue, Jun 18, 2024 at 8:28 PM Rob Clark <robdclark@xxxxxxxxx> wrote: > > > > On Mon, Jun 17, 2024 at 10:33 AM Robin Murphy <robin.murphy@xxxxxxx> wrote: > > > > > > On 2024-06-17 5:18 pm, Rob Clark wrote: > > > > On Mon, Jun 17, 2024 at 6:07 AM Robin Murphy <robin.murphy@xxxxxxx> wrote: > > > >> > > > >> On 04/06/2024 4:01 pm, Rob Clark wrote: > > > >>> From: Rob Clark <robdclark@xxxxxxxxxxxx> > > > >>> > > > >>> Parse out the bitfields for easier-to-read fault messages. > > > >>> > > > >>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx> > > > >>> --- > > > >>> Stephen was wanting easier to read fault messages.. so I typed this up. > > > >>> > > > >>> Resend with the new iommu list address > > > >>> > > > >>> drivers/iommu/arm/arm-smmu/arm-smmu.c | 53 +++++++++++++++++++++++++-- > > > >>> drivers/iommu/arm/arm-smmu/arm-smmu.h | 5 +++ > > > >>> 2 files changed, 54 insertions(+), 4 deletions(-) > > > >>> > > > >>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > >>> index c572d877b0e1..06712d73519c 100644 > > > >>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > >>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > >>> @@ -411,6 +411,8 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) > > > >>> unsigned long iova; > > > >>> struct arm_smmu_domain *smmu_domain = dev; > > > >>> struct arm_smmu_device *smmu = smmu_domain->smmu; > > > >>> + static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL, > > > >>> + DEFAULT_RATELIMIT_BURST); > > > >>> int idx = smmu_domain->cfg.cbndx; > > > >>> int ret; > > > >>> > > > >>> @@ -425,10 +427,53 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) > > > >>> ret = report_iommu_fault(&smmu_domain->domain, NULL, iova, > > > >>> fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); > > > >>> > > > >>> - if (ret == -ENOSYS) > > > >>> - dev_err_ratelimited(smmu->dev, > > > >>> - "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n", > > > >>> - fsr, iova, fsynr, cbfrsynra, idx); > > > >>> + if (ret == -ENOSYS && __ratelimit(&rs)) { > > > >>> + static const struct { > > > >>> + u32 mask; const char *name; > > > >>> + } fsr_bits[] = { > > > >>> + { ARM_SMMU_FSR_MULTI, "MULTI" }, > > > >>> + { ARM_SMMU_FSR_SS, "SS" }, > > > >>> + { ARM_SMMU_FSR_UUT, "UUT" }, > > > >>> + { ARM_SMMU_FSR_ASF, "ASF" }, > > > >>> + { ARM_SMMU_FSR_TLBLKF, "TLBLKF" }, > > > >>> + { ARM_SMMU_FSR_TLBMCF, "TLBMCF" }, > > > >>> + { ARM_SMMU_FSR_EF, "EF" }, > > > >>> + { ARM_SMMU_FSR_PF, "PF" }, > > > >>> + { ARM_SMMU_FSR_AFF, "AFF" }, > > > >>> + { ARM_SMMU_FSR_TF, "TF" }, > > > >>> + }, fsynr0_bits[] = { > > > >>> + { ARM_SMMU_FSYNR0_WNR, "WNR" }, > > > >>> + { ARM_SMMU_FSYNR0_PNU, "PNU" }, > > > >>> + { ARM_SMMU_FSYNR0_IND, "IND" }, > > > >>> + { ARM_SMMU_FSYNR0_NSATTR, "NSATTR" }, > > > >>> + { ARM_SMMU_FSYNR0_PTWF, "PTWF" }, > > > >>> + { ARM_SMMU_FSYNR0_AFR, "AFR" }, > > > >>> + }; > > > >>> + > > > >>> + pr_err("%s %s: Unhandled context fault: fsr=0x%x (", > > > >>> + dev_driver_string(smmu->dev), dev_name(smmu->dev), fsr); > > > >>> + > > > >>> + for (int i = 0, n = 0; i < ARRAY_SIZE(fsr_bits); i++) { > > > >>> + if (fsr & fsr_bits[i].mask) { > > > >>> + pr_cont("%s%s", (n > 0) ? "|" : "", fsr_bits[i].name); > > > >> > > > >> Given that SMMU faults have a high likelihood of correlating with other > > > >> errors, e.g. the initiating device also reporting that it got an abort > > > >> back, this much pr_cont is a recipe for an unreadable mess. Furthermore, > > > >> just imagine how "helpful" this would be when faults in two contexts are > > > >> reported by two different CPUs at the same time ;) > > > > > > > > It looks like arm_smmu_context_fault() is only used with non-threaded > > > > irq's. And this fallback is only used if driver doesn't register it's > > > > own fault handler. So I don't think this will be a problem. > > > > > > You don't think two different IRQs can fire on two different CPUs at the > > > same time (or close to)? It's already bad enough when multiple CPUs > > > panic and one has to pick apart line-by-line interleaving of the > > > registers/stacktraces - imagine how much more utterly unusable that > > > would be with bits of different dumps randomly pr_cont'ed together onto > > > the same line(s)... > > > > _different_ irq's, yes > > > > Anyways, the reason for pr_cont() was that there wasn't another > > reasonable way to decide where separator chars were needed with a > > single pr_err(). I could instead construct a string on stack and > > print that in a single call, but pr_cont() seemed like the more > > reasonable alternative. > > > > BR, > > -R > > The string approach sounds good to me, if possible, let's break this > out into a helper function, something like `arm_smmu_log_ctx_fault` > and put it under a module parameter, I guess? Not sure if this > requires a new Kconfig option, would like Robin's opinion on this. I did notice that qcom_smmu_context_fault() appeared recently, also adding similar pretty-print.. but only for things with a tbu driver (and a bit more open coded). So a helper would probably make sense. Less sure about mod param or Kconfig, but I tend to be of the opinion that the kernel already has too much configurability.. others may have a different view. BR, -R > Thanks, > Pranjal > > > > > > Even when unrelated stuff gets interleaved because other CPUs just > > > happen to be calling printk() at the same time for unrelated reasons > > > it's still annoying, and pr_cont makes a bigger mess than not. > > > >> I'd prefer to retain the original message as-is, so there is at least > > > >> still an unambiguous "atomic" view of a fault's entire state, then > > > >> follow it with a decode more in the style of arm64's ESR logging. TBH I > > > >> also wouldn't disapprove of hiding the additional decode behind a > > > >> command-line/runtime parameter, since a fault storm can cripple a system > > > >> enough as it is, without making the interrupt handler spend even longer > > > >> printing to a potentially slow console. > > > > > > > > It _is_ ratelimited. But we could perhaps use a higher loglevel (pr_debug?) > > > > > > Yeah, I'd have no complaint with pr_debug/dev_dbg either, if that suits > > > your use case. True that the ratelimit may typically mitigate the > > > overall impact, but still in the worst case with a sufficiently slow > > > console and/or a sufficiently large amount to print per __ratelimit() > > > call, it can end up being slow enough to stay below the threshold. Don't > > > ask me how I know that :) > > > > > > Thanks, > > > Robin.