On Thu, May 07, 2020 at 11:55:54AM +0100, Robin Murphy wrote: > On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote: > > On 2020-04-22 01:50, Sai Prakash Ranjan wrote: > > > Add stall implementation hook to enable stalling > > > faults on QCOM platforms which supports it without > > > causing any kind of hardware mishaps. Without this > > > on QCOM platforms, GPU faults can cause unrelated > > > GPU memory accesses to return zeroes. This has the > > > unfortunate result of command-stream reads from CP > > > getting invalid data, causing a cascade of fail. > > I think this came up before, but something about this rationale doesn't add > up - we're not *using* stalls at all, we're still terminating faulting > transactions unconditionally; we're just using CFCFG to terminate them with > a slight delay, rather than immediately. It's really not clear how or why > that makes a difference. Is it a GPU bug? Or an SMMU bug? Is this reliable > (or even a documented workaround for something), or might things start > blowing up again if any other behaviour subtly changes? I'm not dead set > against adding this, but I'd *really* like to have a lot more confidence in > it. Rob mentioned something about the "bus returning zeroes" before, but I agree that we need more information so that we can reason about this and maintain the code as the driver continues to change. That needs to be a comment in the driver, and I don't think "but android seems to work" is a good enough justification. There was some interaction with HUPCF as well. As a template, I'd suggest: > > > diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h > > > index 8d1cd54d82a6..d5134e0d5cce 100644 > > > --- a/drivers/iommu/arm-smmu.h > > > +++ b/drivers/iommu/arm-smmu.h > > > @@ -386,6 +386,7 @@ struct arm_smmu_impl { > > > int (*init_context)(struct arm_smmu_domain *smmu_domain); > > > void (*tlb_sync)(struct arm_smmu_device *smmu, int page, int sync, > > > int status); /* * Stall transactions on a context fault, where they will be terminated * in response to the resulting IRQ rather than immediately. This should * pretty much always be set to "false" as stalling can introduce the * potential for deadlock in most SoCs, however it is needed on Qualcomm * XXXX because YYYY. */ > > > + bool stall; Hmm, the more I think about this, the more I think this is an erratum workaround in disguise, in which case this could be better named... Will