On Tue, Jan 7, 2025 at 4:57 AM Will Deacon <will@xxxxxxxxxx> wrote:
>
> On Thu, Jan 02, 2025 at 10:32:31AM -0800, Rob Clark wrote:
> > From: Rob Clark <robdclark@xxxxxxxxxxxx>
> >
> > On mmu-500, stall-on-fault seems to stall all context banks, causing the
> > GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >
> > This fixes an issue with an older mesa bug taking out the system
> > because of GMU going off into the weeds.
> >
> > What we _think_ is happening is that, if the GPU generates 1000's of
> > faults at ~once (which is something that GPUs can be good at), it can
> > result in a sufficient number of stalled translations preventing other
> > transactions from entering the same TBU.
>
> MMU-500 is an implementation of the SMMUv2 architecture, so this feels
> upside-down to me. That is, it should always be valid to probe with
> the less specific "SMMUv2" compatible string (modulo hardware errata)
> and be limited to the architectural behaviour.

I should have been more specific and referred to qcom,smmu-v2.

> So what is it about MMU-500 that means stalling doesn't work when compared
> to any other SMMUv2 implementation?

Well, I have a limited # of data points, in the sense that there aren't
too many a6xx devices from before the switch to qcom,smmu-500. But I have
access to crash metrics for a lot of sc7180 devices (qcom,smmu-v2), and
I've been unable to find any signs of this sort of stall-related issue.

So while I can't say with 100% certainty that this is qcom,smmu-500 vs
qcom,smmu-v2, rather than some other change in later gens that used
qcom,smmu-500 or some other factor, I'm not sure what other conclusion
to draw.

BR,
-R