On Tue, Jan 07, 2025 at 07:26:44AM -0800, Rob Clark wrote:
> On Tue, Jan 7, 2025 at 4:57 AM Will Deacon <will@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 02, 2025 at 10:32:31AM -0800, Rob Clark wrote:
> > > From: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >
> > > On mmu-500, stall-on-fault seems to stall all context banks, causing the
> > > GMU to misbehave.  So limit this feature to smmu-v2 for now.
> > >
> > > This fixes an issue with an older mesa bug taking out the system
> > > because of GMU going off into the weeds.
> > >
> > > What we _think_ is happening is that, if the GPU generates 1000's of
> > > faults at ~once (which is something that GPUs can be good at), it can
> > > result in a sufficient number of stalled translations preventing other
> > > transactions from entering the same TBU.
> >
> > MMU-500 is an implementation of the SMMUv2 architecture, so this feels
> > upside-down to me. That is, it should always be valid to probe with
> > the less specific "SMMUv2" compatible string (modulo hardware errata)
> > and be limited to the architectural behaviour.
>
> I should have been more specific and referred to qcom,smmu-v2
>
> > So what is it about MMU-500 that means stalling doesn't work when compared
> > to any other SMMUv2 implementation?
>
> Well, I have a limited # of data points, in the sense that there
> aren't too many a6xx devices prior to the switch to qcom,smmu-500..
> but I have access to crash metrics for a lot of sc7180 devices
> (qcom,smmu-v2), and I've been unable to find any signs of this sort of
> stall related issue.
>
> So maybe I can't 100% say this is qcom,smmu-500 vs qcom,smmu-v2, vs
> some other change in later gens that used qcom,smmu-500 or some other
> factor, I'm not sure what other conclusion to draw.

Might it be that v2 was an actual hw, but mmu-500 is somehow virtualized?
And as such by these stalls we might be observing some kind of FW bug in
hyp?

>
> BR,
> -R

-- 
With best wishes
Dmitry