Re: [PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Rob Clark <robdclark@xxxxxxxxx> · Tue, 28 Jan 2025 19:25:25 -0800

On Tue, Jan 28, 2025 at 6:31 PM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
>
> On Tue, Jan 28, 2025 at 9:21 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 28, 2025 at 2:15 PM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 28, 2025 at 5:10 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 28, 2025 at 3:08 AM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Jan 23, 2025 at 03:14:16PM -0500, Connor Abbott wrote:
> > > > > > On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote:
> > > > > > > > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
> > > > > > > > > The context would remain stalled till we write to CBn_RESUME. Which is done
> > > > > > > > > in qcom_adreno_smmu_resume_translation(). For a stalled context further
> > > > > > > > > transactions are not processed and we shouldn't see further faults and
> > > > > > > > > or fault inerrupts. Do you observe faults with stalled context?
> > > > > > > >
> > > > > > > > Yes. I've observed that on MMU-500 writing RESUME before the interrupt
> > > > > > > > has been cleared doesn't clear SS. This happened with v2 in the case
> > > > > > > > where there was already a devcoredump and drm/msm called
> > > > > > > > qcom_adreno_smmu_resume_translation() immediately from its fault
> > > > > > > > handler, and we'd get a storm of unhandled interrupts until it was
> > > > > > > > disabled. Given that the architecture spec says we're supposed to
> > > > > > > > clear the interrupt first this may have been an attempt to "help"
> > > > > > > > developers.
> > > > > > > >
> > > > > > >
> > > > > > > Just to clarify, present sequence is:
> > > > > > > 1. context fault with stall enable. FSR.SS set.
> > > > > > > 2. Report fault to client
> > > > > > > 3. resume/terminate stalled transaction
> > > > > > > 4. clear FSR
> > > > > > >
> > > > > > > At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS
> > > > > > > cleared and interrupt storm is observed.
> > > > > >
> > > > > > With #2->#3->#4 FSR.SS is *not* cleared and there is a subsequent
> > > > > > interrupt storm with only FSR.SS asserted. With #4->#2->#3 there is no
> > > > > > interrupt storm. From this we conclude that MMU-500 does not clear
> > > > > > FSR.SS unless #4 happens before #3.
> > > > > >
> > > > > Thanks Connor for clarification. I get your point now. I think it's
> > > > > expected interrupt storm with #2->#3->#4 sequence is expected.  With
> > > > > CONFIG_ARM_SMMU_QCOM_DEBUG enabled, context fault follows the sequence of
> > > > > #1->#2->#4->#3, which is spec recommended.
> > > > >
> > > > > so, common fault handler can be modified to follow the same sequence, but I
> > > > > have concern regarding clearing FSR before reporting fault to client.
> > > > > qcom_adreno_smmu_get_fault_info() is an example I gave but other client
> > > > > handler may have similar expecation of fault register not changed before
> > > > > client fault handler is called.
> > > >
> > > > Simple solution would be to move the clearing of FSR to after the
> > > > fault is reported.  It doesn't really matter if it is before or after,
> > > > as long as it happens before the irq handler returns, AFAIU.  And
> > > > drm/msm will collect the fault info from the irq handler.
> > >
> > > As I said in the earlier mail: "From this we conclude that MMU-500
> > > does not clear FSR.SS unless #4 happens before #3." #4 is clearing FSR
> > > and #3 is writing RESUME. So no, unfortunately it does actually matter
> > > and we get a storm of unhandled IRQs if we don't clear FSR first. Your
> > > solution is what v2 did and it didn't work.
> >
> > So, just clear FSR also in the resume path
>
> Then we'd run the risk of incorrectly accepting a second fault if it
> happens immediately after we resume. Not terrible for our use-case
> where we only really need the first fault to be accurate and we
> disable stall-on-fault afterwards, but if you ever leave it on then it
> would result in another interrupt storm.
>

Howso? We'd still be ensuring that #4 happens before #3?  If needed we
can add a param to resume_translation() so drm can tell it to only
clear FSR if it calls resume_translation from the interrupt handler,
or something like that.  But I don't want to lose capturing the FSR,
and I don't think we need to.

BR,
-R