Re: [PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Connor Abbott <cwabbott0@xxxxxxxxx> · Tue, 28 Jan 2025 22:34:41 -0500

On Tue, Jan 28, 2025 at 10:25 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> On Tue, Jan 28, 2025 at 6:31 PM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 28, 2025 at 9:21 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 28, 2025 at 2:15 PM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 28, 2025 at 5:10 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Jan 28, 2025 at 3:08 AM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Jan 23, 2025 at 03:14:16PM -0500, Connor Abbott wrote:
> > > > > > > On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote:
> > > > > > > > > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap@xxxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
> > > > > > > > > > The context would remain stalled till we write to CBn_RESUME. Which is done
> > > > > > > > > > in qcom_adreno_smmu_resume_translation(). For a stalled context further
> > > > > > > > > > transactions are not processed and we shouldn't see further faults and
> > > > > > > > > > or fault inerrupts. Do you observe faults with stalled context?
> > > > > > > > >
> > > > > > > > > Yes. I've observed that on MMU-500 writing RESUME before the interrupt
> > > > > > > > > has been cleared doesn't clear SS. This happened with v2 in the case
> > > > > > > > > where there was already a devcoredump and drm/msm called
> > > > > > > > > qcom_adreno_smmu_resume_translation() immediately from its fault
> > > > > > > > > handler, and we'd get a storm of unhandled interrupts until it was
> > > > > > > > > disabled. Given that the architecture spec says we're supposed to
> > > > > > > > > clear the interrupt first this may have been an attempt to "help"
> > > > > > > > > developers.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Just to clarify, present sequence is:
> > > > > > > > 1. context fault with stall enable. FSR.SS set.
> > > > > > > > 2. Report fault to client
> > > > > > > > 3. resume/terminate stalled transaction
> > > > > > > > 4. clear FSR
> > > > > > > >
> > > > > > > > At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS
> > > > > > > > cleared and interrupt storm is observed.
> > > > > > >
> > > > > > > With #2->#3->#4 FSR.SS is *not* cleared and there is a subsequent
> > > > > > > interrupt storm with only FSR.SS asserted. With #4->#2->#3 there is no
> > > > > > > interrupt storm. From this we conclude that MMU-500 does not clear
> > > > > > > FSR.SS unless #4 happens before #3.
> > > > > > >
> > > > > > Thanks Connor for clarification. I get your point now. I think it's
> > > > > > expected interrupt storm with #2->#3->#4 sequence is expected.  With
> > > > > > CONFIG_ARM_SMMU_QCOM_DEBUG enabled, context fault follows the sequence of
> > > > > > #1->#2->#4->#3, which is spec recommended.
> > > > > >
> > > > > > so, common fault handler can be modified to follow the same sequence, but I
> > > > > > have concern regarding clearing FSR before reporting fault to client.
> > > > > > qcom_adreno_smmu_get_fault_info() is an example I gave but other client
> > > > > > handler may have similar expecation of fault register not changed before
> > > > > > client fault handler is called.
> > > > >
> > > > > Simple solution would be to move the clearing of FSR to after the
> > > > > fault is reported.  It doesn't really matter if it is before or after,
> > > > > as long as it happens before the irq handler returns, AFAIU.  And
> > > > > drm/msm will collect the fault info from the irq handler.
> > > >
> > > > As I said in the earlier mail: "From this we conclude that MMU-500
> > > > does not clear FSR.SS unless #4 happens before #3." #4 is clearing FSR
> > > > and #3 is writing RESUME. So no, unfortunately it does actually matter
> > > > and we get a storm of unhandled IRQs if we don't clear FSR first. Your
> > > > solution is what v2 did and it didn't work.
> > >
> > > So, just clear FSR also in the resume path
> >
> > Then we'd run the risk of incorrectly accepting a second fault if it
> > happens immediately after we resume. Not terrible for our use-case
> > where we only really need the first fault to be accurate and we
> > disable stall-on-fault afterwards, but if you ever leave it on then it
> > would result in another interrupt storm.
> >
>
> Howso? We'd still be ensuring that #4 happens before #3?  If needed we
> can add a param to resume_translation() so drm can tell it to only
> clear FSR if it calls resume_translation from the interrupt handler,
> or something like that.  But I don't want to lose capturing the FSR,
> and I don't think we need to.
>
> BR,
> -R

Because we could clear FSR in the resume handler (ok), then resume,
then fault again, then clear FSR again in arm_smmu_context_fault() and
unintentionally acknowledge the fault without a corresponding resume.
Again, it's not possible with drm/msm after this series since we
disable stall-on-fault before resuming but still this means it's very
fragile.

To be clear, FSR should already be frozen until we resume, if the
SMMUv2 architecture pseudocode is to be believed. So any fault that's
recorded in the devcoredump will still be accurate with this patch,
just subsequent faults might not be accurate.

Connor