Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

Rob Clark <robdclark@xxxxxxxxx> · Mon, 18 Sep 2017 14:45:21 -0400

On Mon, Sep 18, 2017 at 1:33 PM, Will Deacon <will.deacon@xxxxxxx> wrote:
> On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
>> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@xxxxxxx> wrote:
>> > Hi Rob,
>> >
>> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@xxxxxxx> wrote:
>> >> > Enabling stalling faults can result in hardware deadlock on poorly
>> >> > designed systems, particularly those with a PCI root complex upstream of
>> >> > the SMMU.
>> >> >
>> >> > Although it's not really Linux's job to save hardware integrators from
>> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> >> > clients) from hosing the system for everybody else, even if they might
>> >> > already be required to have elevated privileges.
>> >> >
>> >> > Given that the fault handling code currently executes entirely in IRQ
>> >> > context, there is nothing that can sensibly be done to recover from
>> >> > things like page faults anyway, so let's rip this code out for now and
>> >> > avoid the potential for deadlock.
>> >>
>> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> >> opt-in quirk (ie. disabled by default unless something in DT tells you
>> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> >> hw was having problems due to this feature.)
>> >>
>> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> >> halting the GPU so the driver's fault handler can dump some GPU state
>> >> on faults is enormously helpful for debugging and tracking down where
>> >> in the gpu cmdstream the fault was triggered.  In addition, we will
>> >> eventually want the ability to update pagetables from fault handler
>> >> and resuming the faulting transition.
>> >
>> > I'm not against reintroducing this, but it would certainly need to be
>> > opt-in, as you suggest. If we want to try to use stall faults to enable
>> > demand paging for DMA, then that means running core mm code to resolve
>> > the fault and we can't do that in irq context. Consequently, we have to
>> > hand this off to a thread, which means the hardware must allow the SS
>> > bit to remain set without immediately reasserting the interrupt line.
>> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
>> > need to restrict ourselves to one device (i.e. faulting stream) per
>> > domain (CB).
>> >
>> > I think that means we want both specific compatible strings to identify
>> > the SS bit behaviour, but also a way to opt-in for the stall model as a
>> > separate property to indicate that the SoC integration can support this
>> > without e.g. deadlocking.
>> >
>>
>> How do you feel about a short-term step to keep things in irq context,
>> but enable stall mode?  I'm debugging an issue, where it appears that
>> the gpu cannot handle a non-stalled fault (triggers some fairly
>> bizarre follow-on problems).  So I think even if we don't add a fault
>> handler callback, we still want to set the CFCFG bit and
>> RESUME_TERMINATE in the fault handler on this hardware.
>
> Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
> I'd rather we bite the bullet and implement things properly on top of
> a workqueue so that you can build on the same basic infrastructure as
> the SVM work that Jean-Philippe is looking at, particular as you also
> have a use-case for running fault code in non-atomic context.
>

So, it seems like setting either CFCFG or HUPCF (which is what
downstream android kernel apparently does) avoids this issue.

It seems like some sort of hw bug, but not sure if in the iommu or the
gpu.. you'd probably have to ask someone @qcom about the details, but
without setting either of these bits, it seems that other concurrent
memory transactions to the one triggering fault and up returning bogus
values to the GPU.  So the CP (which is reading cmdstream somewhat far
ahead of the DRAW or COMPUTE shader that triggers a fault) ends up
reading bogus cmdstream values and triggers all sorts of spectacular
fireworks.

BR,
-R
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html