On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@xxxxxxx> wrote: > Hi Rob, > > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote: >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@xxxxxxx> wrote: >> > Enabling stalling faults can result in hardware deadlock on poorly >> > designed systems, particularly those with a PCI root complex upstream of >> > the SMMU. >> > >> > Although it's not really Linux's job to save hardware integrators from >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO >> > clients) from hosing the system for everybody else, even if they might >> > already be required to have elevated privileges. >> > >> > Given that the fault handling code currently executes entirely in IRQ >> > context, there is nothing that can sensibly be done to recover from >> > things like page faults anyway, so let's rip this code out for now and >> > avoid the potential for deadlock. >> >> so, I'd like to re-introduce this feature, I *guess* as some sort of >> opt-in quirk (ie. disabled by default unless something in DT tells you >> otherwise?? But I'm open to suggestions. I'm not entirely sure what >> hw was having problems due to this feature.) >> >> On newer snapdragon devices we are using arm-smmu for the GPU, and >> halting the GPU so the driver's fault handler can dump some GPU state >> on faults is enormously helpful for debugging and tracking down where >> in the gpu cmdstream the fault was triggered. In addition, we will >> eventually want the ability to update pagetables from fault handler >> and resuming the faulting transition. > > I'm not against reintroducing this, but it would certainly need to be > opt-in, as you suggest. If we want to try to use stall faults to enable > demand paging for DMA, then that means running core mm code to resolve > the fault and we can't do that in irq context. Consequently, we have to > hand this off to a thread, which means the hardware must allow the SS > bit to remain set without immediately reasserting the interrupt line. > Furthermore, we can't handle multiple faults on a context-bank, so we'd > need to restrict ourselves to one device (i.e. faulting stream) per > domain (CB). > > I think that means we want both specific compatible strings to identify > the SS bit behaviour, but also a way to opt-in for the stall model as a > separate property to indicate that the SoC integration can support this > without e.g. deadlocking. > How do you feel about a short-term step to keep things in irq context, but enable stall mode? I'm debugging an issue, where it appears that the gpu cannot handle a non-stalled fault (triggers some fairly bizarre follow-on problems). So I think even if we don't add a fault handler callback, we still want to set the CFCFG bit and RESUME_TERMINATE in the fault handler on this hardware. BR, -R -- To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html