On Tue, May 19, 2020 at 2:26 AM Sai Prakash Ranjan
<saiprakash.ranjan@xxxxxxxxxxxxxx> wrote:
>
> Hi Will,
>
> On 2020-05-18 21:15, Will Deacon wrote:
> > On Mon, May 11, 2020 at 11:30:08AM -0600, Jordan Crouse wrote:
> >> On Fri, May 08, 2020 at 08:40:40AM -0700, Rob Clark wrote:
> >> > On Fri, May 8, 2020 at 8:32 AM Rob Clark <robdclark@xxxxxxxxx> wrote:
> >> > >
> >> > > On Thu, May 7, 2020 at 5:54 AM Will Deacon <will@xxxxxxxxxx> wrote:
> >> > > >
> >> > > > On Thu, May 07, 2020 at 11:55:54AM +0100, Robin Murphy wrote:
> >> > > > > On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote:
> >> > > > > > On 2020-04-22 01:50, Sai Prakash Ranjan wrote:
> >> > > > > > > Add stall implementation hook to enable stalling
> >> > > > > > > faults on QCOM platforms which support it without
> >> > > > > > > causing any kind of hardware mishaps. Without this
> >> > > > > > > on QCOM platforms, GPU faults can cause unrelated
> >> > > > > > > GPU memory accesses to return zeroes. This has the
> >> > > > > > > unfortunate result of command-stream reads from CP
> >> > > > > > > getting invalid data, causing a cascade of fail.
> >> > > > >
> >> > > > > I think this came up before, but something about this rationale
> >> > > > > doesn't add up - we're not *using* stalls at all, we're still
> >> > > > > terminating faulting transactions unconditionally; we're just
> >> > > > > using CFCFG to terminate them with a slight delay, rather than
> >> > > > > immediately. It's really not clear how or why that makes a
> >> > > > > difference. Is it a GPU bug? Or an SMMU bug? Is this reliable
> >> > > > > (or even a documented workaround for something), or might things
> >> > > > > start blowing up again if any other behaviour subtly changes?
> >> > > > > I'm not dead set against adding this, but I'd *really* like to
> >> > > > > have a lot more confidence in it.
> >> > > >
> >> > > > Rob mentioned something about the "bus returning zeroes" before,
> >> > > > but I agree that we need more information so that we can reason
> >> > > > about this and maintain the code as the driver continues to
> >> > > > change. That needs to be a comment in the driver, and I don't
> >> > > > think "but android seems to work" is a good enough justification.
> >> > > > There was some interaction with HUPCF as well.
> >> > >
> >> > > The issue is that there are multiple parallel memory accesses
> >> > > happening at the same time, for example CP (the cmdstream processor)
> >> > > will be reading ahead and setting things up for the next draw or
> >> > > compute grid, in parallel with some memory accesses from the shader
> >> > > which could trigger a fault. (And with faults triggered by something
> >> > > in the shader, there are *many* shader threads running in parallel so
> >> > > those tend to generate a big number of faults at the same time.)
> >> > >
> >> > > We need either CFCFG or HUPCF, otherwise what I have observed is that
> >> > > while the fault happens, CP's memory access will start returning
> >> > > zeroes instead of valid cmdstream data, which triggers a GPU hang. I
> >> > > can't say whether this is something unique to qcom's implementation
> >> > > of the smmu spec or not.
> >> > >
> >> > > *Often* a fault is the result of a usermode gl/vk/cl driver bug,
> >> > > although I don't think that is an argument against fixing this in the
> >> > > smmu driver.. I've been carrying around a local patch to set HUPCF
> >> > > for *years* because debugging usermode driver issues is so much
> >> > > harder without.
> >> > > But there are some APIs where faults can be caused by the user's
> >> > > app on top of the usermode driver.
> >> > >
> >> >
> >> > Also, I'll add to that, a big wish of mine is to have stall with the
> >> > ability to resume later from a wq context. That would enable me to
> >> > hook in the gpu crash dump handling for faults, which would make
> >> > debugging these sorts of issues much easier. I think I posted a
> >> > prototype of this quite some time back, which would schedule a worker
> >> > on the first fault (since there are cases where you see 1000's of
> >> > faults at once), which grabbed some information about the currently
> >> > executing submit and some gpu registers to indicate *where* in the
> >> > submit (a single submit could have 100's or 1000's of draws), and
> >> > then resumed the iommu cb.
> >> >
> >> > (This would ofc eventually be useful for svm type things.. I expect
> >> > we'll eventually care about that too.)
> >>
> >> Rob is right about HUPCF. Due to the parallel nature of the command
> >> processor there is always a very good chance that a CP access is
> >> somewhere in the bus, so any pagefault is usually a death sentence.
> >> The GPU context bank would always want HUPCF set to 1.
> >
> > So this sounds like an erratum to me, and I'm happy to set HUPCF if we
> > detect the broken implementation. However, it will need an entry in
> > Documentation/arm64/silicon-errata.rst and a decent comment in the
> > driver to explain what we're doing and why.
> >
>
> AFAIK there is no erratum documented internally for this behaviour, and
> it has existed since the MSM8996 SoC. Errata usually don't survive this
> long across generations of SoCs, and there is no point for us in
> disguising it.

possibly longer, qcom_iommu sets CFCFG..

> Is it OK if we clearly mention it as a "design limitation", or some
> other term which we can agree upon, along with the description which
> Rob and Jordan provided for setting HUPCF in the driver when we add the
> set_hupcf callback?

I'm not too picky on what we call it, but afaict it has been this way
since the beginning of time, rather than being specific to a certain SoC
or generation of SoCs. So it doesn't seem like the hw designers consider
it a bug. (I'm not sure what the expected behavior is.. nor if any other
SMMU implementation encounters this sort of situation..)

BR,
-R

>
> Thanks,
> Sai
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of Code Aurora Forum, hosted by The Linux Foundation
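
For reference, a minimal sketch of the SCTLR arrangement being discussed
here: a Qualcomm-specific hook that ORs HUPCF (and CFCFG, where stalling
is wanted) into the context-bank SCTLR value before it is written. This
is not the posted patch; the hook itself, the qcom_smmu wrapper and its
stall_enabled bitmap are assumptions for illustration, and
ARM_SMMU_SCTLR_HUPCF is assumed to be defined as BIT(8) alongside the
driver's existing SCTLR field definitions.

/* Sketch only, not the patch under discussion. */
static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu,
					 int idx, u32 reg)
{
	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);	/* assumed wrapper */

	/*
	 * HUPCF: let the GPU's other in-flight accesses (e.g. CP cmdstream
	 * reads) keep translating and returning valid data, instead of
	 * zeroes, while a context fault is outstanding.
	 */
	reg |= ARM_SMMU_SCTLR_HUPCF;

	/* CFCFG: stall (rather than terminate) only banks that opted in. */
	if (qsmmu->stall_enabled & BIT(idx))
		reg |= ARM_SMMU_SCTLR_CFCFG;

	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
}

Roughly speaking, HUPCF only affects transactions that translate
successfully while a fault is pending, which is why it addresses the
"CP reads return zeroes" symptom without changing how the faulting
transaction itself is terminated or reported.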
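Similarly, a rough shape of the stall-and-resume-from-workqueue idea Rob
describes above. All of the GPU-side names (my_gpu,
gpu_capture_crash_state, gpu_resume_stalled_context_bank) are
hypothetical, and the scheme assumes an SMMU driver that defers
clearing/resuming the stalled context bank until asked, which is exactly
the plumbing that does not exist yet at this point in the thread.

#include <linux/atomic.h>
#include <linux/iommu.h>
#include <linux/workqueue.h>

struct my_gpu {				/* hypothetical GPU driver state */
	struct workqueue_struct *wq;
	struct work_struct fault_work;
	atomic_t fault_pending;
};

static int gpu_smmu_fault_handler(struct iommu_domain *domain,
				  struct device *dev, unsigned long iova,
				  int flags, void *data)
{
	struct my_gpu *gpu = data;

	/*
	 * Thousands of faults can arrive back-to-back; only kick the worker
	 * for the first one, the rest stay stalled behind it.
	 */
	if (atomic_cmpxchg(&gpu->fault_pending, 0, 1) == 0)
		queue_work(gpu->wq, &gpu->fault_work);

	/*
	 * Return "handled"; the context bank is assumed to stay stalled
	 * until the worker resumes it.
	 */
	return 0;
}

static void gpu_fault_worker(struct work_struct *work)
{
	struct my_gpu *gpu = container_of(work, struct my_gpu, fault_work);

	/* Snapshot the currently executing submit and relevant registers. */
	gpu_capture_crash_state(gpu);

	/* Hypothetical hook: resume (or terminate) the stalled context bank. */
	gpu_resume_stalled_context_bank(gpu);

	atomic_set(&gpu->fault_pending, 0);
}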