Re: [PATCH] drm/i915: Disable atomics in L3 for gen9

Jason Ekstrand <jason@xxxxxxxxxxxxxx> · Mon, 9 Nov 2020 13:52:26 -0600

We need to land this patch.  The number of bugs we have piling up in
Mesa gitlab related to this is getting a lot larger than I'd like.
I've gone back and forth with various HW and SW people internally for
countless e-mail threads and there is no other good workaround.  Yes,
the perf hit to atomics sucks but, fortunately, most games don't use
them heavily enough for it to make a significant impact.  We should
just eat the perf hit and fix the hangs.

Reviewed-by: Jason Ekstrand <jason@xxxxxxxxxxxxxx>

--Jason

On Wed, Jul 24, 2019 at 3:02 PM Francisco Jerez <currojerez@xxxxxxxxxx> wrote:
>
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>
> > Quoting Francisco Jerez (2019-07-23 23:19:13)
> >> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
> >>
> >> > Quoting Tvrtko Ursulin (2019-07-22 12:41:36)
> >> >>
> >> >> On 20/07/2019 15:31, Chris Wilson wrote:
> >> >> > Enabling atomic operations in L3 leads to unrecoverable GPU hangs, as
> >> >> > the machine stops responding milliseconds after receipt of the reset
> >> >> > request [GDRT]. By disabling the cached atomics, the hang do not occur
> >> >> > and we presume the GPU would reset normally for similar hangs.
> >> >> >
> >> >> > Reported-by: Jason Ekstrand <jason@xxxxxxxxxxxxxx>
> >> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
> >> >> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >> >> > Cc: Jason Ekstrand <jason@xxxxxxxxxxxxxx>
> >> >> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> >> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> >> >> > ---
> >> >> > Jason reports that Windows is not clearing L3SQCREG4:22 and does not
> >> >> > suffer the same GPU hang so it is likely some other w/a that interacts
> >> >> > badly. Fwiw, these 3 are the only registers I could find that mention
> >> >> > atomic ops (and appear to be part of the same chain for memory access).
> >> >>
> >> >> Bit-toggling itself looks fine to me and matches what I could find in
> >> >> the docs. (All three bits across three registers should be equal.)
> >> >>
> >> >> What I am curious about is what are the other consequences of disabling
> >> >> L3 atomics? Performance drop somewhere?
> >> >
> >> > The test I have where it goes from dead to passing, that's a considerable
> >> > performance improvement ;)
> >> >
> >> > I imagine not being able to use L3 for atomics is pretty dire, whether that
> >> > has any impact, I have no clue.
> >> >
> >> > It is still very likely that we see this because we are doing something
> >> > wrong elsewhere.
> >>
> >> This reminds me of f3fc4884ebe6ae649d3723be14b219230d3b7fd2 followed by
> >> d351f6d94893f3ba98b1b20c5ef44c35fc1da124 due to the massive impact (of
> >> the order of 20x IIRC) using the L3 turned out to have on the
> >> performance of HDC atomics, on at least that platform.  It seems
> >> unfortunate that we're going to lose L3 atomics on Gen9 now, even though
> >> it's only buffer atomics which are broken IIUC, and even though the
> >> Windows driver is somehow getting away without disabling them.  Some of
> >> our setup must be wrong either in the kernel or in userspace...  Are
> >> these registers at least whitelisted so userspace can re-enable L3
> >> atomics once the problem is addressed?  Wouldn't it be a more specific
> >> workaround for userspace to simply use a non-L3-cacheable MOCS for
> >> (rarely used) buffer surfaces, so it could benefit from L3 atomics
> >> elsewhere?
> >
> > If it was the case that disabling L3 atomics was the only way to prevent
> > the machine lockup under this scenario, then I think it is
> > unquestionably the right thing to do, and we could not leave it to
> > userspace to dtrt. We should never add non-context saved unsafe
> > registers to the whitelist (if setting a register may cause data
> > corruption or worse in another context/process, that is bad) despite our
> > repeated transgressions. However, there's no evidence to say that it does
> > prevent the machine lockup as it prevents the GPU hang that lead to the
> > lockup on reset.
> >
> > Other than GPGPU requiring a flush around every sneeze, I did not see
> > anything in the gen9 w/a list that seemed like a match. Nevertheless, I
> > expect there is a more precise w/a than a blanket disable.
> > -Chris
>
> Supposedly there is a more precise one (setting the surface state MOCS
> to UC for buffer images), but it relies on userspace doing the right
> thing for the machine not to lock up.  There is a good chance that the
> reason why L3 atomics hang on such buffers is ultimately under userspace
> control, in which case we'll eventually have to undo the programming
> done in this patch in order to re-enable L3 atomics once the problem is
> addressed.  That means that userspace will have the freedom to hang the
> machine hard once again, which sounds really bad, but it's no real news
> for us (*cough* HSW *cough*), and it might be the only way to match the
> performance of the Windows driver.
>
> What can we do here?  Add an i915 option to enable performance features
> that can lead to the system hanging hard under malicious (or
> incompetent) userspace programming?  Probably only the user can tell
> whether the trade-off between performance and security of the system is
> acceptable...
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx