We need to land this patch. The number of bugs we have piling up in
Mesa gitlab related to this is getting a lot larger than I'd like. I've
gone back and forth with various HW and SW people internally for
countless e-mail threads and there is no other good workaround. Yes,
the perf hit to atomics sucks but, fortunately, most games don't use
them heavily enough for it to make a significant impact. We should just
eat the perf hit and fix the hangs.

Reviewed-by: Jason Ekstrand <jason@xxxxxxxxxxxxxx>

--Jason

On Wed, Jul 24, 2019 at 3:02 PM Francisco Jerez <currojerez@xxxxxxxxxx> wrote:
>
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>
> > Quoting Francisco Jerez (2019-07-23 23:19:13)
> >> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
> >>
> >> > Quoting Tvrtko Ursulin (2019-07-22 12:41:36)
> >> >>
> >> >> On 20/07/2019 15:31, Chris Wilson wrote:
> >> >> > Enabling atomic operations in L3 leads to unrecoverable GPU hangs, as
> >> >> > the machine stops responding milliseconds after receipt of the reset
> >> >> > request [GDRT]. By disabling the cached atomics, the hang does not occur
> >> >> > and we presume the GPU would reset normally for similar hangs.
> >> >> >
> >> >> > Reported-by: Jason Ekstrand <jason@xxxxxxxxxxxxxx>
> >> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
> >> >> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >> >> > Cc: Jason Ekstrand <jason@xxxxxxxxxxxxxx>
> >> >> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> >> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> >> >> > ---
> >> >> > Jason reports that Windows is not clearing L3SQCREG4:22 and does not
> >> >> > suffer the same GPU hang so it is likely some other w/a that interacts
> >> >> > badly. Fwiw, these 3 are the only registers I could find that mention
> >> >> > atomic ops (and appear to be part of the same chain for memory access).
> >> >>
> >> >> Bit-toggling itself looks fine to me and matches what I could find in
> >> >> the docs. (All three bits across three registers should be equal.)
> >> >>
> >> >> What I am curious about is what are the other consequences of disabling
> >> >> L3 atomics? Performance drop somewhere?
> >> >
> >> > The test I have where it goes from dead to passing, that's a considerable
> >> > performance improvement ;)
> >> >
> >> > I imagine not being able to use L3 for atomics is pretty dire, whether that
> >> > has any impact, I have no clue.
> >> >
> >> > It is still very likely that we see this because we are doing something
> >> > wrong elsewhere.
> >>
> >> This reminds me of f3fc4884ebe6ae649d3723be14b219230d3b7fd2 followed by
> >> d351f6d94893f3ba98b1b20c5ef44c35fc1da124 due to the massive impact (of
> >> the order of 20x IIRC) using the L3 turned out to have on the
> >> performance of HDC atomics, on at least that platform. It seems
> >> unfortunate that we're going to lose L3 atomics on Gen9 now, even though
> >> it's only buffer atomics which are broken IIUC, and even though the
> >> Windows driver is somehow getting away without disabling them. Some of
> >> our setup must be wrong either in the kernel or in userspace... Are
> >> these registers at least whitelisted so userspace can re-enable L3
> >> atomics once the problem is addressed? Wouldn't it be a more specific
> >> workaround for userspace to simply use a non-L3-cacheable MOCS for
> >> (rarely used) buffer surfaces, so it could benefit from L3 atomics
> >> elsewhere?
> >
> > If it was the case that disabling L3 atomics was the only way to prevent
> > the machine lockup under this scenario, then I think it is
> > unquestionably the right thing to do, and we could not leave it to
> > userspace to dtrt. We should never add non-context saved unsafe
> > registers to the whitelist (if setting a register may cause data
> > corruption or worse in another context/process, that is bad) despite our
> > repeated transgressions. However, there's no evidence to say that it does
> > prevent the machine lockup as it prevents the GPU hang that led to the
> > lockup on reset.
> >
> > Other than GPGPU requiring a flush around every sneeze, I did not see
> > anything in the gen9 w/a list that seemed like a match. Nevertheless, I
> > expect there is a more precise w/a than a blanket disable.
> > -Chris
>
> Supposedly there is a more precise one (setting the surface state MOCS
> to UC for buffer images), but it relies on userspace doing the right
> thing for the machine not to lock up. There is a good chance that the
> reason why L3 atomics hang on such buffers is ultimately under userspace
> control, in which case we'll eventually have to undo the programming
> done in this patch in order to re-enable L3 atomics once the problem is
> addressed. That means that userspace will have the freedom to hang the
> machine hard once again, which sounds really bad, but it's no real news
> for us (*cough* HSW *cough*), and it might be the only way to match the
> performance of the Windows driver.
>
> What can we do here? Add an i915 option to enable performance features
> that can lead to the system hanging hard under malicious (or
> incompetent) userspace programming? Probably only the user can tell
> whether the trade-off between performance and security of the system is
> acceptable...
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
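
For readers following the thread, the bit-toggling under review (clear the L3-atomics enable bit in each of three registers, and keep all three bits equal) can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual i915 patch: the register file is simulated in a plain array, and apart from bit 22 of L3SQCREG4, which the thread mentions, the other two register names and their bit positions are hypothetical placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register file; a real driver would use its MMIO accessors. */
enum {
    REG_L3SQCREG4,       /* named in the thread */
    REG_HYPOTHETICAL_A,  /* placeholder, not a real register name */
    REG_HYPOTHETICAL_B,  /* placeholder, not a real register name */
    REG_COUNT
};

struct atomics_wa {
    int reg;       /* index into the simulated register file */
    uint32_t bit;  /* mask of the L3-atomics enable bit in that register */
};

static const struct atomics_wa wa_table[] = {
    { REG_L3SQCREG4,      UINT32_C(1) << 22 }, /* bit 22, per the thread */
    { REG_HYPOTHETICAL_A, UINT32_C(1) << 0 },  /* assumed bit position */
    { REG_HYPOTHETICAL_B, UINT32_C(1) << 0 },  /* assumed bit position */
};

#define WA_COUNT (sizeof(wa_table) / sizeof(wa_table[0]))

/* True when the enable bit has the same state in all three registers. */
static bool l3_atomics_bits_agree(const uint32_t *regs)
{
    bool first = (regs[wa_table[0].reg] & wa_table[0].bit) != 0;

    for (size_t i = 1; i < WA_COUNT; i++) {
        bool cur = (regs[wa_table[i].reg] & wa_table[i].bit) != 0;
        if (cur != first)
            return false;
    }
    return true;
}

/* Clear only the enable bit in each register, leaving other bits intact. */
static void disable_l3_atomics(uint32_t *regs)
{
    for (size_t i = 0; i < WA_COUNT; i++)
        regs[wa_table[i].reg] &= ~wa_table[i].bit;

    /* After the masked clears, all three bits are off and so agree. */
    assert(l3_atomics_bits_agree(regs));
}
```

Keeping the three (register, bit) pairs in one table is just a convenience for the sketch: the clear and the consistency check then walk the same list, so a bit cannot be toggled in one register and forgotten in another.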