[PATCH 0/5] Dynamic Parity Detection/Correction

ben at bwidawsk.net (Ben Widawsky) · Fri, 27 Apr 2012 17:40:16 -0700

Unfortunately dinq is not working on my IVB at the moment, so I was forced to
base these patches on din; that's why I've added Chris' patch to the series
manually.

Regarding whether or not to actually upstream these patches, I think it
would be awesome if distros could let us know how interested they are in
incorporating this. It is of particular use for any applications using
the GPU for compute. Even if distros don't want it, having the
uevent/interrupt would be nice to incorporate, but I would think twice
about the sysfs interface.

Now for the explanation (you may want to get a coffee first):
Internal to the GPU is a cache referred to in docs as L3. The smallest
unit of the cache which is addressable is called a row. There are x rows
in each subbank, and y subbanks in each of the z banks.

HW provides two extra rows per subbank, and a software mechanism to
remap these rows. The addressing after remapping is handled
transparently to software. There is also an interrupt generated by the
render CS to tell us when a parity error occurs in one of the rows.
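
As a rough illustration of the bookkeeping this implies (not the driver
code), here is a sketch of handing out the two spare rows in one subbank.
The class and method names are made up for the example, and the slot
numbering is an assumption; the real remapping is done through hardware
registers.

```python
from dataclasses import dataclass, field

SPARE_ROWS_PER_SUBBANK = 2  # "two extra rows per subbank", per the text above


@dataclass
class Subbank:
    """Tracks which bad rows in one subbank were given a spare (hypothetical model)."""
    remap: dict = field(default_factory=dict)  # bad row -> spare slot index

    def remap_row(self, row):
        """Assign a spare slot to `row`; return the slot, or None if none are left."""
        if row in self.remap:
            return self.remap[row]      # already remapped, nothing to do
        if len(self.remap) >= SPARE_ROWS_PER_SUBBANK:
            return None                 # both spares are in use
        slot = len(self.remap)
        self.remap[row] = slot
        return slot


sb = Subbank()
first = sb.remap_row(17)   # takes spare slot 0
again = sb.remap_row(17)   # same row again: still slot 0
second = sb.remap_row(42)  # takes spare slot 1
third = sb.remap_row(99)   # no spares left
```

The point of the sketch is only that a third bad row in the same subbank
cannot be remapped, so whatever policy sits on top has to cope with that.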

There is one portion currently unimplemented in the series: we are
required to issue a GPU reset before we remap a row. The documents I
have do not make it clear *exactly* why the GPU reset must occur, but I
believe that, similar to Linux, it is the Windows mechanism for telling
GPU clients that whatever work they've submitted needs to be
resubmitted.

There are various clients which use the L3, however none of these should
be utilized during simple modeset/fbcon.  Therefore, I believe the
following algorithm is guaranteed to work:
1. On boot check some non-volatile storage for bad r/b/s (row/bank/subbank)
2. load i915
3. disable bad r/b/s ASAP
4. Wait forever for uevent of bad r/b/s
5. store r/b/s in some non-volatile storage
6. reboot; goto 1
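
The storage side of steps 1 and 5 could look roughly like the sketch
below. The JSON file format, the file name, and the function names are
all assumptions made for illustration; nothing in the series dictates
how userspace persists the list.

```python
import json
import os
import tempfile


def load_bad_rbs(path):
    """Step 1: read previously recorded bad (row, bank, subbank) tuples."""
    try:
        with open(path) as f:
            return {tuple(entry) for entry in json.load(f)}
    except FileNotFoundError:
        return set()  # first boot: nothing recorded yet


def store_bad_rbs(path, bad):
    """Step 5: write the list via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(bad), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old file


def handle_parity_event(path, rbs):
    """Steps 4-5: record a newly reported location; the caller then reboots."""
    bad = load_bad_rbs(path)
    bad.add(tuple(rbs))
    store_bad_rbs(path, bad)
    return bad


# Simulate two boots against a scratch file (hypothetical name).
state_file = os.path.join(tempfile.mkdtemp(), "l3_bad_rbs.json")
boot1 = load_bad_rbs(state_file)            # nothing to disable yet
handle_parity_event(state_file, (3, 1, 0))  # uevent arrives, location saved
boot2 = load_bad_rbs(state_file)            # after "reboot": input to step 3
```

Writing to a temporary file and renaming matters here because step 6 is
a reboot, so the write is likely to race with shutdown.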

If we had the reset working, we could avoid the reboot, and instead do:
1. On boot check some non-volatile storage for bad r/b/s (row/bank/subbank)
2. load i915
3. disable bad r/b/s ASAP
4. Wait forever for uevent of bad r/b/s
5. store r/b/s in some non-volatile storage
6. gpu reset; goto 3
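
To make the control flow of the reset variant concrete, here is a small
simulation. The uevents, the storage, the disable step, and the reset
are all stand-ins passed in as plain Python objects; none of these names
correspond to a real interface.

```python
def run_with_reset(events, stored, disable, gpu_reset):
    """Process parity events without rebooting.

    events:    iterable of (row, bank, subbank) tuples standing in for uevents
    stored:    set standing in for non-volatile storage (mutated in place)
    disable:   callback that disables one bad location (step 3)
    gpu_reset: callback that performs the reset (step 6)
    """
    for rbs in sorted(stored):        # step 3, after steps 1-2
        disable(rbs)
    for rbs in events:                # step 4
        stored.add(rbs)               # step 5
        gpu_reset()                   # step 6
        for known in sorted(stored):  # goto 3
            disable(known)


log = []
stored = {(1, 0, 0)}
run_with_reset(
    events=[(5, 1, 2)],
    stored=stored,
    disable=lambda rbs: log.append(("disable", rbs)),
    gpu_reset=lambda: log.append(("reset",)),
)
```

Note that the reset happens before the locations are disabled again,
matching the requirement that a reset precede any remap.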

The reset is essentially used to "automatically" make all GPU clients
aware that they may need to resubmit their data. The problem with
algorithm #2 without the reset is that there is no way (afaict) to map
the r/b/s to a BO, and so we have no way to even figure out if the bad
data was propagated to the BO. So an alternative to the reset is for
system software, on detecting the uevent, to send a signal to all known
(or compute-based) GPU clients.
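
A sketch of that alternative, assuming system software already has a
list of client pids (how it gets that list is left open) and assuming
SIGUSR1 as the signal, which is purely a placeholder choice:

```python
import signal


def notify_clients(pids, send, sig=signal.SIGUSR1):
    """Send `sig` to each client pid; return the pids that no longer exist."""
    failed = []
    for pid in pids:
        try:
            send(pid, sig)
        except ProcessLookupError:
            failed.append(pid)  # client exited before we could signal it
    return failed


# Demo with a fake sender so nothing is actually signalled.
sent = []


def fake_send(pid, sig):
    if pid == 222:
        raise ProcessLookupError
    sent.append((pid, sig))


failed = notify_clients([111, 222, 333], fake_send)
```

In a real daemon `send` would be `os.kill`; it is a parameter here only
so the logic can be exercised without touching live processes.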

See the intel-gpu-tools app as a reference for how to use the sysfs
interface.

Ben Widawsky (4):
  drm/i915: Dynamic Parity Detection handling
  drm/i915: enable parity error interrupts
  drm/i915: remap l3 on hw init
  drm/i915: l3 parity sysfs interface

Chris Wilson (1):
  drm/i915: Use a global lock for modifying global irq flags

 drivers/gpu/drm/i915/i915_drv.h         |    5 ++
 drivers/gpu/drm/i915/i915_gem.c         |   26 +++++++
 drivers/gpu/drm/i915/i915_irq.c         |   87 ++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h         |   20 +++++
 drivers/gpu/drm/i915/i915_sysfs.c       |  128 ++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c |   45 +++++++----
 drivers/gpu/drm/i915/intel_ringbuffer.h |    3 +-
 7 files changed, 293 insertions(+), 21 deletions(-)

-- 
1.7.10