Unfortunately dinq is not working on my IVB at this moment, so I was
forced to base these patches on din; that's why I've added Chris' patch
to the series manually.

Regarding whether or not to actually upstream these patches, I think it
would be awesome if distros could let us know how interested they are in
incorporating this. It is of particular use for any applications using
the GPU for compute. Even if distros don't want it, having the
uevent/interrupt is nice to incorporate, but I would think twice about
the sysfs interface.

Now for the explanation (you may want to get a coffee first):

Internal to the GPU is a cache referred to in docs as L3. The smallest
addressable unit of the cache is called a row. There are x rows in each
subbank, and y subbanks in each of the z banks. HW provides two extra
rows per subbank, and a software mechanism to remap these rows. The
addressing after remapping is handled transparently to software. There
is also an interrupt generated by the render CS to tell us when a parity
error occurs in one of the rows.

There is one portion currently unimplemented in the series: we are
required to issue a GPU reset before we remap a row. The documents I
have do not make it clear *exactly* why the GPU reset must occur, but I
believe that, similar to Linux, it is the Windows mechanism for telling
GPU clients that whatever work they've submitted needs to be
resubmitted.

There are various clients which use the L3; however, none of these
should be utilized during simple modeset/fbcon. Therefore, I believe the
following algorithm is guaranteed to work:

  1. On boot check some non-volatile storage for bad r/b/s
  2. load i915
  3. disable bad r/b/s ASAP
  4. Wait forever for uevent of bad r/b/s
  5. store r/b/s in some non-volatile storage
  6. reboot; goto 1

If we had the reset working, we could avoid the reboot, and instead do:

  1. On boot check some non-volatile storage for bad r/b/s
  2. load i915
  3. disable bad r/b/s ASAP
  4. Wait forever for uevent of bad r/b/s
  5. store r/b/s in some non-volatile storage
  6. gpu reset; goto 3

The reset is essentially used to "automatically" make all GPU clients
aware that they may need to resubmit their data.

The problem with algorithm #2 without the reset is that there is no way
(afaict) to map the r/b/s to a BO, and so we have no way to even figure
out if the bad data was propagated to the BO. So an alternative to the
reset is that when system software detects the uevent, it sends a signal
to all known (or compute-based) GPU clients.

See the intel-gpu-tools app as a reference for how to use the sysfs
interface.

Ben Widawsky (4):
  drm/i915: Dynamic Parity Detection handling
  drm/i915: enable parity error interrupts
  drm/i915: remap l3 on hw init
  drm/i915: l3 parity sysfs interface

Chris Wilson (1):
  drm/i915: Use a global lock for modifying global irq flags

 drivers/gpu/drm/i915/i915_drv.h         |    5 ++
 drivers/gpu/drm/i915/i915_gem.c         |   26 +++++++
 drivers/gpu/drm/i915/i915_irq.c         |   87 ++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h         |   20 +++++
 drivers/gpu/drm/i915/i915_sysfs.c       |  128 ++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c |   45 +++++++----
 drivers/gpu/drm/i915/intel_ringbuffer.h |    3 +-
 7 files changed, 293 insertions(+), 21 deletions(-)

-- 
1.7.10
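
For anyone wanting to prototype the userspace half of the algorithms
described in this cover letter (steps 4 and 5: wait for the uevent, then
store the bad r/b/s), here is a rough Python sketch. It only covers
parsing an event payload and persisting the location. The uevent key
names (L3_PARITY_ERROR, ROW, BANK, SUBBANK) and the JSON store are
assumptions made for illustration and are not taken from this text;
check the patches and the intel-gpu-tools app for the real interface.
Receiving the event itself (e.g. via udev) and the remap through sysfs
are deliberately left out.

```python
import json
from pathlib import Path


def parse_uevent(payload: str) -> dict:
    """Split a NUL- or newline-separated uevent payload into KEY=VALUE pairs."""
    fields = {}
    for item in payload.replace("\0", "\n").splitlines():
        key, sep, value = item.partition("=")
        if sep:  # skip entries without '=' (e.g. a header line)
            fields[key] = value
    return fields


def bad_location(fields: dict):
    """Return (row, bank, subbank) for a parity-error event, else None.

    The key names are hypothetical placeholders for whatever the driver emits.
    """
    if fields.get("L3_PARITY_ERROR") != "1":
        return None
    return (int(fields["ROW"]), int(fields["BANK"]), int(fields["SUBBANK"]))


def load_bad(store: Path) -> set:
    """Steps 1/3: read the known-bad r/b/s set from non-volatile storage."""
    if not store.exists():
        return set()
    return {tuple(entry) for entry in json.loads(store.read_text())}


def record_bad(store: Path, location: tuple) -> bool:
    """Step 5: persist a new bad r/b/s; returns False if it was already known."""
    known = load_bad(store)
    if location in known:
        return False
    known.add(location)
    store.write_text(json.dumps(sorted(known)))
    return True
```

A daemon built on this would call load_bad() at boot to decide what to
disable, then loop on incoming events, calling record_bad() before
triggering the reboot (or, once it works, the GPU reset).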