Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > On Mon, Feb 13, 2017 at 11:56:46AM +0200, Mika Kuoppala wrote: >> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: >> >> > The signal threads may be running concurrently with the GPU reset. The >> > completion from the GPU run asynchronous with the reset and two threads >> > may see different snapshots of the state, and the signaler may mark a >> > request as complete as we try to reset it. We don't tolerate 2 different >> > views of the same state and complain if we try to mark a request as >> > failed if it is already complete. Disable the signal threads during >> > reset to prevent this conflict (even though the conflict implies that >> > the state we resetting to is invalid, we have already made our >> > decision!). >> > >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=99671 >> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> >> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> >> > --- >> > drivers/gpu/drm/i915/i915_gem.c | 16 +++++++++++++++- >> > drivers/gpu/drm/i915/intel_breadcrumbs.c | 3 +++ >> > 2 files changed, 18 insertions(+), 1 deletion(-) >> > >> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >> > index 48922ff454e6..95582295b219 100644 >> > --- a/drivers/gpu/drm/i915/i915_gem.c >> > +++ b/drivers/gpu/drm/i915/i915_gem.c >> > @@ -35,6 +35,7 @@ >> > #include "intel_frontbuffer.h" >> > #include "intel_mocs.h" >> > #include <linux/dma-fence-array.h> >> > +#include <linux/kthread.h> >> > #include <linux/reservation.h> >> > #include <linux/shmem_fs.h> >> > #include <linux/slab.h> >> > @@ -2643,6 +2644,17 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv) >> > for_each_engine(engine, dev_priv, id) { >> > struct drm_i915_gem_request *request; >> > >> > + /* Prevent the signaler thread from updating the request >> > + * state (by calling dma_fence_signal) as we are processing >> > + * the reset. The write from the GPU of the seqno is >> > + * asynchronous and the signaler thread may see a different >> > + * value to us and declare the request complete, even though >> > + * the reset routine have picked that request as the active >> > + * (incomplete) request. This conflict is not handled >> > + * gracefully! >> > + */ >> > + kthread_park(engine->breadcrumbs.signaler); >> > + >> >> I was wondering here if it would be best to disable the tasklet first. >> But stopping the car before we stop the engine seems wise thing to do. > > Note that we would have to do two passes over the engines to stop the > signalers, and even then we still have other sources (other devices) > that can cause tasklets to be invoked. There is not a foolproof answer > here, they each serve a different purpose and can be disabled > orthogonally. > >> Another unrelated thing is that we might be better off just to >> make find_active_request always seqno coherent and fork a >> noncoherent version for error capture. > > It's slightly more than that. In the worst case irq_barrier is not > sufficient, and a second call to active_request may return a different > victim - which again raises some awkwards issues in how we assign blame. > > At least we now do the irq_barrier hammer once at the start in reset_prepare, > so we should be better, but I'm wondering if we want to store the > request from prepare and then double check in the actual reset. > #1 store seqno from hangcheck #2 get mutex for reset #3 barrier #4 find_request (only once) #5 on prepare path, check the submachinery against this req and if inconsistent, queue hangcheck and return from prepare without resetting. ? -Mika _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx