Quoting Jeff McGee (2017-08-29 18:01:47) > On Tue, Aug 29, 2017 at 04:17:46PM +0100, Chris Wilson wrote: > > Quoting Jeff McGee (2017-08-29 16:04:17) > > > On Tue, Aug 29, 2017 at 10:07:18AM +0100, Chris Wilson wrote: > > > > Quoting Jeff McGee (2017-08-28 21:18:44) > > > > > On Mon, Aug 28, 2017 at 08:44:48PM +0100, Chris Wilson wrote: > > > > > > Quoting jeff.mcgee@xxxxxxxxx (2017-08-28 20:25:30) > > > > > > > From: Jeff McGee <jeff.mcgee@xxxxxxxxx> > > > > > > > > > > > > > > If someone else is resetting the engine we should clear our own bit as > > > > > > > part of skipping that engine. Otherwise we will later believe that it > > > > > > > has not been reset successfully and then trigger full gpu reset. If the > > > > > > > other guy's reset actually fails, he will trigger the full gpu reset. > > > > > > > > > > > > The reason we did continue on to the global reset was to serialise > > > > > > i915_handle_error() with the other thread. Not a huge issue, but a > > > > > > reasonable property to keep -- and we definitely want a to explain why > > > > > > only one reset at a time is important. > > > > > > > > > > > > bool intel_engine_lock_reset() { > > > > > > if (!test_and_set_bit(I915_RESET_ENGINE + engine->id, > > > > > > &engine->i915->gpu_error.flags)) > > > > > > return true; > > > > > > > > > > > > intel_engine_wait_for_reset(engine); > > > > > The current code doesn't wait for the other thread to finish the reset, but > > > > > this would add that wait. > > > > > > > > Pardon? If we can't reset the engine, we go to the full reset which is > > > > serialised, both with individual engine resets and other globals. > > > > > > > > > Did you intend that as an additional change to > > > > > the current code? I don't think it is necessary. Each thread wants to > > > > > reset some subset of engines, so it seems the thread can safely exit as soon > > > > > as it knows each of those engines has been reset or is being reset as part > > > > > of another thread that got the lock first. If any of the threads fail to > > > > > reset an engine they "own", then full gpu reset is assured. > > > > > > > > It's unexpected for this function to return before the reset. > > > > -Chris > > > > > > I'm a bit confused, so let's go back to the original code that I was trying > > > to fix: > > > > > > > > > /* > > > * Try engine reset when available. We fall back to full reset if > > > * single reset fails. > > > */ > > > if (intel_has_reset_engine(dev_priv)) { > > > for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { > > > BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE); > > > if (test_and_set_bit(I915_RESET_ENGINE + engine->id, > > > &dev_priv->gpu_error.flags)) > > > continue; > > > > > > if (i915_reset_engine(engine, 0) == 0) > > > engine_mask &= ~intel_engine_flag(engine); > > > > > > clear_bit(I915_RESET_ENGINE + engine->id, > > > &dev_priv->gpu_error.flags); > > > wake_up_bit(&dev_priv->gpu_error.flags, > > > I915_RESET_ENGINE + engine->id); > > > } > > > } > > > > > > if (!engine_mask) > > > goto out; > > > > > > /* Full reset needs the mutex, stop any other user trying to do so. */ > > > > > > Let's say that 2 threads are here intending to reset render. #1 gets the lock > > > and starts the render engine-only reset. #2 fails to get the lock which implies > > > that someone else is in the process of resetting the render engine (with single > > > engine reset or full gpu reset). #2 continues on without waiting but doesn't > > > clear the render bit in engine_mask. So #2 will proceed to initiate a full > > > gpu reset when it may not be necessary. That's the problem I was trying > > > to address with my initial patch. Do you agree that #2 must clear this bit > > > to avoid always triggering full gpu reset? If the engine-only reset done by > > > #1 fails, #1 will do the fallback to full gpu reset, so there is no risk that > > > we would miss the full gpu reset if it is really needed. > > > > > > Then there is the question of whether #2 should wait around for the > > > render engine reset by #1 to complete. It doesn't in current code and I don't > > > see why it needs to. But that can be a separate discussion from the above. > > > > It very much does in the current code. If we can not do the per-engine > > reset, it falls back to the global reset. > > So are you saying that it is by design in this scenario that #2 will resort > to full gpu reset just because it wasn't the thread that actually performed > the engine reset, even though it can clearly infer based on the engine lock > being held that #1 is performing that reset for him? Yes, that wait was intentional. > > The global reset is serialised > > with itself and the per-engine resets. Ergo it waits, and that was > > intentional. > > -Chris > > Yes, the wait happens because #2 goes on to start full gpu reset which > requires all engine bits to be grabbed. My contention is that it should not > start full gpu reset. And that I am not disputing. Just that returning before the reset is complete changes the current contract. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx