Quoting Tvrtko Ursulin (2019-11-20 12:40:13) > > On 20/11/2019 12:07, Chris Wilson wrote: > > Quoting Tvrtko Ursulin (2019-11-20 11:58:27) > >> > >> On 20/11/2019 09:32, Chris Wilson wrote: > >>> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to > >>> the backend"), I erroneously concluded that we last modify the engine > >>> inside __i915_request_commit() meaning that we could enable concurrent > >>> submission for userspace as we enqueued this request. However, this > >>> falls into a trap with other users of the engine->kernel_context waking > >>> up and submitting their request before the idle-switch is queued, with > >>> the result that the kernel_context is executed out-of-sequence most > >>> likely upsetting the GPU and certainly ourselves when we try to retire > >>> the out-of-sequence requests. > >>> > >>> As such we need to hold onto the effective engine->kernel_context mutex > >>> lock (via the engine pm mutex proxy) until we have finish queuing the > >>> request to the engine. > >>> > >>> v2: Serialise against concurrent intel_gt_retire_requests() > >>> v3: Describe the hairy locking scheme with intel_gt_retire_requests() > >>> for future reference. > >>> v4: Combine timeline->lock and engine pm release; it's hairy. > >>> > >>> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend") > >>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >>> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> > >>> --- > >>> drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++---- > >>> 1 file changed, 40 insertions(+), 7 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c > >>> index 3c0f490ff2c7..1f517357a268 100644 > >>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c > >>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c > >>> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce, > >>> > >>> #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */ > >>> > >>> +static void > >>> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl, > >>> + struct intel_engine_cs *engine) > >>> +{ > >>> + struct intel_gt_timelines *timelines = &engine->gt->timelines; > >>> + > >>> + spin_lock(&timelines->lock); > >>> + > >>> + if (!atomic_fetch_inc(&tl->active_count)) > >>> + list_add_tail(&tl->link, &timelines->active_list); > >> > >> Hmm with these new part it maybe matches/answers my question from > >> "drm/i915/gt: Close race between engine_park and > >> intel_gt_retire_requests". I think at least. Since it now adds a second > >> place timeline can enter the active_list. > >> > >> But no, where does the intel_timeline_enter race come from? Can't be > >> userspace submission since they are blocked on wf->lock. > > > > The race is not just with intel_timeline_enter, but with > > intel_gt_retire_requests. > > > > As soon as we are on that list, we may be retired. If we are retired > > before adjusting the engine->wakeref.count, we are b0rked. > > > > As soon as we adjust the engine->wakeref.count, another submission may > > call intel_timeline_enter, and again may even retire this request. The > > enter itself is perfectly fine, but we need to serialise against the > > retires. > > I think the two patches combined work, I am just not sure two > atomic_fetch_inc()->list_add() are needed now that you re-ordered > __i915_requests_queue and __intel_wakeref_defer_park - that's the part > which is confusing me. But it also doesn't harm... I tried to get away with not, but the selftests hammer very heavily on the engine->kernel_context so we do encounter the scenarios where we are using the kernel_context to park on one cpu while submitting a new request on another. I would have got away with it but for these pesky selftests. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx