Quoting Tvrtko Ursulin (2019-01-31 13:19:31)
>
> On 30/01/2019 02:19, Chris Wilson wrote:
> > Having introduced per-context seqno, we now have a means to identify
> > progress across the system without fear of rollback as befell the
> > global_seqno. That is, we can program a MI_SEMAPHORE_WAIT operation in
> > advance of submission safe in the knowledge that our target seqno and
> > address are stable.
> >
> > However, since we are telling the GPU to busy-spin on the target address
> > until it matches the signaling seqno, we only want to do so when we are
> > sure that busy-spin will be completed quickly. To achieve this we only
> > submit the request to HW once the signaler is itself executing (modulo
> > preemption causing us to wait longer), and we only do so for default and
> > above priority requests (so that idle priority tasks never themselves
> > hog the GPU waiting for others).
>
> It could be milliseconds though. I think apart from media-bench saying
> this is faster, we would need to look at performance per Watt as well.

All throughput measurements are substantially faster, as you would
expect, and inter-engine latency decreased. I would hope it would
powergate/rc6 the EU while the CS was spinning, but I don't know :)

> RING_SEMA_WAIT_POLL is a potential tunable as well. Not that I have an
> idea how to tune it.
>
> Eventually, do we dare adding this without a runtime switch? (There, I
> mentioned the taboo.)

Yes :p

> What about signal mode and handling this via context switches?

That's 99% of the timeslicing scheduler right there -- the handling of
deferred work with the complication of it impacting other engines.
> > +static int
> > +emit_semaphore_wait(struct i915_request *to,
> > +		    struct i915_request *from,
> > +		    gfp_t gfp)
> > +{
> > +	u32 *cs;
> > +	int err;
> > +
> > +	GEM_BUG_ON(!from->timeline->has_initial_breadcrumb);
> > +
> > +	err = i915_timeline_read_lock(from->timeline, to);
> > +	if (err)
> > +		return err;
> > +
> > +	/*
> > +	 * If we know our signaling request has started, we know that it
> > +	 * must, at least, have passed its initial breadcrumb and that its
> > +	 * seqno can only increase, therefore any change in its breadcrumb
> > +	 * must indicate completion. By using a "not equal to start" compare
> > +	 * we avoid the murky issue of how to handle seqno wraparound in an
> > +	 * async environment (short answer, we must stop the world whenever
> > +	 * any context wraps!) as the likelihood of missing one request then
> > +	 * seeing the same start value for a new request is 1 in 2^31, and
> > +	 * even then we know that the new request has started and is in
> > +	 * progress, so we are sure it will complete soon enough (not to
> > +	 * worry about).
> > +	 */
> > +	if (i915_request_started(from)) {
> > +		cs = intel_ring_begin(to, 4);
> > +		if (IS_ERR(cs))
> > +			return PTR_ERR(cs);
> > +
> > +		*cs++ = MI_SEMAPHORE_WAIT |
> > +			MI_SEMAPHORE_GLOBAL_GTT |
> > +			MI_SEMAPHORE_POLL |
> > +			MI_SEMAPHORE_SAD_NEQ_SDD;
> > +		*cs++ = from->fence.seqno - 1;
> > +		*cs++ = from->timeline->hwsp_offset;
> > +		*cs++ = 0;
> > +
> > +		intel_ring_advance(to, cs);
> > +	} else {
> > +		int err;
> > +
> > +		err = i915_request_await_execution(to, from, gfp);
> > +		if (err)
> > +			return err;
> > +
> > +		cs = intel_ring_begin(to, 4);
> > +		if (IS_ERR(cs))
> > +			return PTR_ERR(cs);
> > +
> > +		/*
> > +		 * Using greater-than-or-equal here means we have to worry
> > +		 * about seqno wraparound. To side step that issue, we swap
> > +		 * the timeline HWSP upon wrapping, so that everyone listening
> > +		 * for the old (pre-wrap) values does not see the much smaller
> > +		 * (post-wrap) values than they were expecting (and so wait
> > +		 * forever).
> > +		 */
> > +		*cs++ = MI_SEMAPHORE_WAIT |
> > +			MI_SEMAPHORE_GLOBAL_GTT |
> > +			MI_SEMAPHORE_POLL |
> > +			MI_SEMAPHORE_SAD_GTE_SDD;
> > +		*cs++ = from->fence.seqno;
> > +		*cs++ = from->timeline->hwsp_offset;
> > +		*cs++ = 0;
> > +
> > +		intel_ring_advance(to, cs);
> > +	}
>
> Would it not work to have a single path which emits the wait on NEQ
> from->fence.seqno - 1, just i915_request_await_execution conditional on
> i915_request_started?
>
> In the !started case, having added the await, we would know the
> semaphore wait would not run until after the dependency has started, and
> NEQ would be true when it completes. The same as the above started path.

We may have previously submitted the signaler in a very long queue to
its engine so cannot determine its position, in which case we could
sample a long time before it even begins. Even if we launch both
requests on the different engines at the same time, we could sample
before the started semaphore.

I should remove the current NEQ path; it was written before I committed
myself to handling the HWSP across wraparound, and is now just needless
complication.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx