On Mon, Apr 11, 2016 at 02:50:17PM +0100, Tvrtko Ursulin wrote: > > On 09/04/16 10:27, Chris Wilson wrote: > >Conceptually, each request is a record of a hardware transaction - we > >build up a list of pending commands and then either commit them to > >hardware, or cancel them. However, whilst building up the list of > >pending commands, we may modify state outside of the request and make > >references to the pending request. If we do so and then cancel that > >request, external objects then point to the deleted request leading to > >both graphical and memory corruption. > > > >The easiest example is to consider object/VMA tracking. When we mark an > >object as active in a request, we store a pointer to this, the most > >recent request, in the object. Then we want to free that object, we wait > >for the most recent request to be idle before proceeding (otherwise the > >hardware will write to pages now owned by the system, or we will attempt > >to read from those pages before the hardware is finished writing). If > >the request was cancelled instead, that wait completes immediately. As a > >result, all requests must be committed and not cancelled if the external > >state is unknown. > > This was a bit hard to figure out. > > So we cannot unwind because once we set last_read_req we lose the > data on what was the previous one, before this transaction started? > > Intuitively I don't like the idea of sending unfinished stuff to the > GPU, when it failed at some random point in ring buffer preparation. Yes, but it is not unfinished though. The mm-switch is made and flagged as completed, etc. The request does contain instructions for the state changes it has made so far. > So I am struggling with reviewing this as I have in the previous round. > > >All that remains of i915_gem_request_cancel() users are just a couple of > >extremely unlikely allocation failures, so remove the API entirely. > > This parts feels extra weird because in the non-execbuf cases we > actually can cancel the transaction without any issues, correct? Nope. Same problem arises in that they do not know what happens underneath calls to e.g. object_sync. Before cancellation you have to inspect every path now and in the future to be sure that no state was modified inside the request. > Would middle-ground be to keep the cancellations for in-kernel > submits, and for execbuf rewind the ringbuf so only request > post-amble is sent to the GPU? The only place that knows whether an external observer is the request code. I am thinking about doing the cancellation there, I just need to check that the lockless idling will be ok with that. > >@@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev) > > return PTR_ERR(req); > > > > ret = i915_switch_context(req); > >- if (ret) { > >- i915_gem_request_cancel(req); > >- return ret; > >- } > >- > > i915_add_request_no_flush(req); > >+ if (ret) > >+ return ret; > > Looks like with this it could execute the context switch on the GPU > but not update the engine->last_context in do_switch(). Hmm, that problem is present in the current code - requests are not unwound, just deleted. We need to rearrange switch_context after the mi_set_context() to ensure that on the subsequent call, the context is reloaded. -Chris -- Chris Wilson, Intel Open Source Technology Centre -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html