Re: [CI-ping 15/15] drm/i915: Late request cancellations are harmful

On 22/04/16 23:57, John Harrison wrote:
On 21/04/2016 14:04, John Harrison wrote:
On 19/04/2016 13:35, Dave Gordon wrote:
On 13/04/16 15:21, John Harrison wrote:
On 13/04/2016 10:57, Daniel Vetter wrote:
On Tue, Apr 12, 2016 at 09:03:09PM +0100, Chris Wilson wrote:
Conceptually, each request is a record of a hardware transaction - we build up a list of pending commands and then either commit them to hardware, or cancel them. However, whilst building up the list of pending commands, we may modify state outside of the request and make references to the pending request. If we do so and then cancel that request, external objects then point to the deleted request, leading to both graphical and memory corruption.

The easiest example is to consider object/VMA tracking. When we mark an object as active in a request, we store a pointer to this, the most recent request, in the object. When we want to free that object, we wait for the most recent request to be idle before proceeding (otherwise the hardware will write to pages now owned by the system, or we will attempt to read from those pages before the hardware is finished writing). If the request was cancelled instead, that wait completes immediately. As a result, all requests must be committed and not cancelled if the external state is unknown.
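
To make the hazard concrete, here is a minimal userspace C sketch of the pattern described above; the struct and field names are illustrative stand-ins, not the driver's actual ones:

#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-ins for the driver's object-active tracking. */
struct request {
	bool completed;
};

struct object {
	struct request *last_write;	/* most recent request using the object */
};

static void mark_active(struct object *obj, struct request *rq)
{
	obj->last_write = rq;		/* external reference to the request */
}

static void object_free(struct object *obj)
{
	/*
	 * Wait for the most recent request to be idle before releasing
	 * the backing pages (this loop stands in for the real wait on
	 * the request).  If that request was cancelled and freed after
	 * mark_active(), this dereferences freed memory and the wait is
	 * meaningless, so the hardware may still touch pages that have
	 * been handed back to the system.
	 */
	while (obj->last_write && !obj->last_write->completed)
		;
	free(obj);
}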

All that remains of the i915_gem_request_cancel() users is just a couple of extremely unlikely allocation failures, so remove the API entirely.

A consequence of committing all incomplete requests is that we generate excess breadcrumbs and fill the ring much more often with dummy work. We have completely undone the outstanding_last_seqno optimisation.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Cc: John Harrison <John.C.Harrison@xxxxxxxxx>

I'd like John's ack on this one too, but the patch itself looks sound. Fast r-b since we've discussed this a while ago already ...

I think this is going to cause a problem with the scheduler. You are effectively saying that the execbuf call cannot fail beyond the point of allocating a request. If it gets that far then it must go all the way and submit the request to the hardware. With a scheduler, that means adding it to the scheduler's queues and tracking it all the way through the system to completion. If nothing else, that sounds like a lot of extra overhead for no actual work. Worse, if the failure occurs at a point where the request cannot be sent further through the system because something critical failed, then you are really stuffed.

I'm not sure what the other option would be though, short of being able
to undo the last read/write object tracking updates.

With the chained-ownership code we have in the scheduler, it becomes
perfectly possible to undo the last-read/write tracking changes.

I'd much rather see any failure during submission rewound and undone,
so we can just return -EAGAIN at any point and let someone retry if
required.
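
As a sketch of the transactional shape I mean - the undo bookkeeping here is hypothetical, and emit_to_hardware() is a made-up stand-in for the real emission step, not the scheduler code as it stands:

#include <errno.h>

/* Hypothetical shape of rewindable submission. */
struct request;

struct object {
	struct request *last_write;	/* last-write tracking, as above */
};

static int emit_to_hardware(struct request *rq)
{
	(void)rq;
	return 0;	/* pretend emission succeeded; made-up stand-in */
}

static int submit_with_rewind(struct object *obj, struct request *rq)
{
	struct request *old_last_write = obj->last_write;
	int ret;

	obj->last_write = rq;		/* tentative tracking update */

	ret = emit_to_hardware(rq);
	if (ret) {
		/* Rewind: external state is exactly as it was before. */
		obj->last_write = old_last_write;
		return -EAGAIN;
	}
	return 0;
}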

This just looks like a hack to work around not having a properly
transactional model of request submission :(

.Dave.

I was wondering whether it would be possible to delay the tracking updates until later in the execbuf process, i.e. only do them after all potential failure points. That would be a much simpler change than putting in chained ownership.

However, it seems that the patch has already been merged despite this
discussion and Daniel Vetter wanting an ack first? Is that correct?

John.

Dave Gordon and I had a look at the option of delaying the object tracking until beyond the point of last possible failure in the execbuf call path. As far as we can tell, it already is delayed. The object tracking update occurs in i915_gem_execbuffer_move_to_active(). That function cannot return a failure code and is immediately followed (in both LRC and legacy mode) by a call to i915_gem_execbuffer_retire_commands(), which is what flushes out the request to the hardware. So it would appear that this patch has no effect on object tracking within the execbuf code path. If a request cancellation code path was taken then the tracking would not have been updated, and so the cancelled request is irrelevant as nothing holds a reference to it. If the tracking was updated and the request is being referenced, then the request was also guaranteed to have been submitted and not cancelled.
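
In other words, the tail of the execbuf path looks roughly like this (heavily abridged from both submission paths; vmas and params are the surrounding execbuf state, and the wrapper function is just for illustration):

/*
 * Everything that can fail happens before this point; from here the
 * request always reaches the hardware.
 */
static void execbuf_tail_sketch(struct list_head *vmas,
				struct i915_execbuffer_params *params)
{
	/*
	 * Update the per-object last-read/last-write request pointers.
	 * Returns void, so no failure is possible once we get here.
	 */
	i915_gem_execbuffer_move_to_active(vmas, params->request);

	/*
	 * Emit the breadcrumb and flush the request out to the
	 * hardware, immediately after the tracking update.
	 */
	i915_gem_execbuffer_retire_commands(params);
}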

Either we are missing something major somewhere, or this patch cannot fix the stated bug in the stated manner.

I have tried running the failing test myself, but when I try to run the particular gem_concurrent_blit subtest it tells me that it requires more 'objects' and/or RAM than I have available. What does one need in order to run the test? The bug report also does not say whether it is a guaranteed failure every time or a sporadic, once-in-X-runs kind of failure.

Thanks,
John.

Maybe it's not a cancelled request at all, but a race where submission HAS been successful, and the object tracking HAS been updated, but the object is nonetheless not (yet) on the request list. That would leave the object's list-head UNINITIALISED, and hence crash on retiring :(

My one-line patch for that would fix the crash, but I'd still wonder how it got tracked and completed before getting onto the request list.

We could try initialising the list-head to POISON rather than empty, to clearly flag removing-before-adding-to-list cases.
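
Something like this, reusing the standard list poison values from <linux/poison.h> (an untested sketch; the helper name is made up):

#include <linux/list.h>
#include <linux/poison.h>

/*
 * Seed the link with the same poison values that list_del() leaves
 * behind, instead of an empty (self-linked) list head.  Retiring an
 * object that was never added to the request list then faults on the
 * poisoned pointers straight away, rather than silently "removing" an
 * empty entry or corrupting a live list.
 */
static inline void init_request_link_poisoned(struct list_head *entry)
{
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}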

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



