On 07/18/2012 09:55 AM, Daniel Vetter wrote:
> On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
>> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
>>> - The "all contexts in a share group need to receive a reset notification"
>>> wording is irking me a bit because we currently only track all the
>>> actively used things. So if another context in that share group isn't
>>> affected (i.e. doesn't even use any of the potentially corrupted bos),
>>> is the idea that the kernel grows the required tracking, or should
>>> userspace always ask for the reset state of all contexts and do the
>>> math itself?
>>
>> There are a couple of reasons that all contexts in a share group need
>> to get the reset notification. Consider an application with two
>> contexts, A and B. Context A is a worker context that does a bunch
>> of render-to-texture operations, and context B will eventually
>> consume those textures. If context A receives a reset, context B,
>> even if it hasn't done any rendering in five minutes, has lost data.
>>
>> The kernel should never have any knowledge about GL share groups.
>> This is where my simplifying assumption (that all contexts in a
>> share group share an address space and an fd) comes in handy. If
>> context A queries the reset status from the kernel first, it can
>> reach over and poke the reset status of context B (in the gl_context
>> structure). Likewise, if context B queries the kernel first, it can
>> see that another kernel context in its GL context share group got
>> reset.
>>
>>> - The above essentially links in with how we assign guilt and figure
>>> out who's affected, especially when there's no hw context around, e.g.
>>> on the blitter or bsd rings, or when an object is used by another
>>> process and gets corrupted there. Does the spec make any guarantees for
>>> such a case? Or should we just blame an unknown_reset for all the
>>> contexts that belong to the fd that submitted the bad batch?
>>
>> That sounds right. If we can't assess innocence or guilt, we should
>> say guilt is unknown. There are some GL commands that generate
>> blits, but I don't think there's anything that can put commands on
>> the BSD ring. That's just for media, right?
>>
>>> As an idea for the above two issues, what about the kernel also keeping
>>> per-fd reset statistics, which would also be returned with this ioctl?
>>> We'd then restrict the meaning of the ctx fields to only mean "and the
>>> context was actually active". Userspace could then wrap all the per-fd
>>> hang reports into reset_unknown for arb_robustness, but I think this
>>> would be a bit more flexible for other userspace.
>>
>> Ah, right. So the DDX or libva could detect resets that affect
>> them. That's reasonable.
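Just to check that I understand the per-fd idea, below is roughly what I'd
picture a non-GL client like the DDX doing. The ioctl number and the exact
structure layout are made up for the sketch (reusing the per-fd fields from
your proposal); the only point is the "compare against the counts we saw
last time" logic.

    /* Sketch only: DRM_IOCTL_GET_RESET_COUNTS and struct drm_reset_counts
     * are placeholders for whatever the final interface looks like; the
     * three fields mirror the per-fd counters proposed below.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    #define DRM_IOCTL_GET_RESET_COUNTS  0   /* placeholder, not a real ioctl */

    struct drm_reset_counts {
        uint32_t fd_guilty;
        uint32_t fd_not_guilty;
        uint32_t fd_unknown_guilt;
    };

    /* Returns true if any batchbuffer submitted through this fd was hit
     * by a reset since the previous call.  'last' caches the counters
     * between calls.
     */
    static bool
    fd_lost_data_since(int fd, struct drm_reset_counts *last)
    {
        struct drm_reset_counts now;
        bool lost;

        if (ioctl(fd, DRM_IOCTL_GET_RESET_COUNTS, &now) != 0)
            return false;   /* old kernel: behave as if nothing happened */

        lost = now.fd_guilty != last->fd_guilty ||
               now.fd_not_guilty != last->fd_not_guilty ||
               now.fd_unknown_guilt != last->fd_unknown_guilt;

        *last = now;
        return lost;
    }

A GL driver would do the same comparison per context with the ctx_* fields,
and fold the per-fd result into reset_unknown as you suggest.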
>>> struct drm_context_reset_counts {
>>>         __u32 ctx_id;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was
>>>          * guilty. Zero if none.
>>>          */
>>>         __u32 ctx_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was active,
>>>          * not guilty. Zero if none.
>>>          */
>>>         __u32 ctx_not_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was active,
>>>          * but guilt was unknown. Zero if none.
>>>          */
>>>         __u32 ctx_unknown_guilt;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was guilty. Zero if none.
>>>          */
>>>         __u32 fd_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was not guilty, but affected. Zero if none.
>>>          */
>>>         __u32 fd_not_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was affected, but no guilt for the hang could
>>>          * be assigned. Zero if none.
>>>          */
>>>         __u32 fd_unknown_guilt;
>>
>> Since these three fields are per-fd, shouldn't they go in the
>> proposed drm_reset_counts structure instead? If we do that, it
>> might be better to split into two queries: one for the per-fd
>> information, and one for the detailed per-context information. If
>> we expect the common case to be no-reset, user space could
>
> Oh, I've missed a bit that your ioctl would read out the state for all
> contexts on a given fd. My idea was that this is all the data that the
> ioctl returns (and it would only fill out the ctx fields if you specify
> a nonzero context). Userspace would then only need to check whether the
> context has a new reset number somewhere (= the context suffered from a
> gpu reset while it was actively used) and then also check whether the fd
> suffered from a reset (= something in that gl share group is potentially
> corrupted), mapping it to either victim_reset or unknown_reset (depending
> upon what exactly the spec wants).

Consider the case where a process has four contexts: A, B, X, and Y.
A and B are in a share group, and X and Y are in a share group that
does not include A and B. B sees a reset. When the application queries
the reset status on X or Y, it should not see a reset. Knowing that
there was a reset for some other context on the same fd isn't enough
information.
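To make the X/Y case concrete, the share-group bookkeeping I have in mind
lives entirely on the Mesa side. The structures below are trimmed stand-ins
for Mesa's gl_context and gl_shared_state, and the Reset fields are
hypothetical additions for this discussion; the point is only that the
"poke the share group" step touches state that A and B already share, so
X and Y are never affected:

    #include <GL/gl.h>
    #include <GL/glext.h>   /* GL_*_CONTEXT_RESET_ARB enums */

    /* Trimmed stand-ins for Mesa's gl_shared_state / gl_context; the two
     * Reset fields are hypothetical additions for this discussion.
     */
    struct gl_shared_state {
        GLboolean ResetSeen;    /* some context in this share group lost data */
    };

    struct gl_context {
        struct gl_shared_state *Shared;
        GLenum ResetStatus;     /* what happened to this context itself */
    };

    /* Called when a kernel query reports that this context was involved
     * in a reset.  Poking the shared state is what lets context B find
     * out about context A's reset without the kernel knowing anything
     * about share groups.
     */
    static void
    note_reset(struct gl_context *ctx, GLenum kernel_verdict)
    {
        if (ctx->ResetStatus == GL_NO_ERROR)
            ctx->ResetStatus = kernel_verdict;
        ctx->Shared->ResetSeen = GL_TRUE;
    }

    /* Backs glGetGraphicsResetStatusARB() for one context. */
    static GLenum
    get_reset_status(struct gl_context *ctx)
    {
        if (ctx->ResetStatus != GL_NO_ERROR)
            return ctx->ResetStatus;

        /* Not personally reset, but a share-group sibling was: report an
         * innocent (or unknown, depending on how we read the spec) reset.
         */
        if (ctx->Shared->ResetSeen)
            return GL_INNOCENT_CONTEXT_RESET_ARB;

        return GL_NO_ERROR;
    }

Contexts X and Y point at a different gl_shared_state, so their ResetSeen
never gets set by B's reset.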
> Less complexity in userspace, and imo not too much fuss for the kernel to
> do the book-keeping: instead of only blaming contexts, we also blame
> file_privs. That would be a generally useful feature, e.g. to cut off
> certain fds selectively from the gpu because they hang it too fast -
> i.e. to avoid the dreaded global "hanging too fast" when developing
> drivers ;-)
>
>>> };
>>>
>>> The above could also be important to know when a hw context is toast and
>>> should not be used any more.
>>>
>>> - I like Chris' suggestion of using our seqno breadcrumbs for this. If you
>>> also add a small extension to execbuf to return the seqno of the new
>>> batch, you should also have all the tools in place to implement your
>>> extension to notify userspace up to which glFence things have completed.
>>> Note though that returning the seqno in execbuffer is only correct once
>>> we've eliminated the flushing_list.
>>
>> It's possible, but I need to finish working out that idea (see
>> below). I think the context only needs one seqno, not one per
>> guiltiness level: "This is the last seqno that was retired before
>> this context lost some data."
>
> The idea was to use the existing seqno as the reset counter in your ioctl.
> It might wrap around, but you need to emit quite a few commands to get to
> that point ;-) So we could use that right away.
>
>> That may still leave the context in a weird state. Think about this
>> timeline:
>>
>>    Application            GPU
>>    draw A
>>    submit A
>>    draw B
>>    submit B
>>                           drawing A completes
>>                           reset (B is lost)
>>    draw C
>>    submit C
>>                           drawing C completes
>>    query reset
>>
>> If / when we implement this feature, the kernel may need to drop any
>> work submitted between a reset and an ack of the reset. Dunno.
>
> Well, we could simply add a robust_context flag, set at create time. If
> such a context suffers from a gpu reset (while it's active) the kernel
> could refuse to queue up more batches with -EIO. Userspace could then

If we go that way, I think it would be better to just drop the
requests. OpenGL doesn't let me start generating errors after a reset,
so I'd have to selectively ignore these errors from the kernel. It
sounds like it would add a bunch of annoying complexity to user space.
Of course, we're talking about step 38 here, and we're still on step 2.

> inquire the kernel with your ioctl about what's going on. Similarly we
> could add such a flag to the fd, to catch everyone else not using hw
> contexts. There we need a way to clear that flag though, so that you can
> keep on using all the bo textures that survived (whereas hw contexts are
> pretty much worthless on their own after a reset, so we could demand that
> they be destroyed).
>
>>> - Last but not least, how often is userspace expected to call this ioctl?
>>> Around once per batchbuffer submission or way more/less?
>>
>> I don't know. This is a fairly new extension, and there are few
>> users. As far as I'm aware, only Chrome and Firefox use it. I can
>> find out some details from them. My guess is somewhere between once
>> per frame and once per draw call.
>
> Cheers, Daniel
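For completeness, here is roughly how I picture the per-context counters
from the proposed struct being turned into the verdict that gets handed to
the share-group bookkeeping sketched earlier. The cached-baseline structure
and the function names are made up; the only point is the "did any counter
move since we last looked" comparison:

    #include <GL/gl.h>
    #include <GL/glext.h>
    #include <stdint.h>

    /* Cached copy of the ctx_* counters from the proposed
     * drm_context_reset_counts, taken the last time this context asked.
     */
    struct ctx_reset_baseline {
        uint32_t guilty;
        uint32_t not_guilty;
        uint32_t unknown_guilt;
    };

    /* Compare fresh counters against the cached baseline.  GL_NO_ERROR
     * means no reset touched this context since the previous query;
     * anything else gets fed into the share-group bookkeeping above.
     */
    static GLenum
    classify_reset(struct ctx_reset_baseline *base,
                   const struct ctx_reset_baseline *now)
    {
        GLenum verdict = GL_NO_ERROR;

        if (now->guilty != base->guilty)
            verdict = GL_GUILTY_CONTEXT_RESET_ARB;
        else if (now->not_guilty != base->not_guilty)
            verdict = GL_INNOCENT_CONTEXT_RESET_ARB;
        else if (now->unknown_guilt != base->unknown_guilt)
            verdict = GL_UNKNOWN_CONTEXT_RESET_ARB;

        *base = *now;   /* remember what we have already reported */
        return verdict;
    }

The per-fd counters would be folded in the same way, except that, as
discussed above, their result would only ever map to an unknown reset for
arb_robustness.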