[RFC] GPU reset notification interface

On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> On Tue, Jul 17, 2012 at 03:16:19PM -0700, Ian Romanick wrote:
>> I'm getting ready to implement the reset notification part of
>> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky
>> bits of the extension that are causing some grief in designing the
>> kernel / user interface.  I think I've settled on an interface that
>> should meet all of the requirements, but I want to bounce it off
>> people before I start writing code.
>>
>> Here's the list of requirements.
>>
>> - Applications poll for reset status.
>>
>> - Contexts that did not lose data or rendering should not receive a
>> reset notification.  This isn't strictly a requirement of the spec,
>> but it seems like a good practice.  Once an application receives a
>> reset notification for a context, it is supposed to destroy that
>> context and start over.
>>
>> - If one context in an OpenGL share group receives a reset
>> notification, all contexts in that share group must receive a reset
>> notification.
>>
>> - All contexts in a single GL share group will have the same fd.
>> This isn't a requirement so much as a simplifying assumption.  All
>> contexts in a share group have to be in the same address space, so I
>> will assume that means they're all controlled by one DRI driver
>> instance with a single fd.
>>
>> - The reset notification given to the application should try to
>> assign guilt.  There are three values possible: unknown guilt,
>> you're not guilty, or you are guilty.
>>
>> - If there are multiple resets between polls, the application should
>> get the "most guilty" answer.  In other words, if there are two
>> resets and the context was guilty for one and not the other, the
>> application should get the guilty notification.
>>
>> - After the application polls the status, the status should revert
>> to "no reset occurred."
>>
>> - If the application polls the status and the reset operation is
>> still in progress, it should continue to get the reset value until
>> it is "safe" to begin issuing GL commands again.
>>
>> At some point I'd like to extend this to a slightly finer-grained
>> mechanism so that a context could be told that everything after a
>> particular GL sync (fence) operation was lost.  This could prevent
>> some applications from having to destroy and rebuild their context.
>> This isn't a requirement, but it's an idea that I've been mulling.
>>
>> Each time a reset occurs, an internal count is incremented.  This
>> associates a unique integer, starting with 1, with each reset event.
>> Each context affected by the reset will have the reset event ID
>> stored in one of its three guilt levels.  An ioctl will be provided
>> that returns the following data for all contexts associated with a
>> particular fd.
>>
>> In addition, it will return the index of any reset operation that is
>> still in progress.
>>
>> I think this should be sufficient information for user space to meet
>> all of the requirements.  I had a conversation with Paul and Ken
>> about this.  Paul was pretty happy with the idea.  Ken felt like too
>> little policy was in the kernel, and the resulting interface was too
>> heavy (I'm paraphrasing).
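
(For reference, the bookkeeping described above boils down to something
like the following.  None of these names are real i915 code; it's only a
sketch of the counter scheme.)

#include <linux/types.h>

enum reset_guilt { RESET_GUILTY, RESET_NOT_GUILTY, RESET_UNKNOWN_GUILT };

struct ctx_reset_state {
	__u32 guilty;         /* last reset this context caused; 0 = never */
	__u32 not_guilty;     /* last reset this context was hit by, innocently */
	__u32 unknown_guilt;  /* last reset where guilt could not be assigned */
};

static __u32 reset_counter;   /* incremented once per reset; first reset is 1 */

/* Called once for each context affected by a reset, after the reset
 * handler has done reset_counter++ exactly once for this reset. */
static void mark_context_reset(struct ctx_reset_state *ctx,
			       enum reset_guilt guilt)
{
	switch (guilt) {
	case RESET_GUILTY:
		ctx->guilty = reset_counter;
		break;
	case RESET_NOT_GUILTY:
		ctx->not_guilty = reset_counter;
		break;
	case RESET_UNKNOWN_GUILT:
		ctx->unknown_guilt = reset_counter;
		break;
	}
}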
>
> A few things:
> - I agree with Chris that reset_in_progress should go: if userspace can
>    sneak in and witness a reset event, we have a bug in the kernel. Since
>    very recently, we actually have a few fewer bugs in that area ;-)

I'm operating under the assumption that, from user space's perspective, 
resets are not instantaneous.  If resets are instantaneous, that may 
change things.

I had envisioned two potential uses for reset_in_progress, but I've 
managed to talk myself out of both.

> - The "all contexts in a share group need to receive a reset notification"
>    wording is irking me a bit because we currently only track all the
>    actively used things. So if another context in that share group isn't
>    affected (i.e. doesn't even use any of the potentially corrupted bos),
>    is the idea that the kernel grows the required tracking, or should
>    userspace always ask the reset state for all contexts and do the math
>    itself?

There are a couple reasons that all contexts in a share group need to 
get the reset notification.  Consider an application with two contexts, 
A and B.  Context A is a worker context that does a bunch of 
render-to-texture operations, and context B will eventually consume 
those textures.  If context A receives a reset, context B, even if it 
hasn't done any rendering in five minutes, has lost data.

The kernel should never have any knowledge about GL share groups.  This 
is where my simplifying assumption (that all contexts in a share group 
share an address space and an fd) comes in handy.  If context A queries 
the reset status from the kernel first, it can reach over and poke the 
reset status of context B (in the gl_context structure).  Likewise, if 
context B queries the kernel first, it can see that another kernel 
context in its GL context share group got reset.
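
To make that flow concrete, here's a rough sketch of what the DRI driver
side could do with the proposed counters.  None of these structures or
helpers are real Mesa code; they only illustrate the "most guilty wins"
folding and the share-group poke.

#include <stdint.h>

enum reset_status { NO_RESET, INNOCENT, UNKNOWN, GUILTY };

/* Hypothetical per-context state kept by the DRI driver. */
struct dri_context {
	uint32_t guilty, not_guilty, unknown_guilt;  /* copied from the ioctl */
	uint32_t last_seen_reset;    /* highest reset index already reported */
	enum reset_status pending;   /* what the next app query will return */
};

/* Fold the three counters into one answer; "most guilty" wins for any
 * reset newer than what this context has already reported. */
static enum reset_status fold_status(const struct dri_context *c)
{
	if (c->guilty > c->last_seen_reset)
		return GUILTY;
	if (c->unknown_guilt > c->last_seen_reset)
		return UNKNOWN;
	if (c->not_guilty > c->last_seen_reset)
		return INNOCENT;
	return NO_RESET;
}

/* Whichever context talks to the kernel first pokes its share-group
 * siblings, so the whole group reports the data loss. */
static void update_share_group(struct dri_context **group, int n)
{
	for (int i = 0; i < n; i++) {
		enum reset_status s = fold_status(group[i]);
		if (s == NO_RESET)
			continue;
		group[i]->pending = s;
		for (int j = 0; j < n; j++)
			if (j != i && group[j]->pending == NO_RESET)
				group[j]->pending = INNOCENT;
	}
}

Once the application has actually seen the status, the driver would bump
last_seen_reset so the answer reverts to "no reset occurred", per the
requirement above.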

> - The above essentially links in with how we blame the guilt and figure
>    out who's affected. Especially when there's no hw context around, e.g.
>    on the blitter or bsd rings, or when an object is used by another
>    process and gets corrupted there. Does the spec make any guarantees for
>    such a case? Or should we just blame an unknown_reset for all the
>    contexts that belong to the fd that submitted the bad batch.

That sounds right.  If we can't assess innocence or guilt, we should say 
guilt is unknown.  There are some GL commands that generate blits, 
but I don't think there's anything that can get commands on the BSD 
ring.  That's just for media, right?

> As an idea for the above two issues, what about the kernel also keeping
> per-fd reset statistics, which would also be returned with this ioctl?
> We'd then restrict the meaning of the ctx fields to only mean "and the
> context was actually active". Userspace could then wrap all the per-fd
> hang reports into reset_unknown for arb_robustness, but I think this would
> be a bit more flexible for other userspace.

Ah, right.  So the DDX or libva could detect resets that affect them. 
That's reasonable.

> struct drm_context_reset_counts {
> 	__u32 ctx_id;
>
> 	/**
> 	 * Index of the most recent reset where this context was
> 	 * guilty.  Zero if none.
> 	 */
> 	__u32 ctx_guilty;
>
> 	/**
> 	 * Index of the most recent reset where this context was active,
> 	 * not guilty.  Zero if none.
> 	 */
> 	__u32 ctx_not_guilty;
>
> 	/**
> 	 * Index of the most recent reset where this context was active,
> 	 * but guilt was unknown.  Zero if none.
> 	 */
> 	__u32 ctx_unknown_guilt;
>
> 	/**
> 	 * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was guilty.  Zero if none.
> 	 */
> 	__u32 fd_guilty;
>
> 	/**
> 	 * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was not guilty, but affected.  Zero if none.
> 	 */
> 	__u32 fd_not_guilty;
>
> 	/**
> 	 * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was affected, but no guilt for the hang could
> 	 * be assigned.  Zero if none.
> 	 */
> 	__u32 fd_unknown_guilt;

Since these three fields are per-fd, shouldn't they go in the proposed 
drm_reset_counts structure instead?  If we do that, it might be better 
to split this into two queries: one for the per-fd information, and one 
for the detailed per-context information.  If we expect the common case 
to be no reset, user space could get away with the cheap per-fd query on 
most polls and only fall back to the full per-context query when the 
per-fd counters show something new (a rough sketch of that split follows 
the quoted struct).

> };
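
To sketch what that split might look like (purely illustrative, reusing
the drm_context_reset_counts layout quoted above; none of this is a
concrete proposal, and the in/out semantics of num_contexts is just an
assumption):

struct drm_fd_reset_counts {
	__u32 fd_guilty;         /* last reset caused by a batch from this fd */
	__u32 fd_not_guilty;     /* last reset that affected this fd innocently */
	__u32 fd_unknown_guilt;  /* last reset with unassignable guilt */
};

struct drm_context_reset_query {
	__u32 num_contexts;      /* in: how many slots, out: how many written */
	struct drm_context_reset_counts contexts[0];
};

The cheap, fixed-size per-fd query would be issued on most polls; the
per-context one only when the per-fd counters changed.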
>
> The above could also be important to know when a hw context is toast and
> had better not be used any more.
>
> - I like Chris' suggestion of using our seqno breadcrumbs for this. If you
>    also add a small extension to execbuf to return the seqno of the new
>    batch, you should also have all the tools in place to implement your
>    extension to notify userspace up to which glFence things have completed.
>    Note though that returning the seqno in execbuffer is only correct once
>    we've eliminated the flushing_list.

It's possible, but I need to finish working out that idea (see below). 
I think the context only needs one seqno, not one per guiltiness level: 
"This is the last seqno that was retired before this context lost some 
data."

That may still leave the context in a weird state.  Think about this 
timeline:

     Application      GPU
     draw A
                      submit A
     draw B
                      submit B
                      drawing A completes
                      reset (B is lost)
     draw C
                      submit C
                      drawing C completes
     query reset

If / when we implement this feature, the kernel may need to drop any 
work submitted between a reset and an ack of the reset.  Dunno.
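
If that finer-grained notification ever happens, the per-context record
probably only needs one extra field.  Something like this (hypothetical,
not a proposal):

	/**
	 * Hypothetical addition to drm_context_reset_counts: last seqno
	 * known to have retired before this context lost data.  Zero if
	 * no data was lost.  Work that only depends on results up to this
	 * point survives; only the rest needs to be resubmitted, instead
	 * of rebuilding the whole context.
	 */
	__u32 last_good_seqno;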

> - Last but not least, how often is userspace expected to call this ioctl?
>    Around once per batchbuffer submission or way more/less?

I don't know.  This is a fairly new extension, and there are few users. 
As far as I'm aware, only Chrome and Firefox use it.  I can find out 
some details from them.  My guess is somewhere between once per frame 
and once per draw call.
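
For reference, the application side is just a poll like the one below, so
the ioctl rate is whatever the DRI driver does per
glGetGraphicsResetStatusARB() call.  The enum values come from the
extension spec; in real code the entry point is looked up with
glXGetProcAddress() or similar.

#include <GL/gl.h>
#include <GL/glext.h>

/* Returns nonzero if the application should tear down the context and
 * start over, per the ARB_robustness spec. */
static int context_needs_rebuild(PFNGLGETGRAPHICSRESETSTATUSARBPROC get_status)
{
	switch (get_status()) {
	case GL_NO_ERROR:
		return 0;               /* nothing happened, keep rendering */
	case GL_GUILTY_CONTEXT_RESET_ARB:
	case GL_INNOCENT_CONTEXT_RESET_ARB:
	case GL_UNKNOWN_CONTEXT_RESET_ARB:
	default:
		return 1;
	}
}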

> Cheers, Daniel
>>
>> struct drm_context_reset_counts {
>> 	__u32 ctx_id;
>>
>> 	/**
>> 	 * Index of the most recent reset where this context was
>> 	 * guilty.  Zero if none.
>> 	 */
>> 	__u32 guilty;
>>
>> 	/**
>> 	 * Index of the most recent reset where this context was
>> 	 * not guilty.  Zero if none.
>> 	 */
>> 	__u32 not_guilty;
>>
>> 	/**
>> 	 * Index of the most recent reset where guilt was unknown.
>> 	 * Zero if none.
>> 	 */
>> 	__u32 unknown_guilt;
>> };
>>
>> struct drm_reset_counts {
>> 	/** Index of the in-progress reset.  Zero if none. */
>> 	__u32 reset_index_in_progress;
>>
>> 	/** Number of contexts. */
>> 	__u32 num_contexts;
>>
>> 	struct drm_context_reset_counts contexts[0];
>> };
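
For completeness, the user-space side of the proposed query would
presumably be a small libdrm-style wrapper along these lines.  The
request number is a placeholder (nothing has been reserved), and treating
num_contexts as in/out is just an assumption, not part of the proposal.

#include <stdlib.h>
#include <xf86drm.h>    /* drmIoctl(); pulls in DRM_IOWR(), DRM_COMMAND_BASE */

#define DRM_I915_GET_RESET_COUNTS 0x2f   /* placeholder, not reserved */
#define DRM_IOCTL_I915_GET_RESET_COUNTS \
	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GET_RESET_COUNTS, \
		 struct drm_reset_counts)

/* Query reset counts for up to max_contexts contexts on this fd.  The
 * caller frees the returned buffer; NULL on allocation or ioctl failure. */
static struct drm_reset_counts *get_reset_counts(int fd, __u32 max_contexts)
{
	struct drm_reset_counts *counts;
	size_t size = sizeof(*counts) +
		      max_contexts * sizeof(struct drm_context_reset_counts);

	counts = calloc(1, size);
	if (!counts)
		return NULL;

	counts->num_contexts = max_contexts;
	if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_COUNTS, counts)) {
		free(counts);
		return NULL;
	}

	return counts;
}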