[RFC] GPU reset notification interface

idr at freedesktop.org (Ian Romanick) · Tue, 17 Jul 2012 15:16:19 -0700

I'm getting ready to implement the reset notification part of 
GL_ARB_robustness in the i965 driver.  There are a bunch of quirky bits 
of the extension that are causing some grief in designing the kernel / 
user interface.  I think I've settled on an interface that should meet 
all of the requirements, but I want to bounce it off people before I 
start writing code.

Here's the list of requirements.

- Applications poll for reset status.

- Contexts that did not lose data or rendering should not receive a 
reset notification.  This isn't strictly a requirement of the spec, but 
it seems like a good practice.  Once an application receives a reset 
notification for a context, it is supposed to destroy that context and 
start over.

- If one context in an OpenGL share group receives a reset notification, 
all contexts in that share group must receive a reset notification.

- All contexts in a single GL share group will have the same fd.  This 
isn't a requirement so much as a simplifying assumption.  All contexts 
in a share group have to be in the same address space, so I will assume 
that means they're all controlled by one DRI driver instance with a 
single fd.

- The reset notification given to the application should try to assign 
guilt.  There are three values possible: unknown guilt, you're not 
guilty, or you are guilty.

- If there are multiple resets between polls, the application should get 
the "most guilty" answer.  In other words, if there are two resets and 
the context was guilty for one and not the other, the application should 
get the guilty notification.

- After the application polls the status, the status should revert to 
"no reset occurred."

- If the application polls the status and the reset operation is still 
in progress, it should continue to get the reset value until it is 
"safe" to begin issuing GL commands again.
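The polling rules above (most-guilty wins, status reverts after a poll, status is held while a reset is in progress) can be sketched as a small per-context state machine. This is only an illustration: the enum, struct, and function names are hypothetical, and ranking "unknown guilt" above "not guilty" is an assumption that the requirements do not settle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status values, ordered so that a larger value is "more
 * guilty".  Ranking UNKNOWN above INNOCENT is an assumption. */
enum reset_status {
	RESET_NONE = 0,
	RESET_INNOCENT,
	RESET_UNKNOWN,
	RESET_GUILTY,
};

/* Hypothetical per-context bookkeeping. */
struct ctx_reset_state {
	enum reset_status pending;	/* worst status since the last poll */
};

/* Record a reset that affected this context, keeping the most guilty
 * answer seen so far. */
static void note_reset(struct ctx_reset_state *s, enum reset_status st)
{
	assert(s);
	if (st > s->pending)
		s->pending = st;
}

/* Poll the status.  The stored value reverts to "no reset occurred"
 * only once the reset has completed, so the application keeps seeing
 * the notification until it is safe to issue GL commands again. */
static enum reset_status poll_reset(struct ctx_reset_state *s,
				    bool reset_in_progress)
{
	enum reset_status st;

	assert(s);
	st = s->pending;
	if (!reset_in_progress)
		s->pending = RESET_NONE;
	return st;
}
```

With two resets between polls, one guilty and one not, the first completed poll returns the guilty status and the next one returns "no reset occurred."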

At some point I'd like to extend this to provide a slightly finer-grained 
mechanism so that a context could be told that everything after a 
particular GL sync (fence) operation was lost.  This could prevent some 
applications from having to destroy and rebuild their context.  This 
isn't a requirement, but it's an idea that I've been mulling.

Each time a reset occurs, an internal count is incremented.  This 
associates a unique integer, starting with 1, with each reset event. 
Each context affected by the reset will have the reset event ID stored 
in one of its three guilt levels.  An ioctl will be provided that returns 
the data in the structures at the end of this message for all contexts 
associated with a particular fd.

In addition, it will return the index of any reset operation that is 
still in progress.
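As a rough sketch of the bookkeeping described here, the code below numbers reset events starting at 1 and stores each event's ID in the matching guilt slot of every affected context. All names are hypothetical, uint32_t stands in for the kernel's __u32, and locking and counter wraparound are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guilt classification for one context in one reset. */
enum guilt { GUILT_GUILTY, GUILT_NOT_GUILTY, GUILT_UNKNOWN };

/* Hypothetical per-context record: the most recent reset ID at each
 * guilt level, zero if none. */
struct ctx_counts {
	uint32_t guilty;
	uint32_t not_guilty;
	uint32_t unknown_guilt;
};

/* Internal count of resets.  Zero means no reset has happened yet. */
static uint32_t reset_counter;

/* Start a new reset event and return its unique ID (1, 2, 3, ...). */
static uint32_t begin_reset(void)
{
	return ++reset_counter;
}

/* Store the reset ID in the slot matching this context's guilt. */
static void record_reset(struct ctx_counts *c, uint32_t reset_id,
			 enum guilt g)
{
	assert(c && reset_id != 0);
	switch (g) {
	case GUILT_GUILTY:
		c->guilty = reset_id;
		break;
	case GUILT_NOT_GUILTY:
		c->not_guilty = reset_id;
		break;
	case GUILT_UNKNOWN:
		c->unknown_guilt = reset_id;
		break;
	}
}
```

Because IDs only grow, each slot always holds the most recent reset at that guilt level, which is what lets user space tell new events from ones it has already reported.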

I think this should be sufficient information for user space to meet all 
of the requirements.  I had a conversation with Paul and Ken about this. 
Paul was pretty happy with the idea.  Ken felt like too little policy 
was in the kernel, and the resulting interface was too heavy (I'm 
paraphrasing).

struct drm_context_reset_counts {
	__u32 ctx_id;

	/**
	 * Index of the most recent reset where this context was
	 * guilty.  Zero if none.
	 */
	__u32 guilty;

	/**
	 * Index of the most recent reset where this context was
	 * not guilty.  Zero if none.
	 */
	__u32 not_guilty;

	/**
	 * Index of the most recent reset where guilt was unknown.
	 * Zero if none.
	 */
	__u32 unknown_guilt;
};

struct drm_reset_counts {
	/** Index of the in-progress reset.  Zero if none. */
	__u32 reset_index_in_progress;

	/** Number of contexts. */
	__u32 num_contexts;

	struct drm_context_reset_counts contexts[0];
};
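
To illustrate how user space might consume this data, here is a minimal sketch that sizes the variable-length result, looks up a context by ID, and picks the "most guilty" status newer than the last reset index the driver has already reported.  The struct and function names are hypothetical stand-ins (uint32_t replaces __u32, and a C99 flexible array member replaces contexts[0]), no real ioctl is issued, and ranking unknown guilt above not guilty is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins mirroring the proposed structures. */
struct ctx_reset_counts {
	uint32_t ctx_id;
	uint32_t guilty;
	uint32_t not_guilty;
	uint32_t unknown_guilt;
};

struct reset_counts {
	uint32_t reset_index_in_progress;
	uint32_t num_contexts;
	struct ctx_reset_counts contexts[];
};

/* Hypothetical status values reported to the GL layer. */
enum reset_status { RESET_NONE, RESET_INNOCENT, RESET_UNKNOWN, RESET_GUILTY };

/* Bytes needed for a result buffer describing n contexts. */
static size_t reset_counts_size(uint32_t n)
{
	return sizeof(struct reset_counts) +
	       (size_t)n * sizeof(struct ctx_reset_counts);
}

/* Allocate a zeroed result buffer for n contexts, or NULL on failure. */
static struct reset_counts *alloc_reset_counts(uint32_t n)
{
	struct reset_counts *rc = calloc(1, reset_counts_size(n));

	if (rc)
		rc->num_contexts = n;
	return rc;
}

/* Find the entry for ctx_id, or NULL if the fd has no such context. */
static const struct ctx_reset_counts *
find_context(const struct reset_counts *rc, uint32_t ctx_id)
{
	assert(rc);
	for (uint32_t i = 0; i < rc->num_contexts; i++)
		if (rc->contexts[i].ctx_id == ctx_id)
			return &rc->contexts[i];
	return NULL;
}

/* Most guilty status among resets newer than last_seen, the highest
 * reset index already reported to the application for this context. */
static enum reset_status
classify_since(const struct ctx_reset_counts *c, uint32_t last_seen)
{
	assert(c);
	if (c->guilty > last_seen)
		return RESET_GUILTY;
	if (c->unknown_guilt > last_seen)
		return RESET_UNKNOWN;
	if (c->not_guilty > last_seen)
		return RESET_INNOCENT;
	return RESET_NONE;
}
```

In this sketch the caller decides when to advance last_seen; holding it back while reset_index_in_progress is non-zero would keep the notification visible until the reset completes.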