On Wed, Feb 27, 2013 at 05:15:08PM -0800, Ian Romanick wrote:
> On 02/27/2013 01:13 AM, Chris Wilson wrote:
> >On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
> >>On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
> >>>For arb-robustness, every context needs to have its own
> >>>reset state tracking. The default context will be handled in an
> >>>identical way to the no-context case further down in the patch set.
> >>>In the no-context case, the reset state will be stored in
> >>>the file_priv part.
> >>>
> >>>v2: handle default context inside get_reset_state
> >>
> >>This isn't the interface we want. I already sent you the patches
> >>for Mesa, and you seem to have completely ignored them. Moreover,
> >>this interface provides no mechanism to query for its existence
> >>(other than relying on the kernel version), and no method to
> >>deprecate it.
> >
> >Its existence, and the existence of any successor, is the only test
> >you need to check for the interface. Testing the flag field up front
> >also lets you know whether individual flags are known prior to use.
> >
> >>Based on e-mail discussions, I think danvet agrees with me here.
> >>
> >>Putting guilty / innocent counting in the kernel puts policy
> >>decisions in the kernel that belong with the user space API that
> >>implements them. Putting these choices in the kernel significantly
> >>decreases how "future proof" the interface is. Since any kernel/user
> >>interface has to be kept forever, this is just asking for
> >>maintenance trouble down the road.
> >
> >I think you have the policy standpoint reversed.
>
> Can you elaborate? I've been out of the kernel loop for a long
> time, but I always thought policy and mechanism were supposed to be
> separated. In the case of drivers with a user-mode component, the
> mechanism was in the kernel (where else could it be?), and policy
> decisions were in user-mode.
> This is often at odds with keeping the
> interfaces between kernel and user thin, so that may be where my
> misunderstanding is.

Right, the idea is to keep policy out of the kernel. I disagree that
your suggestion succeeds in doing this.

[snip to details]

> There are two requirements in the ARB_robustness extension (and
> additional layered extensions) that I believe cause problems with
> all of the proposed counting interfaces.
>
> 1. GL contexts that are not affected by a reset (i.e., didn't lose
> any objects, didn't lose any rendering, etc.) should not observe it.
> This is the big one that all the browser vendors want. If you have
> 5 tabs open, a reset caused by a malicious WebGL app in one tab
> shouldn't, if possible, force them to rebuild all of their state for
> all the other tabs.
>
> 2. After a GL context observes a reset, it is garbage. It must be
> destroyed and a new one created. This means we only need to know
> that a reset of any variety affected the context.

For me, "observes" implies the ability of the context to inspect global
system state, whereas I think you mean whether the context and its
associated state were affected by a reset.

> In addition, I'm concerned about having one GL context be able to
> observe, in any manner, resets that did not affect it. Given that
> people have figured out how to get encryption keys by observing
> cache timings, it's not impossible for a malicious program to use
> information about reset counts to do something. Leaving a potential
> gap like this in an extension that's used to improve security
> seems... like we're just asking for a Phoronix article. We don't
> have any requirement to expose this information, so I don't think we
> should.

So should we not do minor/major pagefault tracking either? I only
suggested that you could read the global state because I thought you
implied you wanted it.
From the display server POV, the global state is the most useful, as I
am more interested in whether I can reliably use the GPU and prevent a
DoS from a misbehaving set of clients.

> We also don't have any requirement to expose this functionality on
> anything pre-Sandy Bridge. We may want, at some point, to expose it
> on Ironlake. Based on that, it seems reasonable to tie the
> availability of reset notifications to hardware contexts. Since we
> won't test it, Mesa certainly isn't going to expose it on 8xx or
> 915.

From a design standpoint, hw contexts are just one function of the
context object, not the reverse. Ben has been rightfully arguing that
we need contexts in the kernel for far more than supporting logical hw
contexts.

> Based on this, a simplified interface became obvious: for each
> hardware context, track a mask of kinds of resets that have affected
> that context. I defined three bits in the mask:

The problem is that you added an atomic read-and-reset into the ioctl.
I objected to that policy when Mika presented it earlier, as it
prevents concurrent access and so imposes an explicit usage model.

> 1. The hw context had a batch executing when the reset occurred.
>
> 2. The hw context had a batch pending when the reset occurred.
> Hopefully in the future we could make most occurrences of this go
> away by re-queuing the batch. It may also be possible to do this in
> user-mode, but that may be more difficult.
>
> 3. "Other."
>
> There's also the potential to add other bits to communicate
> information about lost buffers, textures, etc. This could be used
> to create a layered extension that lets applications transfer data
> from the dead context to the new context. Browsers may not be too
> interested in this, but I could imagine compositors or other kinds
> of applications benefiting. We may never implement that, but it's
> easier to communicate this through a set of flags than through a set
> of counts.
> The guilty / innocent counts in this patch series lose information.

Disagree. The flags represent a loss of information, as you explained
earlier.

> We can probably reconstruct the current flags from the counts, but
> we would have to do some refactoring (or additional counting) to get
> other flags in the future. All of this would make the kernel code
> more complex. Why reconstruct the data you want when you could just
> track that data in the first place?

Because you are constructing a policy decision based on the raw
information.

> Since the functionality is not available on all hardware, I also
> added a query to determine the availability. This has the added
> benefit that, should this interface prove to be insufficient in the
> future, we could deprecate it by having the query always return
> false.

That offers no more information than just probing the ioctl.

> All software involved has to be able to handle the case
> where the interface is not available, so I don't think this should
> run afoul of the rules about kernel/user interface breakage. Please
> correct me if I'm wrong.

[snip patch locations]

The biggest change you made was to reduce the counts to a set of
flags, to make it harder for an attacker to analyse the reset
frequency. Instead, that attacker can just poll the flags and coarsely
reconstruct the counts... if that is their goal. Meanwhile, it
transfers userspace policy into the kernel, which is what we were
seeking to avoid, right?
-Chris

--
Chris Wilson, Intel Open Source Technology Centre