On Mon, May 06, 2013 at 11:44:22AM +0200, Daniel Vetter wrote: > On Mon, May 06, 2013 at 11:40:06AM +0200, Daniel Vetter wrote: > > On Thu, May 02, 2013 at 01:27:32PM -0700, Jesse Barnes wrote: > > > On Tue, 23 Apr 2013 23:15:29 -0700 > > > Ben Widawsky <ben at bwidawsk.net> wrote: > > > > > > > Because our context refcounting doesn't grab a ref at lookup time, it is > > > > unsafe to do so without the lock. > > > > > > > > NOTE: We don't have an easy way to put the assertion in the lookup > > > > function which is where this really belongs. Context switching is good > > > > enough because it actually asserts even more correctness by protecting > > > > the default_context. > > > > > > > > Signed-off-by: Ben Widawsky <ben at bwidawsk.net> > > > > --- > > > > drivers/gpu/drm/i915/i915_gem_context.c | 2 ++ > > > > 1 file changed, 2 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c > > > > index a1e8ecb..411ace0 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c > > > > @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring, > > > > if (dev_priv->hw_contexts_disabled) > > > > return 0; > > > > > > > > + BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex)); > > > > + > > > > if (ring != &dev_priv->ring[RCS]) > > > > return 0; > > > > > > > > > > Simple enough. > > > > > > Reviewed-by: Jesse Barnes <jbarnes at virtuousgeek.org> > > > > > > We usually do WARN_ONs for this stuff though, in case a user actually > > > does hit it, it may not be fatal so why crash the machine? > > > > > > But that's a minor distinction since we shouldn't hit this except in > > > development anyway. > > > > Well, since this is a patch for upstream the focus should very much be on > > supporting bug reporters and not developers. And for bug reporters a BUG > > is much more annoying than a WARN and greatly reduces the chances that > > we'll get a bug report. > > Some more details why a WARN massively beats a BUG for us: BUG kills the > current process and ensures all locks are stuck. Usually that means X is > dead and you can't vt-switch away to the console to take a quick look at > dmesg. > > Now even when all rendering is down the toilet due to the follow-up damage > after the WARN and the gpu a zombie, there's a non-zero chance that > vt-switch (or sw rendering in X) will work long enough to grab log files > and debugfs data. > > Hence the first rule to only use a BUG on if we have a guaranteed OOPS > otherwise (which again will kill the process and make all locks stuck). > -Daniel > > > > > There are imo only very few cases where a BUG instead of a WARN is > > justified: > > - The kernel is _guaranteed_ to oops in the next few lines anyway, so a > > BUG_ON will help in readability of the backtrace. Note that checking 3 > > different things for non-NULL in the same BUG actually reduces OOPS > > readability (with an oops you can at least reconstruct the faulting > > address and so probably the pointer). Also, this means the BUG should > > have a neat description of what exactly blew up. > > > > The "a few lines" part is just a guideline with some big exceptions. A > > prime example is refcount over/underflows since those will blow up, but > > only sometimes later (and usually no one will have a clue why). > > > > - BUGs are justified if there's a potential security hole awaiting, e.g. > > when something in the userspace input validation has gone wrong. > > > > - I'm wary of special error handling for WARNs, but if an early return > > (with an error code if possible) transforms a BUG into a WARN I'm in. > > But trying to fix up e.g. modeset state is usually futile, since we'll > > end up with a black/fuzzy/corrupted screen most likely anyway. > > > > But the most important rule is: In case of doubt, just WARN, don't BUG. > > > > [I know, I violate it sometimes, too.] > > > > Patch applied with the s/BUG/WARN bikeshed. > > -Daniel Thanks for applying the patch, it's certainly better than what we have currently. Why I wanted a BUG: When you get a ref to an object without holding a lock you get a potentially unsafe pointer (to which we will be writing). If the context object memory is freed, and we write to it, we have a potential to late scribble over <insert your file system of choice> memory. There is probably a similar security implication there as well. Many of us are used to, and capable of recovering from GPU hangs, but less of us like to deal with FS recovery. I actually believe all "get" code like this (backed with refcounts) should BUG and not WARN. -- Ben Widawsky, Intel Open Source Technology Center