On 11/24/2015 05:26 AM, Imre Deak wrote:
On ti, 2015-11-24 at 14:04 +0100, Daniel Vetter wrote:
> On Mon, Nov 23, 2015 at 03:02:58PM -0800, yu.dai@xxxxxxxxx wrote:
> > From: Alex Dai <yu.dai@xxxxxxxxx>
> >
> > When the GuC work queue is full, the driver will wait for the GuC to
> > free up space, sleeping 1ms between retries. The wait needs to be
> > outside of the spin_lock_irqsave / unlock section. Otherwise a lockup
> > happens, because jiffies won't be updated while interrupts are
> > disabled.
> >
> > The issue was found in igt/gem_close_race.
> >
> > Signed-off-by: Alex Dai <yu.dai@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/i915_guc_submission.c | 27 +++++++++++++++++----------
> >  1 file changed, 17 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> > index 0a6b007..1418397 100644
> > --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> > @@ -201,10 +201,13 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> >  	union guc_doorbell_qw *db;
> >  	void *base;
> >  	int attempt = 2, ret = -EAGAIN;
> > +	unsigned long flags;
> >
> >  	base = kmap_atomic(i915_gem_object_get_page(gc->client_obj, 0));
>
> We don't need kmap_atomic here anymore, since it's outside of the
> spinlock.
>
> >  	desc = base + gc->proc_desc_offset;
> >
> > +	spin_lock_irqsave(&gc->wq_lock, flags);
>
> Please don't use the super-generic _irqsave. It's expensive and
> results in fragile code when someone accidentally reuses something in
> an interrupt handler that was never meant to run in that context.
>
> Instead please use the most specific function:
> - spin_lock if you know you are in irq context.
> - spin_lock_irq if you know you are not.

Right, and simply spin_lock() if the lock is not taken in IRQ context
ever.
This code is not called from IRQ context, so I will use spin_lock_irq
instead.
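To make the distinction concrete, this is how I read the guidance (the
lock and functions below are made-up examples, not from the patch):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);

/* Caller is known to be in IRQ context: interrupts are already off. */
static void used_from_irq_handler(void)
{
	spin_lock(&example_lock);
	/* ... short critical section ... */
	spin_unlock(&example_lock);
}

/* Caller is known to be in process context, but the lock is also
 * taken from an IRQ handler, so interrupts must be disabled here. */
static void used_from_process_context(void)
{
	spin_lock_irq(&example_lock);
	/* ... short critical section ... */
	spin_unlock_irq(&example_lock);
}

/* Caller context unknown: works everywhere, but per the review this
 * is a warning sign that the calling layers are not understood. */
static void used_from_unknown_context(void)
{
	unsigned long flags;

	spin_lock_irqsave(&example_lock, flags);
	/* ... short critical section ... */
	spin_unlock_irqrestore(&example_lock, flags);
}

/* And if the lock is never taken from IRQ context at all, a plain
 * spin_lock() from process context is enough, as Imre notes. */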
> - spin_lock_irqsave should be a big warning sign that your code has
>   layering issues.
>
> Please audit the entire guc code for the above two issues.

Agreed, it looks inconsistent atm: we do spin_lock(wq_lock) from
debugfs and spin_lock_irq(wq_lock) from i915_guc_submit(). Neither of
them is called from IRQ context AFAICS, in which case a simple
spin_lock() would do.

--Imre

> > +
> >  	/* Update the tail so it is visible to GuC */
> >  	desc->tail = gc->wq_tail;
> >
> > @@ -248,7 +251,10 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> >  		db_exc.cookie = 1;
> >  	}
> >
> > +	spin_unlock_irqrestore(&gc->wq_lock, flags);
> > +
> >  	kunmap_atomic(base);
> > +
> >  	return ret;
> >  }
> >
> > @@ -487,16 +493,16 @@ static int guc_get_workqueue_space(struct i915_guc_client *gc, u32 *offset)
> >  	struct guc_process_desc *desc;
> >  	void *base;
> >  	u32 size = sizeof(struct guc_wq_item);
> > -	int ret = 0, timeout_counter = 200;
> > +	int ret = -ETIMEDOUT, timeout_counter = 200;
> > +	unsigned long flags;
> >
> >  	base = kmap_atomic(i915_gem_object_get_page(gc->client_obj, 0));
> >  	desc = base + gc->proc_desc_offset;
> >
> >  	while (timeout_counter-- > 0) {
> > -		ret = wait_for_atomic(CIRC_SPACE(gc->wq_tail, desc->head,
> > -				      gc->wq_size) >= size, 1);
> > +		spin_lock_irqsave(&gc->wq_lock, flags);
> >
> > -		if (!ret) {
> > +		if (CIRC_SPACE(gc->wq_tail, desc->head, gc->wq_size) >= size) {
> >  			*offset = gc->wq_tail;
> >
> >  			/* advance the tail for next workqueue item */
> > @@ -505,7 +511,13 @@ static int guc_get_workqueue_space(struct i915_guc_client *gc, u32 *offset)
> >
> >  			/* this will break the loop */
> >  			timeout_counter = 0;
> > +			ret = 0;
> >  		}
> > +
> > +		spin_unlock_irqrestore(&gc->wq_lock, flags);
> > +
> > +		if (timeout_counter)
> > +			usleep_range(1000, 2000);
>
> Do we really not have an interrupt/signal from the guc when it has
> cleared up some space?
>
This is not implemented in the firmware, although I think it could be
done through the GuC-to-host interrupt. I'm worried that if we
implement this, the driver will end up handling too many interrupts
(possibly one per context switch). Ideally we don't want to handle
interrupts at all.
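So the plan is to keep polling, roughly along these lines — a
simplified sketch of the loop in the patch, with _irqsave replaced by
spin_lock_irq per the comments above, and the tail-advance logic left
out ("wait_for_wq_space" is a made-up name; the structs come from the
driver's intel_guc headers):

#include <linux/circ_buf.h>	/* CIRC_SPACE() */
#include <linux/delay.h>	/* usleep_range() */
#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static int wait_for_wq_space(struct i915_guc_client *gc,
			     struct guc_process_desc *desc, u32 size)
{
	int timeout_counter = 200;

	while (timeout_counter-- > 0) {
		spin_lock_irq(&gc->wq_lock);

		/* Free space in the circular work queue; the GuC
		 * advances desc->head as it consumes items. */
		if (CIRC_SPACE(gc->wq_tail, desc->head, gc->wq_size) >= size) {
			spin_unlock_irq(&gc->wq_lock);
			return 0;
		}

		spin_unlock_irq(&gc->wq_lock);

		/* Sleep with the lock dropped and interrupts enabled,
		 * so jiffies keep ticking while we wait. */
		if (timeout_counter)
			usleep_range(1000, 2000);
	}

	return -ETIMEDOUT;
}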
> > 	};
> >
> >  	kunmap_atomic(base);
> > @@ -597,19 +609,17 @@ int i915_guc_submit(struct i915_guc_client *client,
> >  {
> >  	struct intel_guc *guc = client->guc;
> >  	enum intel_ring_id ring_id = rq->ring->id;
> > -	unsigned long flags;
> >  	int q_ret, b_ret;
> >
> >  	/* Need this because of the deferred pin ctx and ring */
> >  	/* Shall we move this right after ring is pinned? */
> >  	lr_context_update(rq);
> >
> > -	spin_lock_irqsave(&client->wq_lock, flags);
> > -
> >  	q_ret = guc_add_workqueue_item(client, rq);
> >  	if (q_ret == 0)
> >  		b_ret = guc_ring_doorbell(client);
> >
> > +	spin_lock(&guc->host2guc_lock);
>
> So at first I thought there's a race now, but then I looked at what
> host2guc and wq_lock protect. It seems like the only thing they do is
> protect against debugfs; all the real protection against inconsistent
> state is done through dev->struct_mutex.
>
> Can't we just rip out all this spinlock business from the guc code?
> It would be easier than fixing up the races in here.
Yes, the host2guc lock can be replaced by dev->struct_mutex. But we
definitely don't want the process to be interrupted while the driver
is programming the GuC work queue and ringing the doorbell.
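Something like the following is what I have in mind — a sketch only,
assuming the caller holds dev->struct_mutex, that the rq parameter is
a drm_i915_gem_request as elsewhere in the driver, and that
guc_add_workqueue_item() / guc_ring_doorbell() take wq_lock internally
around their short, non-sleeping critical sections (as in the patch):

int i915_guc_submit(struct i915_guc_client *client,
		    struct drm_i915_gem_request *rq)
{
	struct intel_guc *guc = client->guc;
	enum intel_ring_id ring_id = rq->ring->id;
	int q_ret, b_ret = 0;

	lr_context_update(rq);

	/* May sleep waiting for WQ space, so no spinlock may be held
	 * across this call. */
	q_ret = guc_add_workqueue_item(client, rq);
	if (q_ret == 0)
		b_ret = guc_ring_doorbell(client);

	/* These counters only race with debugfs readers; the real
	 * submission state is already serialized by struct_mutex, so
	 * host2guc_lock can go away. */
	client->submissions[ring_id] += 1;
	guc->submissions[ring_id] += 1;
	guc->last_seqno[ring_id] = rq->seqno;

	return q_ret ? q_ret : b_ret;
}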
> -Daniel
>
> >  	client->submissions[ring_id] += 1;
> >  	if (q_ret) {
> >  		client->q_fail += 1;
> > @@ -620,9 +630,6 @@ int i915_guc_submit(struct i915_guc_client *client,
> >  	} else {
> >  		client->retcode = 0;
> >  	}
> > -	spin_unlock_irqrestore(&client->wq_lock, flags);
> > -
> > -	spin_lock(&guc->host2guc_lock);
> >  	guc->submissions[ring_id] += 1;
> >  	guc->last_seqno[ring_id] = rq->seqno;
> >  	spin_unlock(&guc->host2guc_lock);
> > --
> > 2.5.0
> >
>