Re: [PATCH v4] drm/panfrost: Move the GPU reset bits outside the timeout handler

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Thu, 5 Nov 2020 14:39:53 +0100

On Thu, 5 Nov 2020 13:27:04 +0000
Steven Price <steven.price@xxxxxxx> wrote:

> > +	old_status = atomic_xchg(&queue->status,
> > +				 PANFROST_QUEUE_STATUS_STOPPED);
> > +	WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE &&
> > +		old_status != PANFROST_QUEUE_STATUS_STOPPED);
> > +	if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
> > +		goto out;  
> 
> NIT: It's slightly cleaner if you swap the above lines, i.e.:
> 
> 	if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
> 		goto out;
> 	WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);

I agree.

> 
> > +
> > +	drm_sched_stop(&queue->sched, bad);
> > +	if (bad)
> > +		drm_sched_increase_karma(bad);
> > +
> > +	stopped = true;
> > +
> > +	/*
> > +	 * Set the timeout to max so the timer doesn't get started
> > +	 * when we return from the timeout handler (restored in
> > +	 * panfrost_scheduler_start()).
> > +	 */
> > +	queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> > +
> > +out:
> >   	mutex_unlock(&queue->lock);
> >   
> >   	return stopped;
> >   }
> >   
> > +static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
> > +{
> > +	enum panfrost_queue_status old_status;
> > +
> > +	mutex_lock(&queue->lock);
> > +	old_status = atomic_xchg(&queue->status,
> > +				 PANFROST_QUEUE_STATUS_STARTING);
> > +	if (WARN_ON(old_status != PANFROST_QUEUE_STATUS_STOPPED))
> > +		goto out;  
> 
> The error handling isn't great here - in this case the queue status is 
> left in _STATUS_STARTING, which at best would lead to another WARN_ON 
> being hit, but also has the effect of ignoring job faults. Probably the 
> timeout would eventually get things back to normal.
> 
> Obviously this situation will never occur™, but we can do better either 
> by continuing with the normal logic below, or even better replacing 
> atomic_xchg() with an atomic_cmpxchg() (so leave the status alone if not 
> _STOPPED). Both seem like better error recovery options to me. But keep 
> the WARN_ON because something has clearly gone wrong if this happens.

The second approach doesn't unblock things if we end up with
old_status != STOPPED and the queue is really stopped (which shouldn't
happen, unless we have a problem in our state machine). I think I'll
go for the first option and restart the queue unconditionally (I'm
keeping the WARN_ON(), of course).