Re: Radeon lockup on 3.8.5-201.fc18.x86_64

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Tue, 23 Apr 2013 12:31:02 -0700

On Tue, Apr 23, 2013 at 10:15 AM, Michel Dänzer <michel@xxxxxxxxxxx> wrote:
> On Tue, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:
>> On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer <michel@xxxxxxxxxxx> wrote:
>> > On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:
>> >
>> >> I'm not convinced there's an actual hang.  40 seconds is a long time,
>> >> and I've only ever seen this when clicking something, and when this
>> >> happens, the screen goes blank immediately (not after a 40 second
>> >> delay).
>> >
>> > Hmm, now that you mention this, I notice in your original report it
>> > claims that the CP stalled for 'more than 5102593msec', which is clearly
>> > bogus. Looks like something's wrong with the lockup detection.
>> > Did this start after a kernel update or something like that?
>>
>> It's recent.  It may have been when F18 switched from 3.7 to 3.8.
>
> Can you reproduce it with an upstream kernel? Can you bisect? I realize
> it'll probably take a long time, but unless someone has an idea which
> change might have introduced the problem...

Yuck.  I can try, but it takes days to reproduce this, so a bisect will
take forever (and may land on the wrong commit if I get lucky and a bad
kernel doesn't crash while I'm testing it).

>
>
>> I think there are bugs in the lockup detection and in the lockup
>> recovery.  Firefox, in particular, is *really* slow afterwards.  Are
>> interrupts possibly getting dropped or misconfigured during the reset?
>
> Let's not get ahead of ourselves and focus on the lockup detection issue
> for now.

I don't understand the r600_gpu_check_soft_reset code, but could this
be the sequence of events that triggers the false lockup report?

1. radeon_ring_is_lockup is called just as the very last command on
the ring completes, so last_rptr gets set to the rptr.
2. Nothing happens for a while (i.e. longer than lockup_timeout), so
rptr doesn't change.
3. A very slightly slow operation starts.
4. radeon_ring_is_lockup gets called before that command completes.

radeon_ring_test_lockup will not detect a jiffies wrap-around (because
there wasn't one), rptr will equal last_rptr (because there hasn't
been any progress since last time), and the elapsed time will be
really long, because the function hasn't been called for a long time.
So a lockup gets detected, even though nothing's wrong.
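
To make that sequence concrete, here is a minimal user-space sketch in
C.  It is a simplified model of the check as I described it above, not
the actual radeon code: struct ring_model, model_test_lockup,
replay_false_positive, the millisecond timestamps, and the 10000 ms
timeout are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-ring tracking state. */
struct ring_model {
    uint32_t rptr;             /* current read pointer */
    uint32_t last_rptr;        /* rptr recorded at the previous check */
    uint64_t last_activity_ms; /* time at which progress was last seen */
};

/*
 * Model of the check: if rptr moved since the last call, record the new
 * position and time and report no lockup.  Otherwise report a lockup
 * once the time since the last recorded activity reaches the timeout.
 */
static bool model_test_lockup(struct ring_model *ring, uint64_t now_ms,
                              uint64_t timeout_ms)
{
    if (ring->rptr != ring->last_rptr) {
        ring->last_rptr = ring->rptr;
        ring->last_activity_ms = now_ms;
        return false;
    }
    return (now_ms - ring->last_activity_ms) >= timeout_ms;
}

/* Replays steps 1-4 and returns what the final check reports. */
static bool replay_false_positive(void)
{
    struct ring_model ring = { .rptr = 100, .last_rptr = 0,
                               .last_activity_ms = 0 };
    const uint64_t timeout_ms = 10000; /* assumed value */

    /* Step 1: check runs just as the last command completes. */
    bool first = model_test_lockup(&ring, 1000, timeout_ms);
    assert(!first);

    /* Step 2: the ring sits idle for an hour; rptr does not move.  */
    /* Step 3: a slightly slow command starts; rptr still unchanged. */
    /* Step 4: the check runs again before that command completes.  */
    return model_test_lockup(&ring, 1000 + 3600000, timeout_ms);
}
```

In this model replay_false_positive() returns true: the second check
sees an unchanged rptr and an hour of "elapsed" time, even though the
ring was simply idle.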

There's a comment above radeon_ring_test_lockup that says:

 * A possible false positivie is if we get call after while and last_cp_rptr ==
 * the current CP rptr, even if it's unlikely it might happen. To avoid this
 * if the elapsed time since last call is bigger than 2 second than we return
 * false and update the tracking information. Due to this the caller must call
 * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
 * the fencing code should be cautious about that.

but the corresponding code doesn't appear to exist anywhere.
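
For reference, this is roughly what I would expect that guard to look
like, again as a user-space sketch in C rather than a patch.  The
struct, the function names, the last_check_ms field, and the 10000 ms
timeout are my own assumptions; only the 2 second threshold comes from
the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracking state; last_check_ms is an invented field. */
struct ring_guard_model {
    uint32_t rptr;
    uint32_t last_rptr;
    uint64_t last_activity_ms; /* time at which progress was last seen */
    uint64_t last_check_ms;    /* time of the previous call */
};

/*
 * Same model check as before, plus the guard the comment describes: if
 * more than 2 seconds passed since the previous call, refresh the
 * tracking information and report no lockup.
 */
static bool guarded_test_lockup(struct ring_guard_model *ring,
                                uint64_t now_ms, uint64_t timeout_ms)
{
    uint64_t since_last_check = now_ms - ring->last_check_ms;

    ring->last_check_ms = now_ms;
    if (since_last_check > 2000 || ring->rptr != ring->last_rptr) {
        ring->last_rptr = ring->rptr;
        ring->last_activity_ms = now_ms;
        return false;
    }
    return (now_ms - ring->last_activity_ms) >= timeout_ms;
}

/*
 * After an hour of idling, check once, then poll once per second with
 * rptr stuck.  Returns the number of polls before a lockup is reported,
 * or -1 if none is reported within 100 polls.
 */
static int polls_until_reported(void)
{
    struct ring_guard_model ring = {
        .rptr = 100, .last_rptr = 100,
        .last_activity_ms = 1000, .last_check_ms = 1000,
    };
    const uint64_t timeout_ms = 10000; /* assumed value */
    uint64_t now_ms = 1000 + 3600000;

    /* First call after the idle period: the guard suppresses it. */
    bool reported = guarded_test_lockup(&ring, now_ms, timeout_ms);
    assert(!reported);

    for (int polls = 1; polls <= 100; polls++) {
        now_ms += 1000;
        if (guarded_test_lockup(&ring, now_ms, timeout_ms))
            return polls;
    }
    return -1;
}
```

With a guard like this, the first call after a long idle period only
resets the tracking state, and a genuinely stuck rptr is still reported
once the caller keeps polling for the full timeout.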

Also, and unrelatedly, I retract my comment about the Gmail issues being
fixed with HyperZ off.  Gmail still draws incorrectly.  This may or
may not have anything to do with the radeon driver.

--Andy
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel