Re: drm/nouveau: Possible hardware corruption of older GeForce card

rh <richard_hubbe11@xxxxxxxxxxx> · Fri, 22 Mar 2013 13:54:28 -0700

On Fri, 22 Mar 2013 14:54:03 -0500
Calvin Owens <jcalvinowens@xxxxxxxxx> wrote:

> On 03/21/13 02:56, Calvin Owens wrote:
> > On 03/21/13 02:24, Calvin Owens wrote:
> >> On 03/21/13 01:59, Ben Skeggs wrote:
> >>> On Thu, 2013-03-21 at 01:34 -0500, Calvin Owens wrote:
> >>>> DRM hasn't worked on my desktop machine (GeForce 9800) with
> >>>> Nouveau for a little while (v3.9-rc1 didn't), but worked as of
> >>>> commit e204378 on Linus' tree for one boot, and subsequently
> >>>> always fails.
> >>>>
> >>>> After running that version, v3.6, which has always worked in the
> >>>> past, also fails, which is obviously somewhat troubling.
> >>>>
> >>>> The card will POST, but when modesetting tries to happen, errors
> >>>> result and the console remains in VGA mode. On a second computer
> >>>> (on which I have also used this card in the past), I now get the
> >>>> same "PRAMIN readback failed" error and no DRM console.
> >>>>
> >>>> I don't want to get ahead of myself, since I have no idea what
> >>>> exactly is happening, but it certainly appears that booting
> >>>> e204378 somehow changed something on the hardware that is
> >>>> preventing nouveau modesetting from being successful in that and
> >>>> previous versions of the kernel.
> >>>>
> >>>> I was going to add debugging output from the nouveau tree HEAD,
> >>>> but it locks the machine hard with strange visual artifacts. Any
> >>>> other info I can provide? Any idea where I should start digging?
> >>>>
> >>>> Thanks,
> >>>> Calvin Owens
> >>>>
> >>>> (e204378 just happened to be HEAD when I pulled from Linus'
> >>>> tree; I can't narrow it down to something in Nouveau or DRM
> >>>> since I don't yet know how to undo the apparent hardware
> >>>> alteration)
> >>> Does doing a complete cold boot fix things temporarily until you
> >>> run with that revision again, or, is it of a permanent nature?
> >> No, it seems to be permanent.
> >>
> >>> If it's the latter, it sounds more like the hw (specifically, the
> >>> ram chips) is dying honestly...
> >>
> >> Is there some way to test that? The suddenness of it is what made
> >> me discount the possibility that the chip is dying - I've used
> >> this card almost daily for years in that desktop, so I would've
> >> expected intermittent failures rather than a sudden cutoff... but
> >> you could be right.
> >>
> > 
> > Just noticed this: the semicolon fixed below causes
> > nv50_display_flip_stop to return immediately instead of waiting for
> > the memory writes to appear, which may be the cause of some of those
> > DMA-related errors I was seeing. (I'll resend the patch separately)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nv50_display.c b/drivers/gpu/drm/nouveau/nv50_display.c
> > index 2db5799..96bc2f3 100644
> > --- a/drivers/gpu/drm/nouveau/nv50_display.c
> > +++ b/drivers/gpu/drm/nouveau/nv50_display.c
> > @@ -479,7 +479,7 @@ nv50_display_flip_wait(void *data)
> >  {
> >  	struct nv50_display_flip *flip = data;
> >  	if (nouveau_bo_rd32(flip->disp->sync, flip->chan->addr / 4) ==
> > -					      flip->chan->data);
> > +					      flip->chan->data)
> >  		return true;
> >  	usleep_range(1, 2);
> >  	return false;
> > 
> 
> I hope this whole thing doesn't seem too melodramatic... my thinking
> was simply that, given that MTBFs of video cards tend to be quoted
> in the 5+ year range, the probability of one dying in the 2-hour
> window between my rebooting with that kernel version had to be
> extremely low. It seemed as though the null hypothesis was rather
> certainly disproven.
> 
> Is this worth pursuing? I can try to find an identical card and see
> if it happens again... but if you're convinced it's a non-issue, I'll
> just forget about it.

Are you saying your card is bricked?

> 
> Thanks,
> Calvin
> 
> >>> Ben.

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel