On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
>> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
>> > Not sure what you mean here. The check is fairly easy and has gotten us
>> > out of many a hole before, and makes for a good defense. So how would
>> > you want to fine tune it?
>>
>> Something like the MI_WAIT hangcheck score, but like I've said as long as
>> we don't have a real-world bug report (some poor guy disabled semaphores
>> maybe due to the snb issue?) not worth bothering at all.
>>
>> I've just thought that if we're unlucky and miss the interrupt a few times
>> in a row we don't want to accidentally declare the gpu dead.
>
> I regarded it as a driver bug, that a GPU reset would not help. So the
> choice is between limping along with the hopefully occasional stall, or
> terminating the GPU with extreme prejudice. I chose the former, hence
> did not increment the hangcheck.

Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
score now for a stuck, but idle ring? So pretty short of declaring the
thing dead? Ofc there's the slow decline if the gpu isn't actually
dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
times we'll eventually declare it dead despite the limping along.

Anyway nothing to really worry about, just wanted to check my
understanding here.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
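
The accumulation argument in the last reply can be checked with a small simulation. This is only a sketch of the arithmetic, not the driver's code: HUNG = 20 is the value quoted in the mail, while the decay of 1 per healthy check, the threshold of 30 and the helper name checks_until_dead() are assumptions made for illustration.

```c
/* HUNG is the value quoted in the mail. DECAY and DEAD_THRESHOLD are
 * hypothetical values chosen for this sketch only. */
#define HUNG            20  /* score added for a stuck but idle ring */
#define DECAY            1  /* assumed: score lost per healthy check */
#define DEAD_THRESHOLD  30  /* assumed: GPU declared dead above this */

/*
 * Simulate up to max_checks hangcheck runs in which every
 * stall_period-th run sees a stuck but idle ring.
 * Returns the 1-based run at which the score first exceeds
 * DEAD_THRESHOLD, or -1 if that never happens within max_checks.
 */
static int checks_until_dead(int stall_period, int max_checks)
{
	int score = 0;

	for (int check = 1; check <= max_checks; check++) {
		if (check % stall_period == 0)
			score += HUNG;          /* stalled run: bump the score */
		else if (score > 0)
			score -= DECAY;         /* healthy run: slow decline */

		if (score > DEAD_THRESHOLD)
			return check;
	}
	return -1;
}
```

Under these assumptions, a stall every 10 checks trips the threshold at check 20, while a stall every 21 checks lets the score decay back to zero each time, so checks_until_dead(21, 10000) returns -1. Where exactly the break-even lies depends on the assumed decay and threshold.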