On Thu, Oct 08, 2015 at 11:46:08PM +0100, David Woodhouse wrote:
> On Thu, 2015-10-08 at 12:29 +0100, Tomas Elf wrote:
> >
> > Could someone clarify what this means from the TDR point of view,
> > please? When you say "context blew up" I'm guessing that you mean that
> > some context caused the fault handler to get involved somehow?
> >
> > Does this imply that the offending context will hang and the driver will
> > have to detect this hang? If so, then yes - if we have the per-engine
> > hang recovery mode as part of the upcoming TDR work in place then we
> > could handle it by stepping over the offending batch buffer and moving
> > on with a minimum of side-effects on the rest of the driver/GPU.
>
> I don't think the context does hang.
>
> I've made the page-request code artificially fail and report that it
> was an invalid page fault. The gem_svm_fault test seems to complete
> (albeit complaining that the test failed). Whereas if I just don't
> service the page-request at all, *then* the GPU hang is detected.
>
> I haven't actually looked at precisely what *is* happening.

Hm, if this still works the same way as on older platforms, then faulting
reads from the gpu just return all 0 and writes go nowhere. That would
also explain the ever-increasing CS execution pointer, since it's busy
churning through 48 bits' worth of address space filled with MI_NOP. I'd
have hoped our hw would do better than that with svm ...

If there's really no way to make it hang when we complete the fault, then
I guess we'll have to hang it by not completing. Otherwise we'll have to
roll our own fault detection code right from the start.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
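
To illustrate the point about the execution pointer, here is a rough toy model in Python. It is not driver code and not how i915 detects hangs. It only relies on the MI_NOP encoding being an all-zero dword (the i915 macro MI_NOOP evaluates to 0), which is why memory that reads back as zeroes looks like an endless run of no-ops. The function names, the (head, seqno) sampling and the stall_limit threshold are all hypothetical, chosen for illustration.

```python
# Toy model only: names and thresholds are assumptions, not i915 code.

MI_NOP = 0x00000000   # an all-zero dword decodes as a no-op command
DWORD_BYTES = 4       # command streamer instructions are dword-sized
ADDRESS_BITS = 48     # size of the address space mentioned in the thread


def nops_in_address_space(address_bits=ADDRESS_BITS):
    """Number of no-op dwords in a fully zero-filled address space."""
    return (1 << address_bits) // DWORD_BYTES


def classify(samples, stall_limit=3):
    """Classify a series of (head, seqno) samples taken at successive checks.

    Returns "ok" if the sequence number keeps advancing, "stuck" if neither
    the sequence number nor the head pointer moves for stall_limit checks,
    and "runaway" if the head pointer keeps moving while the sequence
    number does not (the zero-filled, no-op churning case).
    """
    stalled = 0
    head_moved = False
    prev_head, prev_seqno = samples[0]
    for head, seqno in samples[1:]:
        if seqno != prev_seqno:
            # Real progress: a request completed, so reset the counters.
            stalled = 0
            head_moved = False
        else:
            stalled += 1
            if head != prev_head:
                head_moved = True
        prev_head, prev_seqno = head, seqno
        if stalled >= stall_limit:
            return "runaway" if head_moved else "stuck"
    return "ok"
```

With these assumptions, nops_in_address_space() is 2**46, and classify([(0, 1), (4, 1), (8, 1), (12, 1)]) returns "runaway": the head pointer advances on every check, so a detector that only asks whether the head pointer is still moving would not flag it. That is the case a home-grown fault detector would have to cover if the fault is completed instead of left pending.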