On Tue, Jun 16, 2015 at 08:33:13PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 06:32:22PM +0100, Tomas Elf wrote:
> > On 16/06/2015 17:55, Chris Wilson wrote:
> > > On Tue, Jun 16, 2015 at 04:43:55PM +0100, Tomas Elf wrote:
> > > > On 16/06/2015 14:43, Daniel Vetter wrote:
> > > > > On Mon, Jun 08, 2015 at 06:03:20PM +0100, Tomas Elf wrote:
> > > > > > The TDR ULT used to validate this patch series requires a
> > > > > > special uevent for full GPU resets in order to distinguish
> > > > > > between different kinds of resets.
> > > > > >
> > > > > > Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
> > > > >
> > > > > Why duplicate the uevent we send out from i915_reset_and_wakeup?
> > > > > At least I can't spot what this gets us in addition to the
> > > > > existing one.
> > > > > -Daniel
> > > >
> > > > Look at this line:
> > > > > > + reset_event[0] = kasprintf(GFP_KERNEL, "%s", "GPU RESET=0");
> > > >
> > > > It doesn't exist in reset_and_wakeup (specifically, the
> > > > "GPU RESET=0" part). It's a uevent that is sent at the time of the
> > > > actual GPU reset (the GDRST register write). In the subsequent TDR
> > > > commit we add another one at the point of the actual engine reset,
> > > > which also includes information about exactly which engine was
> > > > reset.
> > > >
> > > > The uevents in reset_and_wakeup only tell the user that an error
> > > > has been detected and that some kind of reset has happened; these
> > > > new uevents specify exactly what kind of reset has happened. This
> > > > particular one on its own is not very meaningful, since there is
> > > > only one supported form of reset at this point, but once we add
> > > > engine reset support it's useful to be able to tell the types of
> > > > reset apart (GPU reset, RCS engine reset, VCS engine reset, VCS2
> > > > engine reset, BCS engine reset, VECS engine reset).
> > > >
> > > > Does that make sense?
> > >
> > > The ultimate question is: how do you envisage these uevents being
> > > used?
> > >
> > > At present, we have abrtd listening out for when to grab
> > > /sys/drm/cardX/error, and maybe for GL robustness (though I would
> > > imagine that if they thought such events useful we would have had
> > > demands for a DRM event on the guilty/victim fd).
> > >
> > > Does it really make sense to send uevents for hang, partial reset,
> > > and full reset?
> > > -Chris
> >
> > The reason we have such a detailed set of uevents is primarily for
> > testing purposes. Our internal VPG tests check for these uevents to
> > make sure that the expected recovery mode is actually being used.
> > Which makes sense, because the TDR driver code contains reset
> > promotion logic to decide which recovery mode to use, and if that
> > logic somehow gets broken the driver might go with the wrong
> > recovery mode. Thus it's worth testing, and therefore those uevents
> > need to be there. Of course, I guess the argument "our internal VPG
> > tests do this" might not hold water since the tests haven't been
> > upstreamed? If that's the case then I don't have any strong opinion
> > about which uevent goes where, and we could go with whatever set of
> > uevents you prefer.
>
> uevents aren't free, so if we can find a way to get coverage without
> waking up the buses, I'd be happy (though we shouldn't need that many
> reset uevents; it is mainly the principle of trying to use the right
> tool for the job and not just the first one that comes to hand). Also,
> uevents are by their very nature userspace ABI, so we should plan
> carefully how we expect them to be used.
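For reference, the emission under discussion uses the standard kobject
uevent plumbing. A minimal sketch, modeled on the existing
i915_reset_and_wakeup() code; the helper name and the exact call site
(just before the GDRST write) are assumptions, not the patch's actual
code:

  #include <drm/drmP.h>
  #include <linux/kobject.h>
  #include <linux/slab.h>

  /* Hypothetical helper, modeled on i915_reset_and_wakeup().
   * "GPU RESET=0" denotes a full GPU reset; the later TDR commits
   * add per-engine variants of this string. */
  static void i915_send_reset_uevent(struct drm_device *dev)
  {
          char *reset_event[2];

          reset_event[0] = kasprintf(GFP_KERNEL, "%s", "GPU RESET=0");
          reset_event[1] = NULL;

          if (reset_event[0]) {
                  kobject_uevent_env(&dev->primary->kdev->kobj,
                                     KOBJ_CHANGE, reset_event);
                  kfree(reset_event[0]);
          }
  }
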
Yeah, uevents are ABI with full backwards-compatibility guarantees; I
don't want to add ABI just for testing. What we've done for the arb
robustness testcases is batches with canary writes, to observe whether
all the right ones have been cancelled/killed depending upon where the
hanging batch is. Imo that's also more robust testing - we test
behavioural expectations, whereas tight coupling of the tests to the
current kernel implementation only causes long-term maintenance
headaches.

E.g. a testcase that expects an engine reset:
- Submit a dummy workload on ring A, followed by a canary write.
- Submit a hanging batch on ring B.
- Wait for the reset.
- Check that the canary write on ring A happened.

A testcase which expects a full-blown gpu reset:
- Same setup, but check that the canary write didn't happen.

Vary as needed and correlate with the reset stats the kernel reports. A
rough sketch of the engine-reset case is appended below.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
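A minimal sketch of the engine-reset case, assuming hypothetical helpers
for the execbuf plumbing -- submit_canary_write(), submit_hanging_batch(),
wait_for_reset() and canary_landed() are stand-ins, not real IGT API:

  /*
   * Sketch only: the four helpers are assumed to wrap the usual GEM
   * execbuf plumbing (an MI_STORE_DWORD_IMM batch for the canary, a
   * batch that spins forever for the hang, polling the kernel's reset
   * stats for the wait).
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define RING_A 0        /* e.g. the render ring */
  #define RING_B 1        /* e.g. the blitter ring */

  int open_i915(void);                                /* hypothetical */
  void submit_canary_write(int fd, int ring, uint32_t value);
  void submit_hanging_batch(int fd, int ring);
  void wait_for_reset(int fd);
  bool canary_landed(int fd, uint32_t value);

  int main(void)
  {
          const uint32_t canary = 0xc0ffee;
          int fd = open_i915();

          /* Queue innocent work plus a canary write on ring A... */
          submit_canary_write(fd, RING_A, canary);
          /* ...then hang ring B and let the recovery logic run. */
          submit_hanging_batch(fd, RING_B);
          wait_for_reset(fd);

          /* Engine reset: ring A is untouched, the canary must land.
           * Full GPU reset: everything is cancelled, it must not. */
          printf("canary %s\n", canary_landed(fd, canary) ?
                 "landed (engine reset)" : "missing (full gpu reset)");
          return 0;
  }

The full-reset variant just inverts the final check: after a global reset
the canary must not have landed. The reset stats to correlate against are
presumably the ones exposed through the i915 get-reset-stats ioctl.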