On Mon, Dec 16, 2013 at 04:02:20PM +0000, Lister, Ian wrote: > This patchset contains TDR and Watchdog reset against a 3.13 > drm-intel-nightly tree that is now about 2 weeks old. > > I have re-worked the TDR and Watchdog Reset features to integrate > them more closely with the existing TDR and scoring mechanism. > > This is still a work-in-progress and I am currently debugging a couple > of issues but I would like to get some early feedback. > > Thanks, > Ian. > > From f71a7de85e9d81be3aa3962c8fe2557235ff21c1 Mon Sep 17 00:00:00 2001 > Message-Id: <cover.1387201899.git.ian.lister@xxxxxxxxx> > From: ian-lister <ian.lister@xxxxxxxxx> > Date: Mon, 16 Dec 2013 13:51:39 +0000 > Subject: [RFC 00/13] TDR and Watchdog Reset > > This patchset adds support for per-engine timeout detection and recovery > and adds batch specific watchdog reset. Ok, I've read through the patches and dropped my high-level maintainer review on them. Overall this looks much better than the first tdr rfc and was a nice read. The few misplaced hunks (probably due to some rebase mixup) didn't really confuse the overall picture. For the detailed review I'll sign up Mika, he's been doing most of the reset/hangcheck related changes recently in upstream. The really tricky bits around the hangcheck code are the ordering constraints in combination with lock taking/releasing. Chris, Ville and me have mostly reviewed those parts, so I'll definitely double-check that before merging. For merging the patches I always prefer an incremental approach once early pieces are ready with review and testcases. Below a few more comments on the features overall, mostly around testing. > Per-ring TDR > The detection logic has been modified to detect hangs on individual > engines and pass this information through the to the recovery handler. > Rather than a global reset it will attempt a per-engine reset. The > registers associated with the ring are saved and restored so that > when the ring restarts it continues from the next instruction in the > ring. For example, if it was executing an MI_START_BATCH_BUFFER command > it will advance to the next instruction which is likely to be the > mailbox updates and user interrupt. This means that no extra effort > is required to deal with synchronisation. From the perspective of the > driver it looks like the batch buffer completed normally as all the > normal signalling will take place, however the context stats will > have been updated to flag up the guilty context. hangcheck code has always been terribly tricky, so I want solid test coverage for corner cases here. I think the following should do the trick: - Testcase that exercise some of the starvation issues you fix in the first patches. That looks like a real bug, so I also prefer to merge those patches as soon as they're ready. - Patches to exercise the per-ring reset stuff by using different contexts/file descriptors and checking that unrelated work does get completed correctly. The tricky bits here are pageflips (we already have a basic testcase for that) and semaphores. For semaphores we currently lack a real functional test, but Damien is signed up to write that once he's back from vacation. So probably useful to sync up with him. For pageflip tests you'll get style points if you use the CRC support to check for functional correctness of which buffer gets flipped to the front in case of hangs, but imo that's overkill. The pageflip tests are all subtests of kms_flip with "hang" somewhere in the name. The downside for Android is that this uses cairo, so atm it's not enabled. But apparently some group in VPG has libcairo ported already, so hopefully this will be sorted out soon. I guess you already have some other validation tests around, porting these to i-g-t (and then using a common codebase) would also be great (I don't yet have source access, but should have it soon). Given that we don't expose the hangcheck improvements to userspace (besides the fact that fewer contexts will be victimized) there's no need in that area for testcases. > Watchdog Reset > This is requested via flags to the batch buffer submission IOCTL. > It is currently only supported for the render and video rings. > The batch buffer command is surrounded by a hardware timer start > command and stop command. If the batch completes before the timer > expires then the timer is cancelled and no interrupt is generated > so everything continues normally. However if the batch hangs then > the timer will generate an interrupt and it will trigger an engine > reset. This feature requires per-ring TDR to do the recovery work. First a comment on the interface: I have honestly lost track of where exactly the watchdog threshold gets set, but I think we should leave that option to userspace. For testcases we need the usual mix for new features: - Exercising abi cornercases (especially rejecting evil/stupid data from userspace). - Some functional check. With the rendercpy/mediafill infrastructure it should be fairly easy to construct RCS/VCS batches which take way too long. The real big deal here is though that for this new abi extension I need an open-source user. So libva needs to be ported to use this feature for both it's vcs and rcs batches. Without an open-source user I can't merge this. The actual merge order for abi extensions is also rather strict: 1. Post both kernel patches and UMD enabling patches for a given feature. 2. Once the entire slice (so both KMD and UMD parts) is reviewed and tested the kernel parts get merged. 3. As soon as the kernel patches hit a stable git branch (i.e. drm-intel-next) we can merge the libdrm patches. 4. After a libdrm release the actual UMD patches can go in. A month ago we've had a little screw up with that and Dave Airlie (drm upstream maintainer) stepped in an reverted a few things (resulting in some temporary build breakage). So following the rules here is important. Thanks, Daniel PS: One thing I've forgotten in all this: Like with the from address in the patches your name in the s-o-b lines should be "Ian Lister" instead of "ian-lister". > > ian-lister (13): > drm/i915: Periodic sampling for hang detection > drm/i915: Improved hang detection logic > drm/i915: Additional ring operations for TDR > drm/i915: Force wake restore for TDR > drm/i915: Per-engine recovery > drm/i915: Communicating reset requests > drm/i915: Additional debug for TDR > drm/i915: TDR loose ends > drm/i915: Watchdog timer support functions > drm/i915: MI_LOAD_REGISTER_IMM fix > drm/i915: Added watchdog interrupt handling > drm/i915: Enabled watchdog timer interrupts > drm/i915: Exec buffer inserts watchdog commands > > drivers/gpu/drm/i915/i915_debugfs.c | 67 ++++ > drivers/gpu/drm/i915/i915_dma.c | 3 + > drivers/gpu/drm/i915/i915_drv.c | 46 +++ > drivers/gpu/drm/i915/i915_drv.h | 41 ++- > drivers/gpu/drm/i915/i915_gem.c | 26 +- > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 30 +- > drivers/gpu/drm/i915/i915_irq.c | 531 > ++++++++++++++++++--------- > drivers/gpu/drm/i915/i915_reg.h | 21 ++ > drivers/gpu/drm/i915/intel_display.c | 30 +- > drivers/gpu/drm/i915/intel_ringbuffer.c | 557 > +++++++++++++++++++++++++++-- > drivers/gpu/drm/i915/intel_ringbuffer.h | 53 +++ > drivers/gpu/drm/i915/intel_uncore.c | 373 ++++++++++++++++++- > include/drm/drmP.h | 7 + > 13 files changed, 1575 insertions(+), 210 deletions(-) > > -- > 1.8.5.1 > > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx