On Tue, Jun 23, 2015 at 11:47:16AM +0100, Tomas Elf wrote:
> On 23/06/2015 11:05, Daniel Vetter wrote:
> >Your patches don't apply cleanly any more and I can't find a suitable
> >baseline where they would. But I'd like to see it all in context to
> >check a few things.
> >
> >Can you pls push a git branch with these somewhere?
> >
>
> Here's the baseline for my local tree:
>
> cd07637 - drm-intel-nightly: 2015y-04m-13d-09h-46m-59s UTC integration
> manifest <daniel.vetter@xxxxxxxx>

I don't have that baseline around here (any more at least). That happens
regularly with rebasing trees.

> I haven't updated it in a while, obviously, since I thought that could
> wait until we'd worked our way through the RFC series and I could get
> to work on the first real patch series.
>
> Is it possible for you to set up a local tree of your own with my
> baseline and my RFC patches on top, or would you prefer it if I push
> my branch to drm-private?

So yeah, I need your branch ;-)
-Daniel

> Thanks,
> Tomas
>
> >Thanks, Daniel
> >
> >On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>This patch series introduces the following features:
> >>
> >>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist
> >>mode.
> >>
> >>TDR is an umbrella term for everything that goes into detecting and
> >>recovering from GPU hangs; the term is more widely used outside of
> >>the upstream driver. This feature introduces an extensible framework
> >>that currently supports gen8 but that can easily be extended to
> >>support gen7 as well (which is already the case in GMIN, although
> >>unfortunately in a not quite upstreamable form). The code contained
> >>in this submission represents the essentials of what is currently in
> >>GMIN, merged with what is currently upstream (as of the time when
> >>this work commenced a few months back).
> >>
> >>This feature adds a new hang recovery path, which takes care of
> >>engine recovery only, alongside the legacy GPU reset path. Aside from
> >>adding support for per-engine recovery, this feature also introduces
> >>rules for how to promote a potential per-engine reset to a legacy,
> >>full GPU reset.
> >>
> >>The hang checker now integrates with the error handler in a slightly
> >>different way: it allows hang recovery on multiple engines at the
> >>same time by passing the error handler an engine flag mask in which a
> >>flag is set for every hung engine. This lets us schedule hang
> >>recovery once for all currently hung engines instead of scheduling
> >>one hang recovery per detected engine hang. Previously, when only
> >>full GPU reset was supported, this made no difference: whether one or
> >>four engines were hung at any given point, it all amounted to the
> >>same thing - the GPU getting reset. As it stands now, the behaviour
> >>differs depending on which engines are hung, since each engine is
> >>reset separately from all the others, so we have to think about this
> >>in terms of scheduling cost and recovery latency. (see open question
> >>1 below)
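> >>
> >>To make the engine flag mask idea concrete, here is a minimal sketch
> >>of the error handler entry point. The signature, the flag names and
> >>the per-engine bookkeeping are illustrative only, not necessarily
> >>what the actual patches implement:
> >>
> >>void i915_handle_error(struct drm_device *dev, u32 engine_mask,
> >>                       bool wedged, const char *msg)
> >>{
> >>        struct drm_i915_private *dev_priv = dev->dev_private;
> >>        struct intel_engine_cs *ring;
> >>        int i;
> >>
> >>        if (!engine_mask && !wedged)
> >>                return;
> >>
> >>        if (wedged) {
> >>                /* Legacy path: full GPU reset */
> >>                atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
> >>                                &dev_priv->gpu_error.reset_counter);
> >>        } else {
> >>                /* Mark every hung engine so that a single pass of
> >>                 * the recovery work item can reset them all. */
> >>                for_each_ring(ring, dev_priv, i)
> >>                        if (engine_mask & (1 << i))
> >>                                set_bit(I915_ENGINE_RESET_IN_PROGRESS,
> >>                                        &ring->hangcheck.flags);
> >>        }
> >>
> >>        /* Recovery runs from a work item so that it may grab
> >>         * struct_mutex (see the work queue note under feature 2). */
> >>        queue_work(dev_priv->wq, &dev_priv->gpu_error.work);
> >>}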
> >>
> >>OPEN QUESTIONS:
> >>
> >> 1. Do we want to investigate the possibility of per-engine hang
> >> detection? In the current upstream driver there is only one work
> >> queue that handles the hang checker, and everything from initial
> >> hang detection to final hang recovery runs in this thread. This
> >> makes sense if you're only supporting one form of hang recovery -
> >> full GPU reset, with nothing tied to any particular engine.
> >> However, as part of this patch series we're changing that by
> >> introducing per-engine hang recovery. It could make sense to
> >> introduce multiple work queues - one per engine - to run multiple
> >> hang checking threads in parallel.
> >>
> >> This could reduce recovery latency, since neither the hang checker
> >> nor the error handler would have to scan all engines every time it
> >> is scheduled. Instead, each engine's work queue would invoke a hang
> >> checker that only checks _that_ engine, and the error handler would
> >> then be invoked for _that_ engine only. If one engine has hung, the
> >> latency for reaching the hang recovery path for that engine would
> >> be (time to hang-check one engine) + (time to error-handle one
> >> engine), rather than the time it takes to hang-check all engines
> >> plus the time it takes to error-handle every engine detected as
> >> hung (in the worst case, all of them). There would potentially be
> >> as many hang checking and error handling threads running
> >> concurrently as there are engines in the hardware, but they would
> >> all run in parallel without any significant locking. The first
> >> point where any thread needs exclusive access to the driver is the
> >> actual hang recovery, but the time to get there would theoretically
> >> be lower, and per-engine hang recovery itself takes considerably
> >> less time than reliably detecting a hang in the first place.
> >>
> >> How much such a change would actually save still needs to be
> >> analysed and compared against the current single-threaded model,
> >> but it makes sense from a theoretical design point of view.
> >>
> >> 2. How does per-engine reset integrate with the public reset stats
> >> IOCTL? These stats are used for the GL robustness interface, and
> >> those tests currently fail when running per-engine hang recovery,
> >> since we treat per-engine recovery differently from full GPU
> >> recovery - something userland knows nothing about. When userland
> >> expects to hang the hardware, it expects the reset stats interface
> >> to reflect this, and that behaviour has changed as part of this
> >> code submission. There's more than one way to solve this. Here are
> >> two options:
> >>
> >>  1. Expose per-engine reset statistics and mark contexts as guilty
> >>  the same way for per-engine resets as for full GPU resets.
> >>
> >>  That would make this change to the hang recovery mechanism
> >>  transparent to userland, but it would change the semantics, since
> >>  an active context in the reset stats would no longer imply that
> >>  the GPU was fully reset.
> >>
> >>  2. Add a new set of statistics for per-engine reset (one group of
> >>  statistics per engine) to reflect the extended capabilities that
> >>  per-engine hang recovery offers (sketched below).
> >>
> >>  Would that be breaking the ABI?
> >>
> >> ... Or is there any other way of doing this?
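> >>
> >>As a strawman for option 2 (purely hypothetical, not something this
> >>series adds): grow the existing struct at the end, since trailing
> >>fields that old userland never reads are the usual way of extending
> >>an ioctl struct without breaking the ABI. The engine_reset_count
> >>field below is invented for illustration:
> >>
> >>struct drm_i915_reset_stats {
> >>        __u32 ctx_id;
> >>        __u32 flags;
> >>
> >>        /* Existing fields, full GPU resets only: */
> >>        __u32 reset_count;   /* all resets since boot/module reload */
> >>        __u32 batch_active;  /* batches lost while active, this ctx */
> >>        __u32 batch_pending; /* batches lost while pending, this ctx */
> >>        __u32 pad;
> >>
> >>        /* Hypothetical extension, one counter per engine: */
> >>        __u32 engine_reset_count[I915_NUM_RINGS];
> >>};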
> >>
> >>* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.
> >>
> >>This feature allows userland applications to control whether or not
> >>individual batch buffers should have a first-level, fine-grained,
> >>hardware-based hang detection mechanism on top of the ordinary,
> >>software-based periodic hang checker that is already in the driver.
> >>The advantage over relying solely on the current software-based hang
> >>checker is that the watchdog timeout mechanism is about 1000x quicker
> >>and more precise. Since it is not a full driver-level hang detection
> >>mechanism but only targets one individual batch buffer at a time, it
> >>can afford to be that quick without risking an increase in false
> >>positive hang detections.
> >>
> >>This feature includes the following changes:
> >>
> >>a) A watchdog timeout interrupt service routine that handles watchdog
> >>interrupts and connects them to per-engine hang recovery.
> >>
> >>b) Injection of watchdog timer enable/cancel instructions
> >>before/after the batch buffer start instruction in the ring buffer,
> >>so that the watchdog timeout is tied to the submission of an
> >>individual batch buffer.
> >>
> >>c) An extension of the DRM batch buffer interface that exposes the
> >>watchdog timeout feature to userland. We've got two open source
> >>groups in VPG currently in the process of integrating support for
> >>this feature, which should make it possible in principle to upstream
> >>this extension.
> >>
> >>There is currently full watchdog timeout support for gen7 in GMIN,
> >>and it is quite similar to the gen8 implementation, so there is
> >>nothing obvious that prevents us from upstreaming that code along
> >>with the gen8 code. However, watchdog timeout is fully dependent on
> >>the per-engine hang recovery path, which is not part of this code
> >>submission for gen7. Therefore, watchdog timeout support for gen7 has
> >>been excluded until per-engine hang recovery support for gen7 has
> >>landed upstream.
> >>
> >>As part of this submission we've had to reinstate the work queue that
> >>was previously in place between the error handler and the hang
> >>recovery path. The reason for this is that in the watchdog timeout
> >>case the per-engine recovery path is called directly from the
> >>interrupt handler, where there is no way of grabbing the struct_mutex
> >>that the hang recovery path requires. By reinstating the work queue
> >>we provide a unified execution context for the hang recovery code,
> >>allowing it to grab whatever locks it needs without sacrificing
> >>interrupt latency too much or sleeping indefinitely in hard interrupt
> >>context.
> >>
> >>* Feature 3: Context Submission Status Consistency checking
> >>
> >>Something that becomes apparent when you run long-duration operations
> >>tests with concurrent rendering processes and intermittently injected
> >>hangs is that the GPU sometimes seems to forget to send context
> >>completion interrupts to the driver. What this means is that the
> >>driver sometimes gets stuck on a context that never seems to finish,
> >>while the hardware has in fact completed it and is waiting for more
> >>work.
> >>
> >>The problem with this is that the per-engine hang recovery path
> >>relies on context resubmission to kick the hardware off again
> >>following an engine reset. This can only be done safely if the
> >>hardware and the driver share the same opinion about the current
> >>state. Therefore, we have extended the periodic hang checker to check
> >>for context submission state inconsistencies, aside from the hang
> >>checking it already does.
> >>
> >>If such a state is detected, it is assumed (based on experience) that
> >>a context completion interrupt has been lost somehow. If the state
> >>persists for some time, an attempt to correct it is made by faking
> >>the presumably lost context completion interrupt: we manually call
> >>the execlist interrupt handler, which is normally called from the
> >>main interrupt handler in response to a received context event
> >>interrupt. Just because an interrupt goes missing does not mean that
> >>the context status buffer (CSB) is not appropriately updated by the
> >>hardware, so we can expect to find all the recent changes to the
> >>context states for each engine captured there. If there are
> >>outstanding context status changes in store there, the faked context
> >>event interrupt allows the interrupt handler to act on them. In the
> >>case of a lost context completion interrupt, this prompts the driver
> >>to remove the already completed context from the execlist queue and
> >>move on to the next pending piece of work, thereby eliminating the
> >>inconsistency.
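> >>
> >>Roughly, the check could look like the sketch below. The idle test,
> >>the counter in the hangcheck state and the threshold are all
> >>illustrative, and the final call stands in for whatever the driver's
> >>actual execlist interrupt handler entry point is:
> >>
> >>/* Sketch only: run from the periodic hang checker. */
> >>#define CTX_INCONSISTENCY_LIMIT 3
> >>
> >>static void check_submission_consistency(struct intel_engine_cs *ring)
> >>{
> >>        struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>
> >>        /* Hardware reports no context currently executing... */
> >>        bool hw_idle = (I915_READ(RING_EXECLIST_STATUS(ring)) &
> >>                        EXECLIST_STATUS_CURRENT_CTX_MASK) == 0;
> >>        /* ...while the driver still has submissions outstanding. */
> >>        bool sw_busy = !list_empty(&ring->execlist_queue);
> >>
> >>        if (!(hw_idle && sw_busy)) {
> >>                ring->hangcheck.inconsistent_count = 0;
> >>                return;
> >>        }
> >>
> >>        /* Only act once the inconsistency has persisted over a few
> >>         * consecutive hang checker runs. */
> >>        if (++ring->hangcheck.inconsistent_count <
> >>            CTX_INCONSISTENCY_LIMIT)
> >>                return;
> >>
> >>        /* Fake the lost context event interrupt. The CSB is still
> >>         * updated by the hardware even when the interrupt goes
> >>         * missing, so draining it retires the completed context and
> >>         * moves on to the next pending submission. */
> >>        intel_lrc_irq_handler(ring);
> >>}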
> >>
> >>* Feature 4: Debugfs extensions for per-engine hang recovery and
> >>TDR/watchdog trace points.
> >>
> >>
> >>Tomas Elf (11):
> >>  drm/i915: Early exit from semaphore_waits_for for execlist mode.
> >>  drm/i915: Introduce uevent for full GPU reset.
> >>  drm/i915: Add reset stats entry point for per-engine reset.
> >>  drm/i915: Adding TDR / per-engine reset support for gen8.
> >>  drm/i915: Extending i915_gem_check_wedge to check engine reset in
> >>    progress
> >>  drm/i915: Disable warnings for TDR interruptions in the display
> >>    driver.
> >>  drm/i915: Reinstate hang recovery work queue.
> >>  drm/i915: Watchdog timeout support for gen8.
> >>  drm/i915: Fake lost context interrupts through forced CSB check.
> >>  drm/i915: Debugfs interface for per-engine hang recovery.
> >>  drm/i915: TDR/watchdog trace points.
> >>
> >> drivers/gpu/drm/i915/i915_debugfs.c     |  146 +++++-
> >> drivers/gpu/drm/i915/i915_dma.c         |   79 +++
> >> drivers/gpu/drm/i915/i915_drv.c         |  201 ++++++++
> >> drivers/gpu/drm/i915/i915_drv.h         |   91 +++-
> >> drivers/gpu/drm/i915/i915_gem.c         |   93 +++-
> >> drivers/gpu/drm/i915/i915_gpu_error.c   |    2 +-
> >> drivers/gpu/drm/i915/i915_irq.c         |  378 ++++++++++++--
> >> drivers/gpu/drm/i915/i915_params.c      |   10 +
> >> drivers/gpu/drm/i915/i915_reg.h         |   13 +
> >> drivers/gpu/drm/i915/i915_trace.h       |  298 +++++++++++
> >> drivers/gpu/drm/i915/intel_display.c    |   16 +-
> >> drivers/gpu/drm/i915/intel_lrc.c        |  858 ++++++++++++++++++++++++++++++-
> >> drivers/gpu/drm/i915/intel_lrc.h        |   16 +-
> >> drivers/gpu/drm/i915/intel_lrc_tdr.h    |   40 ++
> >> drivers/gpu/drm/i915/intel_ringbuffer.c |   87 +++-
> >> drivers/gpu/drm/i915/intel_ringbuffer.h |  109 ++++
> >> drivers/gpu/drm/i915/intel_uncore.c     |  241 ++++++++-
> >> include/uapi/drm/i915_drm.h             |    5 +-
> >> 18 files changed, 2589 insertions(+), 94 deletions(-)
> >> create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
> >>
> >>--
> >>1.7.9.5
> >>
> >>_______________________________________________
> >>Intel-gfx mailing list
> >>Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> >
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx