On Wed, Sep 16, 2020 at 09:26:58AM +0100, Chris Wilson wrote:
> Quoting Greg KH (2020-09-16 07:33:58)
> > On Tue, Sep 15, 2020 at 01:41:48PM +0100, Chris Wilson wrote:
> > > On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
> > > Forcibly evict stale csb entries") where, presumably, due to a missing
> > > Global Observation Point synchronisation, the write pointer of the CSB
> > > ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
> > > we see the GPU report more context-switch entries for us to parse, but
> > > those entries have not been written, leading us to process stale events,
> > > and eventually report a hung GPU.
> > >
> > > However, this effect appears to be much more severe than we previously
> > > saw on Icelake (though it might be best if we try the same approach
> > > there as well and measure), and Bruce suggested the good idea of resetting
> > > the CSB entry after use so that we can detect when it has been updated by
> > > the GPU. By instrumenting how long that may be, we can set a reliable
> > > upper bound for how long we should wait for:
> > >
> > >   513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
> > >
> > > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> > > References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
> >
> > What does "References:" mean? Should that be "Fixes:"?
>
> It's a reference to an earlier w/a for a previous generation for the
> same symptoms. This patch should supplement that w/a.

I see no such "reference" to that tag in
Documentation/process/submitting-patches.rst, so how were we supposed to
know this? :)

thanks,

greg k-h