> -----Original Message----- > From: daniel.vetter at ffwll.ch [mailto:daniel.vetter at ffwll.ch] On Behalf Of > Daniel Vetter > Sent: Monday, June 10, 2013 3:38 PM > To: Bloomfield, Jon > Cc: Carsten Emde; Chris Wilson; Jesse Barnes; Intel Graphics Development; > Steven Rostedt; Christoph Mathys; stable > Subject: Re: [PATCH] drm/i915: Workaround incoherence > between fences and LLC across multiple CPUs > > On Mon, Jun 10, 2013 at 3:17 PM, Bloomfield, Jon > <jon.bloomfield at intel.com> wrote: > >> -----Original Message----- > >> From: daniel.vetter at ffwll.ch [mailto:daniel.vetter at ffwll.ch] On > >> Behalf Of Daniel Vetter > >> Sent: Saturday, June 08, 2013 10:22 PM > >> To: Carsten Emde > >> Cc: Chris Wilson; Jesse Barnes; Intel Graphics Development; > >> Bloomfield, Jon; Steven Rostedt; Christoph Mathys; stable > >> Subject: Re: [PATCH] drm/i915: Workaround incoherence > >> between fences and LLC across multiple CPUs > >> > >> On Sat, Jun 8, 2013 at 10:41 PM, Carsten Emde <C.Emde at osadl.org> > wrote: > >> > On 04/08/2013 11:54 AM, Daniel Vetter wrote: > >> >> > >> >> On Fri, Apr 05, 2013 at 09:03:27AM -0700, Jesse Barnes wrote: > >> >>> > >> >>> On Thu, 4 Apr 2013 21:31:03 +0100 Chris Wilson > >> >>> <chris at chris-wilson.co.uk> wrote: > >> >>> > >> >>>> In order to fully serialize access to the fenced region and the > >> >>>> update to the fence register we need to take extreme measures on > >> >>>> SNB+, and manually flush writes to memory prior to writing the > >> >>>> fence register in conjunction with the memory barriers placed > >> >>>> around > >> the register write. > >> >>>> > >> >>>> Fixes i-g-t/gem_fence_thrash > >> >>>> > >> >>>> v2: Bring a bigger gun > >> >>>> v3: Switch the bigger gun for heavier bullets (Arjan van de Ven) > >> >>>> v4: Remove changes for working generations. > >> >>>> v5: Reduce to a per-cpu wbinvd() call prior to updating the fences. > >> >>>> v6: Rewrite comments to ellide forgotten history. > >> >>>> > >> >>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62191 > >> >>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk> > >> >>>> Cc: Jon Bloomfield <jon.bloomfield at intel.com> > >> >>>> Tested-by: Jon Bloomfield <jon.bloomfield at intel.com> (v2) > >> >>>> Cc: stable at vger.kernel.org > >> >>>> --- > >> >>>> drivers/gpu/drm/i915/i915_gem.c | 28 > >> +++++++++++++++++++++++----- > >> >>>> 1 file changed, 23 insertions(+), 5 deletions(-) > >> >>>> > >> >>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c > >> >>>> b/drivers/gpu/drm/i915/i915_gem.c index fa4ea1a..8f7739e 100644 > >> >>>> --- a/drivers/gpu/drm/i915/i915_gem.c > >> >>>> +++ b/drivers/gpu/drm/i915/i915_gem.c > >> >>>> @@ -2689,17 +2689,35 @@ static inline int fence_number(struct > >> >>>> drm_i915_private *dev_priv, > >> >>>> return fence - dev_priv->fence_regs; > >> >>>> } > >> >>>> > >> >>>> +static void i915_gem_write_fence__ipi(void *data) { > >> >>>> + wbinvd(); > >> >>>> +} > >> >>>> + > >> >>>> static void i915_gem_object_update_fence(struct > >> >>>> drm_i915_gem_object *obj, > >> >>>> struct > >> >>>> drm_i915_fence_reg *fence, > >> >>>> bool enable) > >> >>>> { > >> >>>> - struct drm_i915_private *dev_priv = obj->base.dev- > >dev_private; > >> >>>> - int reg = fence_number(dev_priv, fence); > >> >>>> - > >> >>>> - i915_gem_write_fence(obj->base.dev, reg, enable ? obj : > NULL); > >> >>>> + struct drm_device *dev = obj->base.dev; > >> >>>> + struct drm_i915_private *dev_priv = dev->dev_private; > >> >>>> + int fence_reg = fence_number(dev_priv, fence); > >> >>>> + > >> >>>> + /* In order to fully serialize access to the fenced region and > >> >>>> + * the update to the fence register we need to take extreme > >> >>>> + * measures on SNB+. In theory, the write to the fence register > >> >>>> + * flushes all memory transactions before, and coupled with the > >> >>>> + * mb() placed around the register write we serialise all > memory > >> >>>> + * operations with respect to the changes in the tiler. Yet, on > >> >>>> + * SNB+ we need to take a step further and emit an > >> >>>> + explicit > >> >>>> wbinvd() > >> >>>> + * on each processor in order to manually flush all memory > >> >>>> + * transactions before updating the fence register. > >> >>>> + */ > >> >>>> + if (HAS_LLC(obj->base.dev)) > >> >>>> + on_each_cpu(i915_gem_write_fence__ipi, NULL, 1); > >> >>>> + i915_gem_write_fence(dev, fence_reg, enable ? obj : > >> >>>> + NULL); > >> >>>> > >> >>>> if (enable) { > >> >>>> - obj->fence_reg = reg; > >> >>>> + obj->fence_reg = fence_reg; > >> >>>> fence->obj = obj; > >> >>>> list_move_tail(&fence->lru_list, > >> >>>> &dev_priv->mm.fence_list); > >> >>>> } else { > >> >>> > >> >>> > >> >>> I almost wish x86 just gave up and went fully weakly ordered. At > >> >>> least then we'd know we need barriers everywhere, rather than > >> >>> just in mystical places. > >> >>> > >> >>> Reviewed-by: Jesse Barnes <jbarnes at virtuousgeek.org> > >> >> > >> >> > >> >> Queued for -next, thanks for the patch. I am a bit unhappy that > >> >> the testcase doesn't work too well and can't reliably reproduce > >> >> this on all affected machines. But I've tried a bit to improve > >> >> things and it's very fickle. I guess we just have to accept this > >> >> :( > >> > > >> > > >> > Under real-time conditions when we expect deterministic response to > >> > external and internal events at any time within a couple of > >> > microseconds, invalidating and flushing the entire cache in a > >> > running system is unacceptable. This is introducing latencies of > >> > several milliseconds which was clearly shown in RT regression tests > >> > on a kernel with this patch applied > (https://www.osadl.org/?id=1543#c7602). > >> > We therefore have to revert it in the RT patch queue - kind of a > >> > workaround of a workaround. > >> > > >> > Would simply be wonderful, if we could get rid of the hateful wbinvd(). > >> > >> I'm somewhat surprised people have not started to scream earlier > >> about this one ;-) > >> > >> We're trying to figure out whether there's a less costly w/a (we have > >> some benchmark where it kills performance, too) but until that's done > >> we pretty much need to stick with it. If you want to avoid any bad > >> side-effects due to that w/a you can alternatively pin all gpu > >> rendering tasks to the same cpu core/thread. > >> -Daniel > >> -- > >> Daniel Vetter > >> Software Engineer, Intel Corporation > >> +41 (0) 79 365 57 48 - http://blog.ffwll.ch > > > > This is such a serious bug, I'm surprised it hasn't been seen elsewhere > already. We shouldn't just not fix it because the only known fix could have a > performance impact. There's no point having fast tiled accesses that can't be > relied upon. We need to release an interim fix while we look for better > solutions. > > > > Daniel, how does pinning to a single CPU avoid the wbinvd() ? The page- > fault code doesn't know not to apply the workaround in that case surely ? > > It's only that the incoherency only shows up if you both migrate threads > between different cpus and also need to steal fence registers. > If you pin all threads to the same cpu thread/core (with e.g. with numactl or > sched_setaffinity tools) then we haven't yet seen a corruption. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch Ok, you're saying we could back the fix out, and force all apps mapping tiled-buffers to set the CPU affinity to an agreed single CPU ? All apps would have to use the same CPU, otherwise we're back where we started, in which case I think performance would still nose-dive. --------------------------------------------------------------------- Intel Corporation (UK) Limited Registered No. 1134945 (England) Registered Office: Pipers Way, Swindon SN3 1RJ VAT No: 860 2173 47 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.