[PATCH] drm/i915: Revert shrinker changes from "Track unbound pages"

daniel at ffwll.ch (Daniel Vetter) · Thu, 10 Jan 2013 18:03:38 +0100

On Thu, Jan 10, 2013 at 04:58:30PM +0000, Chris Wilson wrote:
> On Thu, 10 Jan 2013 18:03:00 +0100, Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> > This partially reverts
> > 
> > commit 6c085a728cf000ac1865d66f8c9b52935558b328
> > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > Date:   Mon Aug 20 11:40:46 2012 +0200
> > 
> >     drm/i915: Track unbound pages
> > 
> > Closer inspection of that patch revealed a bunch of unrelated changes
> > in the shrinker:
> > - The shrinker count is now in pages instead of objects.
> > - For counting the shrinkable objects the old code only looked at the
> >   inactive list, the new code looks at all bounds objects (including
> >   pinned ones). That is obviously in addition to the new unbound list.
> > - The shrinker cound is no longer scaled with
> >   sysctl_vfs_cache_pressure. Note though that with the default tuning
> >   value of vfs_cache_pressue = 100 this doesn't affect the shrinker
> >   behaviour.
> > - When actually shrinking objects, the old code first dropped
> >   purgeable objects, then normal (inactive) objects. Only then did it,
> >   in a last-ditch effort idle the gpu and evict everything. The new
> >   code omits the intermediate step of evicting normal inactive
> >   objects.
> > 
> > Safe for the first change, which seems benign, and the shrinker count
> > scaling, which is a bit a different story, the endresult of all these
> > changes is that the shrinker is _much_ more likely to fall back to the
> > last-ditch resort of idling the gpu and evicting everything.  The old
> > code could only do that if something else evicted lots of objects
> > meanwhile (since without any other changes the nr_to_scan will be
> > smaller than the object count).
> > 
> > Reverting the vfs_cache_pressure behaviour itself is a bit bogus: Only
> > dentry/inode object caches should scale their shrinker counts with
> > vfs_cache_pressure. Originally I've had that change reverted, too. But
> > Chris Wilson insisted that it's too bogus and shouldn't again see the
> > light of day.
> > 
> > Hence revert all these other changes and restore the old shrinker
> > behaviour, with the minor adjustment that we now first scan the
> > unbound list, then the inactive list for each object category
> > (purgeable or normal).
> > 
> > A similar patch has been tested by a few people affected by the gen4/5
> > hangs which started to appear in 3.7, which some people bisected to
> > the "drm/i915: Track unbound pages" commit. But just disabling the
> > unbound logic alone didn't change things at all.
> > 
> > Note that this patch doesn't fix the referenced bugs, it only hides
> > the underlying bug(s) well enough to restore pre-3.7 behaviour. The
> > key to achieve that is to massively reduce the likelyhood of going
> > into a full gpu stall and evicting everything.
> > 
> > v2: Reword commit message a bit, taking Chris Wilson's comment into
> > account.
> > 
> > v3: On Chris Wilson's insistency, do not reinstate the rather bogus
> > vfs_cache_pressure change.
> > 
> > Tested-by: Greg KH <gregkh at linuxfoundation.org>
> > Tested-by: Dave Kleikamp <dave.kleikamp at oracle.com>
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=57122
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=57136
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: stable at vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> Acked-by: Chris Wilson <chris at chris-wilson.co.uk>

Picked up for -fixes.

> I just hope the clue bat descends soonest before we find another way of
> triggering the spurious hangs.

Indeed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch