On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
> On Wed, May 18, 2011 at 6:47 PM, Mel Gorman <mgorman@xxxxxxx> wrote:
> > On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
> >> On Tue, 17 May 2011 10:37:04 +0400
> >> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> >> > > On Mon, 16 May 2011 16:06:57 +0100
> >> > > Mel Gorman <mgorman@xxxxxxx> wrote:
> >> > >
> >> > > > Under constant allocation pressure, kswapd can be in the
> >> > > > situation where sleeping_prematurely() will always return true
> >> > > > even if kswapd has been running a long time. Check if kswapd
> >> > > > needs to be scheduled.
> >> > > >
> >> > > > Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> >> > > > Acked-by: Rik van Riel <riel@xxxxxxxxxx>
> >> > > > ---
> >> > > >  mm/vmscan.c |    4 ++++
> >> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> > > >
> >> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > > > index af24d1e..4d24828 100644
> >> > > > --- a/mm/vmscan.c
> >> > > > +++ b/mm/vmscan.c
> >> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > > >  	unsigned long balanced = 0;
> >> > > >  	bool all_zones_ok = true;
> >> > > >
> >> > > > +	/* If kswapd has been running too long, just sleep */
> >> > > > +	if (need_resched())
> >> > > > +		return false;
> >> > > > +
> >> > > >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >> > > >  	if (remaining)
> >> > > >  		return true;
> >> > >
> >> > > I'm a bit worried by this one.
> >> > >
> >> > > Do we really fully understand why kswapd is continuously running
> >> > > like this? The changelog makes me think "no" ;)
> >> > >
> >> > > Given that the page-allocating process is madly reclaiming pages in
> >> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> >> > > different CPU, we should pretty promptly get into a situation where
> >> > > kswapd can suspend itself. But that obviously isn't happening. So
> >> > > what *is* going on?
> >> >
> >> > The triggering workload is a massive untar using a file on the same
> >> > filesystem, so that's a continuous stream of pages read into the cache
> >> > for the input and a stream of dirty pages out for the writes. We
> >> > thought it might have been out-of-control shrinkers, so we already
> >> > debugged that and found it wasn't. It just seems to be an imbalance in
> >> > the zones that the shrinkers can't fix which causes
> >> > sleeping_prematurely() to return true almost indefinitely.
> >>
> >> Is the untar disk-bound? The untar has presumably hit the writeback
> >> dirty_ratio? So its rate of page allocation is approximately equal to
> >> the write speed of the disks?
> >>
> >
> > A reasonable assumption but it gets messy.
> >
> >> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
> >> tens-of-megabytes-per-second. If so, there's something seriously wrong
> >> here - under favorable conditions one would expect reclaim to free up
> >> 100,000 pages/sec, maybe more.
> >>
> >> If the untar is not disk-bound and the required page reclaim rate is
> >> equal to the rate at which a CPU can read, decompress and write to
> >> pagecache then, err, maybe possible. But it still smells of
> >> inefficient reclaim.
> >>
> >
> > I think it's higher than just the rate of data but couldn't guess by
> > how much exactly. Reproducing this locally would have been nice but
> > the following conditions are likely happening on the problem machine.
> >
> > SLUB is using high orders for its slabs, so kswapd and reclaimers are
> > reclaiming at a faster rate than required for just the data. SLUB is
> > using order-2 allocs for inodes, so for every 18 files created by the
> > untar, we need an order-2 page. For ext4_io_end, we need order-3
> > allocs and we are allocating these due to delayed block allocation.
> >
> > So for example: 50 files, each less than 1 page in size, need 50
> > order-0 pages, 3 order-2 pages and 2 order-3 pages.
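
To spell out the arithmetic in that example, here is a rough
back-of-envelope sketch in plain userspace C. It assumes 4K pages (so
an order-2 allocation is 4 pages and an order-3 is 8) and just reuses
the illustrative figures above; it is not measured data:

    #include <stdio.h>

    int main(void)
    {
        int files = 50;
        int inodes_per_order2 = 18; /* inodes per order-2 slab, from above */
        int order3_slabs = 2;       /* ext4_io_end slabs, from above */

        /* order-2 slabs needed for the inodes, rounding up */
        int order2_slabs = (files + inodes_per_order2 - 1) / inodes_per_order2;

        /* pages behind the high-order allocations alone */
        int pages = order2_slabs * 4 + order3_slabs * 8;

        printf("%d order-2 slabs + %d order-3 slabs => %d pages\n",
               order2_slabs, order3_slabs, pages);
        return 0;
    }

That prints "3 order-2 slabs + 2 order-3 slabs => 28 pages", which is
where the "at least 28 pages" below comes from.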
> > To satisfy the high-order pages, we are reclaiming at least 28
> > pages. For compaction, we are migrating these, so we are allocating
> > a further 28 pages and then copying, putting further pressure on the
> > system. We may do this multiple times as order-0 allocations could
> > be breaking up the pages again. Without compaction, we are only
> > reclaiming, but we can get stalled for significant periods of time
> > if dirty or writeback pages are encountered in the contiguous
> > blocks, and we can reclaim too many pages quite easily.
> >
> > So the rate of allocation required to write out the data is higher
> > than just the data rate. The reclaim rate could be just fine, but
> > the number of pages we need to reclaim to allocate slab objects can
> > be screwy.
> >
> >> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> >> > > seems pretty savage and I suspect it risks undesirable side-effects. A
> >> > > plain old cond_resched() would be more cautious. But presumably
> >> > > kswapd() is already running cond_resched() pretty frequently, so why
> >> > > didn't that work?
> >> >
> >> > So the specific problem with cond_resched() is that kswapd is still
> >> > runnable, so even if there's other work the system can be getting on
> >> > with, it quickly comes back to looping madly in kswapd. If we return
> >> > false from sleeping_prematurely(), we stop kswapd until it's woken up
> >> > to do more work. This manifests, even on non-Sandybridge systems that
> >> > don't hang, as a lot of time burned in kswapd.
> >> >
> >> > I think the Sandybridge bug I see on the laptop is that cond_resched()
> >> > is somehow ineffective: kswapd is usually hogging one CPU and there are
> >> > runnable processes, but they seem to cluster on other CPUs, leaving
> >> > kswapd to spin at close to 100% system time.
> >> >
> >> > When the problem was first described, we tried sprinkling more
> >> > cond_rescheds() in the shrinker loop and it didn't work.
> >>
> >> Seems to me that kswapd for some reason is doing too much work. Or,
> >> more specifically, is doing its work very inefficiently. Making kswapd
> >> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
> >>
> >
> > It is likely to be doing work inefficiently in one of two ways:
> >
> > 1. We are reclaiming far more pages than required by the data
> >    for slab objects
> >
> > 2. The rate we are reclaiming is fast enough that dirty pages are
> >    reaching the end of the LRU quickly
> >
> > The latter part is also important. I doubt we are getting stalled in
> > writepage, as this is new data being written to disk whose blocks
> > aren't allocated yet, but kswapd is encountering the dirty_ratio's
> > worth of pages on the LRU, churning them through the LRU and
> > reclaiming the clean pages in between.
> >
> > In effect, this "sorts" the LRU lists so the dirty pages get grouped
> > together. At worst, on a 2G system such as James', we have 104857
> > pages (20% of memory in pages) together on the LRU, all dirty and
> > all being skipped over by kswapd and direct reclaimers. That is at
> > least 3276 takings of the zone LRU lock, assuming we isolate pages
> > in groups of SWAP_CLUSTER_MAX, which is a lot of list walking and
> > CPU usage for no pages reclaimed.
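
For reference, those figures fall out as follows. A quick sanity-check
sketch, assuming 4K pages and that SWAP_CLUSTER_MAX is 32 as in
mm/vmscan.c:

    #include <stdio.h>

    int main(void)
    {
        unsigned long total_pages = (2UL << 30) / 4096; /* 2G of 4K pages */
        unsigned long dirty = total_pages * 20 / 100;   /* dirty_ratio worth */
        unsigned long cluster = 32;                     /* SWAP_CLUSTER_MAX */

        printf("%lu dirty pages parked on the LRU\n", dirty);
        printf("%lu zone LRU lock acquisitions to skip past them\n",
               dirty / cluster);
        return 0;
    }

which prints 104857 and 3276 respectively.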
> > In this case, kswapd might as well take a brief nap as it can't
> > clean the pages, so the flusher threads can get some work done.
> >
> >> It would be interesting to watch kswapd's page reclaim inefficiency
> >> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
> >> /proc/vmstat:kswapd_steal. If that ratio is high then kswapd is
> >> scanning many pages and not reclaiming them.
> >>
> >> But given the prominence of shrink_slab in the traces, perhaps that
> >> isn't happening.
> >>
> >
> > As we are aggressively shrinking slab, we can reach the stage where
> > we scan the requested number of objects and reclaim none of them,
> > potentially setting zone->all_unreclaimable to 1 if a lot of
> > scanning has also taken place recently without pages being freed.
> > Once this happens, kswapd isn't even trying to reclaim pages and is
> > instead stuck in shrink_slab until a page is freed, clearing
> > zone->all_unreclaimable and zone->pages_scanned.
>
> Why does it get stuck in shrink_slab?
> If the zone is in trouble for reclaim (i.e. all_unreclaimable is set),
> kswapd will poll the zone only at DEF_PRIORITY (i.e. a small window)
> to see when the problem goes away.

"Stuck in shrink_slab" was a poor choice of words; I should have said
we can spend a lot of time in there. True, kswapd will only poll the
zones while all_unreclaimable is set, but it only takes one page being
freed to the per-cpu list to clear all_unreclaimable again. Once any
zone has all_unreclaimable cleared, the watermarks are checked, but
with enough direct reclaimers it's possible the watermarks are met, so
shrink_zone() is not called but shrink_slab() is called anyway.
Depending on the result, all_unreclaimable can get set again (possibly
incorrectly, as there are simply no reclaimable slab objects rather
than the zone being truly unreclaimable).

Another scenario is that all zones except ZONE_DMA have
all_unreclaimable set when kswapd runs. kswapd finds the watermarks to
be ok, as the zone is only lightly used, so it skips shrink_zone() but
calls shrink_slab() anyway.

Both of these situations would allow kswapd to use a lot of CPU while
spending a significant percentage of it in shrink_slab().

-- 
Mel Gorman
SUSE Labs