On Thu, May 09, 2013 at 11:18:23PM +1000, Dave Chinner wrote:
> > > Mel, I have identified the overly aggressive behavior you noticed to be a
> > > bug in the at-least-one-pass patch, which would ask the shrinkers to scan
> > > the full batch even when total_scan < batch. They would do their best for
> > > it, and eventually succeed. I also went further, and made that the
> > > behavior of direct reclaim only - the only case that really matters for
> > > memcg, and one in which we could argue that we are more or less desperate
> > > for small squeezes in memory. Thank you very much for spotting this.
> > >
> >
> > I haven't seen the relevant code yet but in general I do not think it is
> > a good idea for direct reclaim to potentially reclaim all of the slabs like
> > this. Direct reclaim does not necessarily mean the system is desperate
> > for small amounts of memory. Let's take a few examples where it would be
> > a poor decision to reclaim all the slab pages within direct reclaim.
> >
> > 1. Direct reclaim triggers because kswapd is stalled writing pages for
> >    memcg (see the code near the comment "memcg doesn't have any dirty pages
> >    throttling"). A memcg dirtying its limit of pages may cause a lot of
> >    direct reclaim and dump all the slab pages
> >
> > 2. Direct reclaim triggers because kswapd is writing pages out to swap.
> >    Similar to the memcg case above, kswapd failing to make forward progress
> >    triggers direct reclaim which then potentially reclaims all slab
> >
> > 3. Direct reclaim triggers because kswapd waits on congestion as there
> >    are too many pages under writeback. In this case, a large amount of
> >    writes to slow storage like USB could result in all slab being reclaimed
> >
> > 4. The system has been up a long time, memory is fragmented and the page
> >    allocator enters direct reclaim/compaction to allocate THPs. It would
> >    be very unfortunate if allocating a THP reclaimed all the slabs
> >
> > All of that is potentially bad and likely to make Dave put on his cranky
> > pants. I would much prefer if direct reclaim and kswapd treated slab
> > similarly and did not ask the shrinkers to do a full scan unless the
> > alternative is an OOM kill.
> Just keep in mind that I really don't care about micro-behaviours of
> the shrinker algorithm. What I look at is the overall cache balance
> under steady state workloads, the response to step changes in
> workload and what sort of overhead is seen to maintain system
> balance under memory pressure. So unless a micro-behaviour has an
> impact at the macro level, I just don't care one way or the other.
>

Ok, that's fine by me because I think what you are worried about can happen
too easily right now. A system in a steady state of streaming IO can decide
to reclaim excessively if direct reclaim becomes active -- a macro-level
change for a steady-state workload. However, Glauber has already said he
will either make a priority check in direct reclaim or make it
memcg-specific. I'm happy with either as either should avoid a large impact
at the macro level in response to a small change in the workload pattern.
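
For the sake of discussion, the sort of priority check I have in mind looks
something like the sketch below. It is untested and not against any
particular tree; names such as total_scan, batch_size, shrinkctl and
scan_objects follow the shrinker code in Glauber's series, and the reclaim
priority would still have to be plumbed down from scan_control, so treat it
as an illustration rather than a proposal. The point is only that a partial
batch gets forced when priority has dropped far enough that the alternative
is the OOM killer:

/*
 * Illustrative sketch only. Scan in batch_size chunks as usual, but
 * only push the shrinker through a final partial batch when we have
 * reached the lowest reclaim priority, i.e. the next step is the OOM
 * killer. kswapd and ordinary direct reclaim otherwise behave the same.
 */
while (total_scan >= batch_size ||
       (total_scan > 0 && priority == 0)) {
        unsigned long nr_to_scan = min(batch_size, total_scan);
        unsigned long ret;

        shrinkctl->nr_to_scan = nr_to_scan;
        ret = shrinker->scan_objects(shrinker, shrinkctl);
        if (ret == SHRINK_STOP)
                break;
        freed += ret;

        total_scan -= nr_to_scan;
        cond_resched();
}

That would confine the partial-batch behaviour to the near-OOM case, so the
steady-state balance you are measuring should be unaffected.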

> But I can put on cranky pants if you want, Mel. :)
>

Unjustified cranky pants just isn't the same :)

> > > Running postmark on the final result (at least on my 2-node box) shows
> > > something a lot saner. We are still stealing more inodes than before, but
> > > by around 15%. Since the correct balance is somewhat heuristic anyway, I
> > > personally think this is acceptable. But I am waiting to hear from you on
> > > this matter. Meanwhile, I am investigating further to try to pinpoint
> > > where exactly this comes from. It might either be because of the new
> > > node-aware behavior, or because of the increased calculation precision in
> > > the first patch.
> > >
> >
> > I'm going to defer to Dave as to whether that increased level of slab
> > reclaim is acceptable or not.
> Depends on how it changes the balance of the system. I won't know
> that until I run some new tests.
>

Thanks

-- 
Mel Gorman
SUSE Labs