Re: [patch] mm: skip rebalance of hopeless zones

On Fri, Dec 10, 2010 at 3:37 AM, Mel Gorman <mel@xxxxxxxxx> wrote:
> On Thu, Dec 09, 2010 at 10:39:46AM -0800, Ying Han wrote:
>> On Wed, Dec 8, 2010 at 5:23 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> > On Wed, 8 Dec 2010 16:36:21 -0800 Simon Kirby <sim@xxxxxxxxxx> wrote:
>> >
>> >> On Wed, Dec 08, 2010 at 04:16:59PM +0100, Johannes Weiner wrote:
>> >>
>> >> > Kswapd tries to rebalance zones persistently until their high
>> >> > watermarks are restored.
>> >> >
>> >> > If the amount of unreclaimable pages in a zone makes this impossible
>> >> > for reclaim, though, kswapd will end up in a busy loop without a
>> >> > chance of reaching its goal.
>> >> >
>> >> > This behaviour was observed on a virtual machine with a tiny
>> >> > Normal-zone that filled up with unreclaimable slab objects.
>> >> >
>> >> > This patch makes kswapd skip rebalancing on such 'hopeless' zones and
>> >> > leaves them to direct reclaim.
>> >>
>> >> Hi!
>> >>
>> >> We are experiencing a similar issue, though with a 757 MB Normal zone,
>> >> where kswapd tries to rebalance Normal after an order-3 allocation while
>> >> page cache allocations (order-0) keep splitting it back up again.  It can
>> >> run the whole day like this (SSD storage) without sleeping.
>> >
>> > People at google have told me they've seen the same thing.  A fork is
>> > taking 15 minutes when someone else is doing a dd, because the fork
>> > enters direct-reclaim trying for an order-one page.  It successfully
>> > frees some order-one pages but before it gets back to allocate one, dd
>> > has gone and stolen them, or split them apart.
>>
>> So we are running into this problem in a container environment. While
>> running dd in a container with a bunch of system daemons like sshd,
>> we've seen sshd being OOM killed.
>>
>
> It's possible that containers are *particularly* vulnerable to this
> problem because they don't have kswapd.
In our fake NUMA environment we do have per-container kswapds: the ones
serving the nodes in the container's nodemask. We also carry an
extension that consolidates all of a container's kswapds into one,
because of bad lock contention.

Since direct reclaimers go to sleep, the race between an order-1 page
being freed and another request breaking that page up again might be
far more severe.

One thing we found that affects the OOM is the logic in
inactive_file_is_low_global(), which tries to balance Active/Inactive
at 50%. Once pages are promoted to the Active list (dirty data), they
are safe from reclaim until the LRU becomes unbalanced again. So for
streaming IO we end up with pages on the Active list that will never be
used again and are never scanned by page reclaim either.
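
For reference, the check in mm/vmscan.c is roughly the following (quoted
from memory for kernels around this time, so treat it as a sketch rather
than the exact source): deactivation of file pages only happens while
active > inactive, so once the two lists reach 50/50 the active list
stops being trimmed.

static int inactive_file_is_low_global(struct zone *zone)
{
	unsigned long active, inactive;

	active = zone_page_state(zone, NR_ACTIVE_FILE);
	inactive = zone_page_state(zone, NR_INACTIVE_FILE);

	/* Deactivate file pages only while active outnumbers inactive. */
	return (active > inactive);
}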

--Ying




>
>> One theory, which we haven't fully proven, is that dd keeps allocating
>> and stealing the pages that were just reclaimed by sshd's ttfp. We've
>> talked with Andrew about whether there is a way to prevent that from
>> happening. We learned that there is something for order-0 pages, since
>> they get freed to the per-cpu list and the process that triggered ttfp
>> is likely to get them back unless it is rescheduled. But there is
>> nothing for order-1, which is what fork() needs in this case.
>>
>> --Ying
>>
>> >
>> > This problem would have got worse when slub came along doing its stupid
>> > unnecessary high-order allocations.
>> >
>> > Billions of years ago a direct-reclaimer had a one-deep cache in the
>> > task_struct into which it freed the page to prevent it from getting
>> > stolen.
>> >
>> > Later, we took that out because pages were being freed into the
>> > per-cpu-pages magazine, which is effectively task-local anyway.  But
>> > per-cpu-pages are only for order-0 pages.  See slub stupidity, above
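
(For reference, the order-0-only free path mentioned above looks roughly
like this in mm/page_alloc.c of that era: a sketch from memory, not the
exact source. Only order-0 frees go to the per-cpu lists; higher orders
go straight back to the buddy allocator, where anyone can grab them.)

void __free_pages(struct page *page, unsigned int order)
{
	if (put_page_testzero(page)) {
		if (order == 0)
			/* order-0: goes to the CPU-local magazine */
			free_hot_cold_page(page, 0);
		else
			/* order-1 and up: straight back to the buddy lists */
			__free_pages_ok(page, order);
	}
}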

