2012/3/15 Ying Han <yinghan@xxxxxxxxxx>:
> On Wed, Mar 14, 2012 at 12:53 AM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
>> Hi all,
>> Just a quick question, could you please tell me what's the current
>> status of the development of per-cgroup background reclaim? This topic
>> seems to have gone silent after Ying Han's patchset V7
>> (http://lwn.net/Articles/440073/) and Kame's async reclaim patchset V3
>> (https://lkml.org/lkml/2011/5/26/20), and I can't find it in the
>> memcg-devel tree either.
>> Is anyone still working on this?
>
> There were some discussions at the time about going with a per-memcg
> kswapd thread versus a workqueue. I think we now agree to go with the
> per-memcg thread model. I haven't done much work since then, and one
> of the open questions is demonstrating the need for this feature.
>
> I am glad you are asking: do you have a workload showing problems
> without it?

Yes. The background is that we have a cluster of about 3k-4k servers, all running JVMs. Because the load of each Java application is small, we gave each of them a small GC heap, 1.5GB or so. Then, to make full use of the servers' large memory, we set up several Xen-based virtual machines on each physical box, with a single JVM running in each Xen VM.

Now we are trying to switch to an LXC/cgroup-based solution. As a first step we have built a small experimental cluster online, with the memcg-backed containers sized the same as the Xen VMs were (we haven't enabled other controllers, since the major demand for resources comes from memory, while the pressure on CPU and IO is usually smaller). As soon as the containers went online, we noticed that the latency recorded on the client side showed periodic peaks. We also noticed that the memory.failcnt counter increased periodically.
By enabling the kmem:mm_directreclaim_reclaimall, kmem:mm_vmscan_direct_reclaim_begin, and kmem:mm_vmscan_direct_reclaim_end trace events, we can see that kmem:mm_directreclaim_reclaimall fires regularly, while kmem:mm_vmscan_direct_reclaim_begin/end are never seen. That means the caller of do_try_to_free_pages() is try_to_free_mem_cgroup_pages(), not the global caller try_to_free_pages().

To balance load across the cluster, we tend to spread the Java apps evenly over the containers on different physical boxes; that is to say, unless the cluster is close to full, the number of active containers on each box won't be large. I think that under such a scenario the global memory pressure can never be high, so kswapd keeps sleeping most of the time. However, the local pressure within a single cgroup may be very high, which results in frequent direct reclaim.

The kernel we are using is a custom RHEL6U2 kernel. We have a kernel team here, so it's fine to backport the upstream solution if there is one.

--
Thanks,
Zhu Yanhai

> --Ying
>
>>
>> Thanks,
>> Zhu Yanhai