On Wed, Mar 14, 2012 at 8:12 PM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
> 2012/3/15 Ying Han <yinghan@xxxxxxxxxx>:
>> On Wed, Mar 14, 2012 at 12:53 AM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
>>> Hi all,
>>> Just a quick question: could you please tell me what the current
>>> status of the development of per-cgroup background reclaim is? This
>>> topic seems to have gone silent after Ying Han's patchset V7
>>> (http://lwn.net/Articles/440073/) and Kame's async reclaim patchset V3
>>> (https://lkml.org/lkml/2011/5/26/20), and I can't find it in the
>>> memcg-devel tree either.
>>> Is anyone still working on this?
>>
>> There were some discussions at that time about going with a per-memcg
>> kswapd thread or a workqueue, and I think we now agree to go with the
>> per-memcg thread model. I haven't done much work since then, and one
>> of the open questions is to demonstrate the need for this feature.
>>
>> I am glad you are asking; do you have a workload that shows problems
>> without it?
>
> Yes. The background is that we have a cluster of about 3k-4k servers,
> all running JVMs. Because the load of each Java application is small,
> we gave each of them a small GC heap, 1.5GB or so. Then, to make full
> use of the large memory on these servers, we set up several Xen-based
> virtual machines on each physical box, each Xen VM running one single
> JVM.
> Now we are trying to switch to an LXC/cgroup based solution. As a
> first step we have built a small experimental cluster online; the
> containers, managed with memcg, are sized the same as the Xen VMs (we
> haven't enabled other controllers, since the main resource pressure
> comes from memory, while CPU and I/O pressure is usually lower). As
> soon as they went online, we noticed that the latency recorded on the
> client side had periodic peaks. We also noticed that the
> memory.failcnt counter increased periodically. By enabling the
> kmem:mm_directreclaim_reclaimall, kmem:mm_vmscan_direct_reclaim_begin
> and kmem:mm_vmscan_direct_reclaim_end trace events, we could see that
> kmem:mm_directreclaim_reclaimall fired regularly, while
> kmem:mm_vmscan_direct_reclaim_begin/end were never seen. That means
> the caller of do_try_to_free_pages is try_to_free_mem_cgroup_pages,
> not the global caller try_to_free_pages.
> To balance load across the cluster, we tend to spread the Java apps
> evenly over the containers on different physical boxes; that is to
> say, unless the cluster is close to full, the number of active
> containers on each box won't be large. I think under such a scenario
> the global pressure can never be high, so kswapd keeps sleeping most
> of the time, while the local pressure within one single cgroup may be
> very high, which results in frequent direct reclaim.
> The kernel we are using is a custom RHEL6U2 kernel. We have a kernel
> team here, so it's fine to backport the upstream solution if there is
> one.

Thank you for the information. That sounds like exactly what the
per-memcg kswapd was designed for. Two things I am interested in
looking into now:

1. Why does per-memcg direct reclaim introduce such a noticeable
latency spike, and what workload can reproduce it?

2. We could quickly apply the last version of the per-memcg kswapd
patchset (V6) to your environment and see what difference it makes.
Maybe I can help with that.
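In the meantime, a minimal sketch like the one below could poll
memory.failcnt for one container and print the per-interval delta, so
the failcnt jumps can be lined up against the client-side latency
peaks. The cgroup v1 mount point and container name in it
("/cgroup/memory" and "jvm0") are only placeholder assumptions for
your setup.

/*
 * Minimal sketch: poll memory.failcnt for one memcg and print the
 * per-interval delta.  The mount point and cgroup name below are
 * placeholders -- adjust them for the actual hierarchy.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_failcnt(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		exit(1);
	}
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	const char *path = "/cgroup/memory/jvm0/memory.failcnt";
	unsigned long long prev, cur;

	prev = read_failcnt(path);
	for (;;) {
		sleep(5);
		cur = read_failcnt(path);
		printf("failcnt +%llu in the last 5s (total %llu)\n",
		       cur - prev, cur);
		prev = cur;
	}
	return 0;
}

If the failcnt deltas and the latency peaks line up, that further
supports the theory that the spikes come from hitting the memcg limit
and falling into per-memcg direct reclaim.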
Thanks
--Ying

> --
> Thanks,
> Zhu Yanhai
>
>> --Ying
>>
>>>
>>> Thanks,
>>> Zhu Yanhai