2012/3/15 Ying Han <yinghan@xxxxxxxxxx>:
> On Wed, Mar 14, 2012 at 12:53 AM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
>> Hi all,
>> Just a quick question, could you please tell me what's the current
>> status of the development of per-cgroup background reclaim? This topic
>> seems to have gone silent after Ying Han's patchset V7
>> (http://lwn.net/Articles/440073/) and Kame's async reclaim patchset V3
>> (https://lkml.org/lkml/2011/5/26/20), and I can't find it in the
>> memcg-devel tree either.
>> Is anyone still working on this?
>
> There were some discussions at the time about going with a per-memcg
> kswapd thread versus a workqueue. I think we now agree to go with the
> per-memcg thread model. I haven't done much work since then, and one
> of the open questions is demonstrating the need for this feature.
>
> I am glad you are asking: do you have a workload showing problems
> without it?

Yes. The background is that we have a cluster of about 3k-4k servers, all running JVMs. Because the load of each Java application is small, we gave each of them a small GC heap, 1.5GB or so. Then, to make full use of the servers' large memory, we set up several Xen-based virtual machines on each physical box, with a single JVM running in each Xen VM.

Now we are trying to switch to an LXC/cgroup-based solution. As a first step we have built a small experimental cluster online, with the memcg-backed containers sized the same as the Xen VMs were (we haven't enabled other controllers, since the major demand for resources comes from memory, while the pressure on CPU and IO is usually smaller). As soon as the containers went online, we noticed that the latency recorded on the client side showed periodic peaks. We also noticed that the memory.failcnt counter increased periodically.
By enabling the kmem:mm_directreclaim_reclaimall, kmem:mm_vmscan_direct_reclaim_begin, and kmem:mm_vmscan_direct_reclaim_end trace events, we can see that kmem:mm_directreclaim_reclaimall fires regularly, while kmem:mm_vmscan_direct_reclaim_begin/end are never seen. That means the caller of do_try_to_free_pages() is try_to_free_mem_cgroup_pages(), not the global caller try_to_free_pages().

To balance load across the cluster, we tend to spread the Java apps evenly over the containers on different physical boxes; that is to say, unless the cluster is close to full, the number of active containers on each box won't be large. I think that under such a scenario the global memory pressure can never be high, so kswapd keeps sleeping most of the time. However, the local pressure within a single cgroup may be very high, which results in frequent direct reclaim.

The kernel we are using is a custom RHEL6U2 kernel. We have a kernel team here, so it's fine to backport the upstream solution if there is one.

--
Thanks,
Zhu Yanhai

> --Ying
>
>>
>> Thanks,
>> Zhu Yanhai