On Wed, Mar 14, 2012 at 8:12 PM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
> 2012/3/15 Ying Han <yinghan@xxxxxxxxxx>:
>> On Wed, Mar 14, 2012 at 12:53 AM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
>>> Hi all,
>>> Just a quick question: could you please tell me what the current
>>> status of the development of per-cgroup background reclaim is? This
>>> topic seems to have gone silent after Ying Han's patchset V7
>>> (http://lwn.net/Articles/440073/) and Kame's async reclaim patchset V3
>>> (https://lkml.org/lkml/2011/5/26/20), and I can't find it in the
>>> memcg-devel tree either.
>>> Is anyone still working on this?
>>
>> There were some discussions at that time about going with a per-memcg
>> kswapd thread or a workqueue, and I think we now agree to go with the
>> per-memcg thread model. I haven't done much work since then, and one
>> of the open questions is to demonstrate the need for this feature.
>>
>> I am glad you are asking; do you have a workload that shows problems
>> without it?
>
> Yes. The background is that we have a cluster of about 3k-4k servers,
> all running JVMs. Because the load of each Java application is small,
> we gave each of them a small GC heap, 1.5GB or so. Then, to make full
> use of the large memory on these servers, we set up several Xen-based
> virtual machines on each physical box, each Xen VM running one single
> JVM.
> Now we are trying to switch to an LXC/cgroup based solution. As a
> first step we have built a small experimental cluster online; the
> containers, managed with memcg, are sized the same as the Xen VMs (we
> haven't enabled other controllers, since the main resource pressure
> comes from memory, while CPU and I/O pressure is usually lower). As
> soon as they went online, we noticed that the latency recorded on the
> client side had periodic peaks. We also noticed that the
> memory.failcnt counter increased periodically. By enabling the
> kmem:mm_directreclaim_reclaimall, kmem:mm_vmscan_direct_reclaim_begin
> and kmem:mm_vmscan_direct_reclaim_end trace events, we could see that
> kmem:mm_directreclaim_reclaimall fired regularly, while
> kmem:mm_vmscan_direct_reclaim_begin/end were never seen. That means
> the caller of do_try_to_free_pages is try_to_free_mem_cgroup_pages,
> not the global caller try_to_free_pages.
> To balance load across the cluster, we tend to spread the Java apps
> evenly over the containers on different physical boxes; that is to
> say, unless the cluster is close to full, the number of active
> containers on each box won't be large. I think under such a scenario
> the global pressure can never be high, so kswapd keeps sleeping most
> of the time, while the local pressure within one single cgroup may be
> very high, which results in frequent direct reclaim.
> The kernel we are using is a custom RHEL6U2 kernel. We have a kernel
> team here, so it's fine to backport the upstream solution if there is
> one.

Thank you for the information. That sounds like exactly what the
per-memcg kswapd was designed for. Two things I am interested in
looking into now:

1. Why does per-memcg direct reclaim introduce such a noticeable
latency spike, and what workload can reproduce it?

2. We could quickly apply the last version of the per-memcg kswapd
patchset (V6) to your environment and see what difference it makes.
Maybe I can help with that.
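In the meantime, a minimal sketch like the one below could poll
memory.failcnt for one container and print the per-interval delta, so
the failcnt jumps can be lined up against the client-side latency
peaks. The cgroup v1 mount point and container name in it
("/cgroup/memory" and "jvm0") are only placeholder assumptions for
your setup.

/*
 * Minimal sketch: poll memory.failcnt for one memcg and print the
 * per-interval delta.  The mount point and cgroup name below are
 * placeholders -- adjust them for the actual hierarchy.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_failcnt(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		exit(1);
	}
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	const char *path = "/cgroup/memory/jvm0/memory.failcnt";
	unsigned long long prev, cur;

	prev = read_failcnt(path);
	for (;;) {
		sleep(5);
		cur = read_failcnt(path);
		printf("failcnt +%llu in the last 5s (total %llu)\n",
		       cur - prev, cur);
		prev = cur;
	}
	return 0;
}

If the failcnt deltas and the latency peaks line up, that further
supports the theory that the spikes come from hitting the memcg limit
and falling into per-memcg direct reclaim.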
Thanks
--Ying

> --
> Thanks,
> Zhu Yanhai
>
>> --Ying
>>
>>>
>>> Thanks,
>>> Zhu Yanhai