Sorry, my mailer might have used intelligence to send HTML (that is
what happens when the setup changes, I apologize). Resending in text
format.

On Sun, May 8, 2011 at 3:29 AM, Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
>
>
> On Tue, May 3, 2011 at 4:18 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> On Mon, May 02, 2011 at 03:07:29PM -0500, James Bottomley wrote:
>> > The fatal livelock in kswapd, reported in this thread:
>> >
>> >     http://marc.info/?t=130392066000001
>> >
>> > Is mitigateable if we prevent the cgroups code being so aggressive in
>> > its zone shrinking (by reducing its default shrink from 0 [everything]
>> > to DEF_PRIORITY [some things]).  This will have an obvious knock on
>> > effect to cgroup accounting, but it's better than hanging systems.
>>
>> Actually, it's not that obvious.  At least not to me.  I added Balbir,
>> who added said comment and code in the first place, to CC: Here is the
>> comment in full quote:
>>
>
> I missed this email in my inbox, just saw it and am responding.
>
>>
>>        /*
>>         * NOTE: Although we can get the priority field, using it
>>         * here is not a good idea, since it limits the pages we can scan.
>>         * if we don't reclaim here, the shrink_zone from balance_pgdat
>>         * will pick up pages from other mem cgroup's as well. We hack
>>         * the priority and make it zero.
>>         */
>>
>> The idea is that if one memcg is above its softlimit, we prefer
>> reducing pages from this memcg over reclaiming random other pages,
>> including those of other memcgs.
>>
>
> My comment and code were based on the observations I saw during my tests.
> With DEF_PRIORITY we see scan >> priority in get_scan_count(). Since we know
> how much exactly we are over the soft limit, it makes sense to go after the
> pages, so that normal balancing can be restored.
>
>>
>> But the code flow looks like this:
>>
>>        balance_pgdat
>>          mem_cgroup_soft_limit_reclaim
>>            mem_cgroup_shrink_node_zone
>>              shrink_zone(0, zone, &sc)
>>          shrink_zone(prio, zone, &sc)
>>
>> so the success of the inner memcg shrink_zone does at least not
>> explicitly result in the outer, global shrink_zone steering clear of
>> other memcgs' pages.
>
> Yes, but it allows soft reclaim to know what to target first for success.
>
>>
>> It just tries to move the pressure of balancing
>> the zones to the memcg with the biggest soft limit excess.  That can
>> only really work if the memcg is a large enough contributor to the
>> zone's total number of lru pages, though, and looks very likely to hit
>> the exceeding memcg too hard in other cases.
>>
>> I am very much for removing this hack.  There is still more scan
>> pressure applied to memcgs in excess of their soft limit even if the
>> extra scan is happening at a sane priority level.  And the fact that
>> global reclaim operates completely unaware of memcgs is a different
>> story.
>>
>> However, this code came into place with v2.6.31-8387-g4e41695.  Why is
>> it only now showing up?
>>
>> You also wrote in that thread that this happens on a standard F15
>> installation.  On the F15 I am running here, systemd does not
>> configure memcgs, however.  Did you manually configure memcgs and set
>> soft limits?  Because I wonder how it ended up in soft limit reclaim
>> in the first place.
>>
>
> I am running F15 as well, but never hit the problem so far. I am surprised
> to see the stack posted on the thread, it seemed like you
> never explicitly enabled anything to wake up the memcg beast :)
>

Balbir
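
For reference, the effect of the priority argument discussed above can
be sketched in a few lines of userspace C. This is only an illustration
of the "scan >> priority" step mentioned for get_scan_count(), not the
kernel code: sketch_scan_target() and SKETCH_DEF_PRIORITY are
hypothetical names, the value 12 is what I believe DEF_PRIORITY is
defined as in include/linux/mmzone.h, and the real function also weighs
anon against file pages, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors DEF_PRIORITY (believed to be 12) from mmzone.h. */
#define SKETCH_DEF_PRIORITY 12

/*
 * Hypothetical helper modelling only the shift by priority:
 * the number of LRU pages considered in one pass is the LRU size
 * shifted right by the priority. Priority 0 means "everything".
 */
uint64_t sketch_scan_target(uint64_t lru_pages, int priority)
{
	assert(priority >= 0 && priority <= SKETCH_DEF_PRIORITY);
	return lru_pages >> priority;
}
```

With 1048576 LRU pages, priority 0 targets all of them in one pass,
while SKETCH_DEF_PRIORITY targets 256. An LRU smaller than 4096 pages
yields 0 at the default priority, which is the "limits the pages we can
scan" concern from the quoted comment.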