On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
> On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
> > I have just realized that num_online_nodes should be much better
> > than MAX_NUMNODES.
> > Just for reference, the patch is based on top of
> > https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it
> > also doesn't make much sense without it)
> > 
> > Changes since v2:
> > - use num_online_nodes rather than MAX_NUMNODES
> > Changes since v1:
> > - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
> > ---
> > From: Michal Hocko <mhocko@xxxxxxx>
> > Subject: memcg: add nr_pages argument for hierarchical reclaim
> > 
> > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > pages (introduced by "memcg: stop vmscan when enough done.") we have
> > to be more careful. Currently we are using SWAP_CLUSTER_MAX, which is
> > OK for most callers but might cause failures for the limit resize or
> > force_empty code paths on big NUMA machines.
> 
> The limit resizing path retries as long as reclaim makes progress, so
> this is just handwaving.

The limit resizing paths do not check the return value of
mem_cgroup_hierarchical_reclaim, so the number of retries is not
affected by how much a single reclaim round manages to free (see the
simplified sketch at the end of this mail). It is true, though, that
fixing that directly would be much easier than this patch.

> After Kame's patch, the force-empty path has an increased risk of
> failing to move huge pages to the parent, because it tries reclaim
> only once. This could need further evaluation, and possibly a fix.

Agreed.

> But instead:
> 
> > @@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
> >  	if (!(gfp_mask & __GFP_WAIT))
> >  		return CHARGE_WOULDBLOCK;
> >  
> > +	/*
> > +	 * We are lying about nr_pages because we do not want to
> > +	 * reclaim too much for THP pages which should rather fallback
> > +	 * to small pages.
> > +	 */
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					      gfp_mask, flags, NULL);
> > +					      gfp_mask, flags, NULL,
> > +					      1);
> >  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >  		return CHARGE_RETRY;
> >  	/*
> 
> You tell it to reclaim _less_ than before, further increasing the risk
> of failure...
> 
> > @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
> >  		.may_writepage = !laptop_mode,
> >  		.may_unmap = 1,
> >  		.may_swap = !noswap,
> > -		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> > +		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
> 
> ...but wait, this transparently fixes it up and ignores the caller's
> request.
> 
> Sorry, but this is just horrible!

Yes, I do not like it either and tried to point that out in the
comment. Anyway, I agree that this doesn't solve the problem you are
describing above, and the limit resizing paths can be fixed much more
easily, so the patch is pointless.

> For the past weeks I have been chasing memcg bugs that came in with
> sloppy and untested code that was merged for handwavy reasons.

Yes, I feel a big responsibility for that.

> Changes to algorithms need to be tested and optimizations need to be
> quantified in other parts of the VM and the kernel, too. I have no
> idea why this doesn't seem to apply to the memory cgroup subsystem.

Yes, we should definitely do better during the review process.
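For reference, a simplified sketch of the resize retry pattern in
mm/memcontrol.c as of this time (identifier names are the ones from the
kernel of that era, but the retry-count scaling, memsw handling, and
locking are trimmed, so take it as an illustration of the pattern rather
than the verbatim source):

/*
 * Simplified sketch: the resize path loops until the smaller limit
 * fits, and the retry budget is decremented based on whether the
 * usage counter dropped. The number of pages that
 * mem_cgroup_hierarchical_reclaim() reports back is never consulted.
 */
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
				   unsigned long long val)
{
	int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
	u64 oldusage, curusage;
	int ret = 0;

	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);

	while (retry_count) {
		ret = res_counter_set_limit(&memcg->res, val);
		if (!ret)
			break;	/* usage already fits under the new limit */

		/* reclaim progress (pages freed) is ignored here */
		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
						MEM_CGROUP_RECLAIM_SHRINK,
						NULL);

		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
		if (curusage >= oldusage)
			retry_count--;	/* no progress on usage */
		else
			oldusage = curusage;
	}

	return ret;
}

So whether one reclaim round frees a single page or SWAP_CLUSTER_MAX
pages influences the number of retries only indirectly, through the
usage counter.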
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic