On Thu 18-10-12 15:16:54, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > > mem_cgroup_force_empty_list currently tries to remove all pages from > > the given LRU. To prevent from temoporary failures (EBUSY returned by > > mem_cgroup_move_parent) it uses a margin to the current LRU pages and > > returns the true if there are still some pages left on the list. > > > > If we consider that mem_cgroup_move_parent fails only when we are racing > > with somebody else removing the page (resp. uncharging it) or when the > > page is migrated then it is obvious that all those failures are only > > temporal and so we can safely retry later. > > Let's get rid of the safety margin and make the loop really wait for the > > empty LRU. The caller should still make sure that all charges have been > > removed from the res_counter because mem_cgroup_replace_page_cache might > > add a page to the LRU after the check (it doesn't touch res_counter > > though). > > This catches most of the cases except for shmem which might call > > mem_cgroup_replace_page_cache with a page which is not charged and on > > the LRU yet but this was the case also without this patch. In order to > > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > > back to the current mm's cgroup so it needs css_tryget to fail. This > > will be fixed up in a later patch because it nees a help from cgroup > > core. > > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxx> > > In the sense that "I looked at it and nothing seemed too scary". > > Reviewed-by: Tejun Heo <tj@xxxxxxxxxx> Thanks > > Some nitpicks below. > > > /* > > - * move charges to its parent. > > + * move charges to its parent or the root cgroup if the group > > + * has no parent (aka use_hierarchy==0). > > + * Although this might fail the failure is always temporary and it > > + * signals a race with a page removal/uncharge or migration. In the > > + * first case the page will vanish from the LRU on the next attempt > > + * and the call should be retried later. > > */ > > - > > Maybe convert to proper /** function comment while at it? these are internal functions and we usually do not create kerneldoc for them. But I can surely change it - it would deserve a bigger clean up then. > I also think it would be helpful to actually comment on each possible > failure case explaining why the failure condition is temporal. What about: " * Although this might fail (get_page_unless_zero, isolate_lru_page or * mem_cgroup_move_account fails) the failure is always temporary and * it signals a race with a page removal/uncharge or migration. In the * first case the page is on the way out and it will vanish from the LRU * on the next attempt and the call should be retried later. * Isolation from the LRU fails only if page has been isolated from * the LRU since we looked at it and that usually means either global * reclaim or migration going on. The page will either get back to the * LRU or vanish. * Finaly mem_cgroup_move_account fails only if the page got uncharged * (!PageCgroupUsed) or moved to a different group. The page will * disappear in the next attempt. " Better? Or should it rather be in the changelog? > > > /* > > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > > - * reclaim the pages page themselves - it just removes the page_cgroups. > > - * Returns true if some page_cgroups were not freed, indicating that the caller > > - * must retry this operation. > > + * reclaim the pages page themselves - pages are moved to the parent (or root) > > + * group. > > */ > > Ditto. > > > -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > int node, int zid, enum lru_list lru) > > { > > struct mem_cgroup_per_zone *mz; > > - unsigned long flags, loop; > > + unsigned long flags; > > struct list_head *list; > > struct page *busy; > > struct zone *zone; > > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > mz = mem_cgroup_zoneinfo(memcg, node, zid); > > list = &mz->lruvec.lists[lru]; > > > > - loop = mz->lru_size[lru]; > > - /* give some margin against EBUSY etc...*/ > > - loop += 256; > > busy = NULL; > > - while (loop--) { > > + do { > > struct page_cgroup *pc; > > struct page *page; > > > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > cond_resched(); > > } else > > busy = NULL; > > - } > > - return !list_empty(list); > > + } while (!list_empty(list)); > > } > > Is there anything which can keep failing until migration to another > cgroup is complete? This is not about migration to another cgroup. Remember there are no tasks in the group so we have no origin for the migration. I was talking about migrate_pages. > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > with other controllers and another controller's ->attach() is blocking > on something. I am not sure I understand your concern. There are no tasks and we will break out the loop if some appear. And yes we can retry a lot in pathological cases. But this is a group removal path which is not hot. > If so, busy-looping blindly probably isn't a good idea and we would > want at least msleep between retries (e.g. have two lists, throw > failed ones to the other and sleep shortly when switching the front > and back lists). we do cond_resched if we fail. > > + /* > > + * This is a safety check because mem_cgroup_force_empty_list > > + * could have raced with mem_cgroup_replace_page_cache callers > > + * so the lru seemed empty but the page could have been added > > + * right after the check. RES_USAGE should be safe as we always > > + * charge before adding to the LRU. > > + */ > > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); > > Maybe we want to trigger some warning if retry count gets too high? > At least for now? We can but is this really worth it? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>