On Fri, Jun 10, 2011 at 9:34 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> On Fri, Jun 10, 2011 at 08:47:55AM +0900, Minchan Kim wrote:
>> On Fri, Jun 10, 2011 at 8:41 AM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>> > On Fri, Jun 10, 2011 at 2:23 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>> >> On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
>> >>> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
>> >>> > When a memcg hits its hard limit, hierarchical target reclaim is
>> >>> > invoked, which goes through all contributing memcgs in the hierarchy
>> >>> > below the offending memcg and reclaims from the respective per-memcg
>> >>> > lru lists.  This distributes pressure fairly among all involved
>> >>> > memcgs, and pages are aged with respect to their list buddies.
>> >>> >
>> >>> > When global memory pressure arises, however, all this is dropped
>> >>> > overboard.  Pages are reclaimed based on global lru lists that have
>> >>> > nothing to do with container-internal age, and some memcgs may be
>> >>> > reclaimed from much more than others.
>> >>> >
>> >>> > This patch makes traditional global reclaim consider container
>> >>> > boundaries and no longer scan the global lru lists.  For each zone
>> >>> > scanned, the memcg hierarchy is walked and pages are reclaimed from
>> >>> > the per-memcg lru lists of the respective zone.  For now, the
>> >>> > hierarchy walk is bounded to one full round-trip through the
>> >>> > hierarchy, or until the number of reclaimed pages reaches the
>> >>> > overall reclaim target, whichever comes first.
>> >>> >
>> >>> > Conceptually, global memory pressure is then treated as if the root
>> >>> > memcg had hit its limit.  Since all existing memcgs contribute to the
>> >>> > usage of the root memcg, global reclaim is nothing more than target
>> >>> > reclaim starting from the root memcg.  The code is mostly the same
>> >>> > for both cases, except for a few heuristics and statistics that do
>> >>> > not always apply.  They are distinguished by a newly introduced
>> >>> > global_reclaim() primitive.
>> >>> >
>> >>> > One implication of this change is that pages have to be linked to
>> >>> > the lru lists of the root memcg again, which could be optimized away
>> >>> > with the old scheme.  The costs are not measurable, though, even
>> >>> > with worst-case microbenchmarks.
>> >>> >
>> >>> > As global reclaim no longer relies on global lru lists, this change
>> >>> > is also in preparation to remove those completely.
>> >>
>> >> [cut diff]
>> >>
>> >>> I haven't looked at it all yet, and you might change the logic in
>> >>> later patches.  If I understand this patch right, it does round-robin
>> >>> reclaim over all memcgs when global memory pressure happens.
>> >>>
>> >>> Let's consider a case where memcg sizes are unbalanced.
>> >>>
>> >>> If A-memcg has lots of LRU pages, its scan count for reclaim is
>> >>> bigger, so the chance of reclaiming its pages is higher.
>> >>> If we reclaim from A-memcg, we can easily reclaim the number of pages
>> >>> we want and break out.  The next reclaim will happen at some point and
>> >>> will start at B-memcg, the memcg after the A-memcg we reclaimed from
>> >>> successfully before.  But unfortunately B-memcg has a small LRU, so
>> >>> its scan count is small and its LRU ages faster than the bigger
>> >>> memcg's.  That means a small memcg's working set can be evicted more
>> >>> easily than a big memcg's.
>> >>> My point is that we should not move on to the next memcg so easily.
>> >>> We have to consider the memcg LRU size.
>> >>
>> >> I may be missing something, but you said yourself that B had a smaller
>> >> scan count compared to A, so the aging speed should be proportional to
>> >> the respective size.
>> >>
>> >> The number of pages scanned per iteration is essentially
>> >>
>> >>        number of lru pages in memcg-zone >> priority
>> >>
>> >> so we scan relatively more pages from B than from A each round.
>> >>
>> >> It's the exact same logic we have been applying traditionally to
>> >> distribute pressure fairly among zones to equalize their aging speed.
>> >>
>> >> Is that what you meant or are we talking past each other?
>> >
>> > True, if we can reclaim pages easily (i.e., at default priority) in all
>> > memcgs.  But let's think about it.
>> > Normally the direct reclaim path reclaims only SWAP_CLUSTER_MAX pages.
>> > If we have a small memcg, its scan window is smaller, so it is harder
>> > to reclaim from it at a given priority than from a bigger memcg.  That
>> > means the priority is raised more easily for the small memcg, and under
>> > global memory pressure it might even invoke lumpy reclaim or
>> > compaction.  That can churn the whole LRU order. :(
>> > Of course, we have a bailout routine, so we might keep such unfair
>> > aging effects small, but it's not the same as the old behavior (i.e., a
>> > single LRU list with globally fair aging as the priority is raised).
>>
>> To make it fair, how about moving on to a different memcg before
>> raising the priority?
>> That would keep the aging speed fair, but it could cause high contention
>> on the lru_lock. :(
>
> Actually, the way you describe it is how it used to work for limit
> reclaim before my patches.  It would select one memcg, then reclaim
> with increasing priority until SWAP_CLUSTER_MAX pages were reclaimed.
>
>        memcg = select_victim()
>        for each prio:
>          for each zone:
>            shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })
>
> What it's supposed to do with my patches is scan all memcgs in the
> hierarchy at the same priority.  If it hasn't made progress, it will
> increase the priority and iterate again over the hierarchy.
>
>        for each prio:
>          for each zone:
>            for each memcg:
>              do_shrink_zone(prio, zone, sc = { .mem_cgroup = memcg })
>

Right you are.  I got confused with the old behavior, which wasn't good.
Your way is very desirable to me and my concern has disappeared.

Thanks, Hannes.

--
Kind regards,
Minchan Kim
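
[Editor's note: a minimal, self-contained C sketch of the loop structure
Hannes contrasts above.  This is a toy model only, not mm/vmscan.c: the
names toy_memcg, toy_zone, do_shrink_zone(), NR_PRIORITIES and
RECLAIM_TARGET are hypothetical stand-ins, and the scan count simply
mirrors the "lru pages in memcg-zone >> priority" rule quoted earlier.]

        /*
         * Toy model of the new loop structure described above.  NOT
         * mm/vmscan.c: all names here are hypothetical stand-ins chosen
         * purely for illustration.
         */
        #include <stdio.h>

        #define NR_PRIORITIES  12       /* analogous to DEF_PRIORITY, counted down */
        #define NR_ZONES        3
        #define NR_MEMCGS       4
        #define RECLAIM_TARGET 32       /* analogous to SWAP_CLUSTER_MAX */

        struct toy_memcg { int lru_pages; };
        struct toy_zone  { struct toy_memcg memcg[NR_MEMCGS]; };

        /* Scan one memcg's per-zone LRU: "lru pages >> priority", as quoted above. */
        static int do_shrink_zone(int prio, struct toy_memcg *m)
        {
                int scan = m->lru_pages >> prio;

                m->lru_pages -= scan;
                return scan;            /* pages "reclaimed" in this toy model */
        }

        int main(void)
        {
                /* Deliberately unbalanced memcg sizes, like A-memcg vs. B-memcg. */
                struct toy_zone zones[NR_ZONES] = {
                        { .memcg = { { 10000 }, { 100 }, { 50000 }, { 300 } } },
                        { .memcg = { {  8000 }, {  80 }, { 40000 }, { 200 } } },
                        { .memcg = { {  2000 }, {  20 }, { 10000 }, { 100 } } },
                };
                int reclaimed = 0;
                int prio, z, m;

                /*
                 * New scheme: every memcg is scanned at the same priority
                 * before the priority is raised, so scan pressure (and thus
                 * aging speed) stays proportional to each memcg's LRU size.
                 * The old limit reclaim instead picked one memcg and raised
                 * the priority for it alone until the target was met.
                 */
                for (prio = NR_PRIORITIES; prio >= 0 && reclaimed < RECLAIM_TARGET; prio--)
                        for (z = 0; z < NR_ZONES; z++)
                                for (m = 0; m < NR_MEMCGS; m++)
                                        reclaimed += do_shrink_zone(prio, &zones[z].memcg[m]);

                printf("reclaimed %d pages\n", reclaimed);
                return 0;
        }

In this toy run the large memcgs contribute most of the scanned pages at
every priority level, while the small ones are only scanned proportionally
to their size, which is the fairness property discussed in the thread.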