On Tue, 7 Dec 2010 17:24:12 -0800 Ying Han <yinghan@xxxxxxxxxx> wrote:
> On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > On Tue, 7 Dec 2010 09:28:01 -0800
> > Ying Han <yinghan@xxxxxxxxxx> wrote:
> >
> >> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@xxxxxxxxx> wrote:
> >
> >> > Potentially there will also be a very large number of new IO
> >> > sources. I confess I haven't read the thread yet, so maybe this has
> >> > already been thought of, but it might make sense to have a 1:N
> >> > relationship between kswapd and memcgroups and cycle between
> >> > containers. The difficulty will be the latency between when kswapd
> >> > wakes up and when a particular container is scanned. The closer the
> >> > ratio is to 1:1, the lower the latency will be, but the higher the
> >> > contention on the LRU lock and IO will be.
> >>
> >> No, we haven't talked about the mapping anywhere in the thread.
> >> Having many kswapd threads at the same time isn't a problem as long
> >> as there is no locking contention (e.g. 1k kswapd threads on a 1k
> >> fake NUMA node system). So breaking the zone->lru_lock should work.
> >>
> >
> > It was me who made zone->lru_lock shared, and a per-memcg lock will
> > make the maintenance of memcg very bad; it will add many races.
> > Or we need to make memcg's LRU not synchronized with the zone's LRU,
> > IOW, we need a completely independent LRU.
> >
> > I'd like to limit the number of kswapd-for-memcg threads if
> > zone->lru_lock contention is problematic. memcg _can_ work without
> > background reclaim.
> >
> > How about adding a per-node kswapd-for-memcg that reclaims pages at a
> > memcg's request? As in:
> >
> >         memcg_wake_kswapd(struct mem_cgroup *mem)
> >         {
> >                 do {
> >                         nid = select_victim_node(mem);
> >                         /* ask kswapd to reclaim memcg's memory */
> >                         ret = memcg_kswapd_queue_work(nid, mem);
> >                         /* may return -EBUSY if very busy */
> >                 } while (...);
> >         }
> >
> > This will keep lock contention to a minimum. Anyway, using too much
> > CPU for this unnecessary_but_good_for_performance_function is bad.
> > Throttling is required.
>
> I don't see the problem of one-kswapd-per-cgroup here since there will
> be no performance cost if they are not running.
>
Yes. But we got a report, one year ago, from a user who runs 2000+
cgroups on his host (on the libcgroup mailing list). So running 2000+
mostly-idle threads will be bad; it is a cost. In theory, the number of
memcgs can be 65534.

> I haven't measured the lock contention and cputime for each kswapd
> running. Theoretically it would be a problem if thousands of cgroups
> are configured on the host and all of them are under memory pressure.
>
I think that's a configuration mistake.

> We can either optimize the locking or make each kswapd smarter (hold
> the lock for less time). My current plan is to have the
> one-kswapd-per-cgroup on the V2 patch w/ select_victim_node, and the
> optimization for this comes as a following patchset.
>
My point above is that holding a remote node's lock and touching a
remote node's pages increases the memory reclaim cost very much.
That is why I prefer the per-node approach.

Thanks,
-Kame
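
A rough illustration of the per-node queueing and throttling idea
sketched above. This is only a sketch: select_victim_node() and
memcg_kswapd_queue_work() are the hypothetical helpers from Kame's
pseudo-code, and the return-value handling plus the MAX_VICTIM_RETRIES
cap are assumptions added for illustration, not part of any posted
patch.

    /*
     * Sketch of the caller side of per-node kswapd-for-memcg.
     * select_victim_node() and memcg_kswapd_queue_work() are the
     * hypothetical helpers from the sketch above; MAX_VICTIM_RETRIES
     * is an assumed, arbitrary retry cap.
     */
    #define MAX_VICTIM_RETRIES      3

    static void memcg_wake_kswapd(struct mem_cgroup *mem)
    {
            int retries = 0;
            int nid, ret;

            do {
                    /* pick a node where this memcg has reclaimable pages */
                    nid = select_victim_node(mem);
                    if (nid < 0)
                            return; /* nothing left to reclaim */

                    /* hand the memcg to that node's kswapd-for-memcg thread */
                    ret = memcg_kswapd_queue_work(nid, mem);
                    if (ret == -EBUSY)
                            return; /* throttled: that worker is saturated */
            } while (ret && ++retries < MAX_VICTIM_RETRIES);
    }

The throttling here is simply bailing out as soon as the per-node
worker reports -EBUSY, so a memcg asking for background reclaim cannot
spin and burn CPU while the workers are saturated.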