On 12/12/2013 01:50 PM, Vladimir Davydov wrote: >>>>> +int memcg_list_lru_init(struct list_lru *lru) >>>>> +{ >>>>> + int err = 0; >>>>> + int i; >>>>> + struct mem_cgroup *memcg; >>>>> + >>>>> + lru->memcg = NULL; >>>>> + lru->memcg_old = NULL; >>>>> + >>>>> + mutex_lock(&memcg_create_mutex); >>>>> + if (!memcg_kmem_enabled()) >>>>> + goto out_list_add; >>>>> + >>>>> + lru->memcg = kcalloc(memcg_limited_groups_array_size, >>>>> + sizeof(*lru->memcg), GFP_KERNEL); >>>>> + if (!lru->memcg) { >>>>> + err = -ENOMEM; >>>>> + goto out; >>>>> + } >>>>> + >>>>> + for_each_mem_cgroup(memcg) { >>>>> + int memcg_id; >>>>> + >>>>> + memcg_id = memcg_cache_id(memcg); >>>>> + if (memcg_id < 0) >>>>> + continue; >>>>> + >>>>> + err = list_lru_memcg_alloc(lru, memcg_id); >>>>> + if (err) { >>>>> + mem_cgroup_iter_break(NULL, memcg); >>>>> + goto out_free_lru_memcg; >>>>> + } >>>>> + } >>>>> +out_list_add: >>>>> + list_add(&lru->list, &all_memcg_lrus); >>>>> +out: >>>>> + mutex_unlock(&memcg_create_mutex); >>>>> + return err; >>>>> + >>>>> +out_free_lru_memcg: >>>>> + for (i = 0; i < memcg_limited_groups_array_size; i++) >>>>> + list_lru_memcg_free(lru, i); >>>>> + kfree(lru->memcg); >>>>> + goto out; >>>>> +} >>>> That will probably scale even worse. Think about what happens when we >>>> try to mount a bunch of filesystems in parallel - they will now >>>> serialise completely on this memcg_create_mutex instantiating memcg >>>> lists inside list_lru_init(). >>> Yes, the scalability seems to be the main problem here. I have a couple >>> of thoughts on how it could be improved. Here they go: >>> >>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an >>> rw-semaphore, which we would take for modifying list_lru's) and take it >>> for reading in memcg_list_lru_init() and for writing when we create a >>> new memcg (memcg_init_all_lrus()). >>> This would remove the bottleneck from the mount path, but every memcg >>> creation would still iterate over all LRUs under a memcg mutex. So I >>> guess it is not an option, isn't it? >> Right - it's not so much that there is a mutex to protect the init, >> it's how long it's held that will be the issue. I mean, we don't >> need to hold the memcg_create_mutex until we've completely >> initialised the lru structure and are ready to add it to the >> all_memcg_lrus list, right? >> >> i.e. restructing it so that you don't need to hold the mutex until >> you make the LRU list globally visible would solve the problem just >> as well. if we can iterate the memcgs lists without holding a lock, >> then we can init the per-memcg lru lists without holding a lock >> because nobody will access them through the list_lru structure >> because it's not yet been published. >> >> That keeps the locking simple, and we get scalability because we've >> reduced the lock's scope to just a few instructures instead of a >> memcg iteration and a heap of memory allocation.... > Unfortunately that's not that easy as it seems to be :-( > > Currently I hold the memcg_create_mutex while initializing per-memcg > LRUs in memcg_list_lru_init() in order to be sure that I won't miss a > memcg that is added during initialization. > > I mean, let's try to move per-memcg LRUs allocation outside the lock and > only register the LRU there: > > memcg_list_lru_init(): > 1) allocate lru->memcg array > 2) for_each_kmem_active_memcg(m) > allocate lru->memcg[m] > 3) lock memcg_create_mutex > add lru to all_memcg_lrus_list > unlock memcg_create_mutex > > Then if a new kmem-active memcg is added after step 2 and before step 3, > it won't see the new lru, because it has not been registered yet, and > thus won't initialize its list in this lru. As a result, we will end up > with a partially initialized list_lru. Note that this will happen even > if the whole memcg initialization proceeds under the memcg_create_mutex. > > Provided we could freeze memcg_limited_groups_array_size, it would be > possible to fix this problem by swapping steps 2 and 3 and making step 2 > initialize lru->memcg[m] using cmpxchg() only if it was not initialized. > However we still have to hold the memcg_create_mutex during the whole > kmemcg activation path (memcg_init_all_lrus()). > > Let's see if we can get rid of the lock in memcg_init_all_lrus() by > making the all_memcg_lrus RCU-protected so that we could iterate over > all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to > something like this: > > memcg_init_all_lrus(): > 1) for_each_list_lru_rcu(lru) > allocate lru->memcg[new_memcg_id] > 2) mark new_memcg as kmem-active > > The problem is that if memcg_list_lru_init(new_lru) starts and completes > between steps 1 and 2, we will not initialize > new_lru->memcg[new_memcg_id] neither in memcg_init_all_lrus() nor in > memcg_list_lru_init(). > > The problem here is that on kmemcg creation (memcg_init_all_lrus()) we > have to iterate over all list_lrus while on list_lru creation > (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I > can't figure out how we can do it w/o holding any mutexes at least while > calling one of these functions, but I'm keep thinking on it. > Seems I got it. We could add a memcg state bit, say "activating", meaning that a memcg is going to become kmem-active, but it is not yet, so it should not be accounted to; keep all list_lrus in a rcu-protected list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as follows: memcg_init_all_lrus(): set activating for_each_list_lru_rcu(lru) cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node); set active memcg_list_lru_init(): add the new_lru to the all_memcg_lrus list for_each_memcg(memcg): if memcg is activating or active: cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node) At first glance, it looks correct, because: If we skip an lru while iterating over all_memcg_lrus in memcg_init_all_lrus(), it means it was created after the "activating" bit had been set and thus the per-memcg lru will be initialized in memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this memg will "see" this lru while iterating over the all_memcg_lrus list, because the lru must have been created before the "activating" bit set. Although it is to be elaborated, because I haven't examined the destruction paths yet, and this doesn't take into account the fact that memcg_limited_groups_array_size is not a constant (I guess I'll have to introduce an rw semaphore for it, but it'll be OK, because its updates are very-very rare), I guess I am on the right way. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>