> On Mar 30, 2022, at 7:14 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 30 Mar 2022 13:26:46 -0400 Waiman Long <longman@xxxxxxxxxx> wrote:
>
>> Muchun Song found out there could be a race between list_lru_add()
>> and memcg_reparent_list_lru_node() causing the latter function to miss
>> reparenting of a lru entry as shown below:
>>
>> CPU0:                           CPU1:
>> list_lru_add()
>>     spin_lock(&nlru->lock)
>>     l = list_lru_from_kmem(memcg)
>>                                 memcg_reparent_objcgs(memcg)
>>                                 memcg_reparent_list_lrus(memcg)
>>                                   memcg_reparent_list_lru()
>>                                     memcg_reparent_list_lru_node()
>>                                       if (!READ_ONCE(nlru->nr_items))
>>                                           // Miss reparenting
>>                                           return
>>     // Assume 0->1
>>     l->nr_items++
>>     // Assume 0->1
>>     nlru->nr_items++
>>
>> Though it is not likely that a list_lru_node that has 0 items suddenly
>> has a newly added lru entry at the end of its life, the race is still
>> theoretically possible.
>>
>> With the lock/unlock pair used within the percpu_ref_kill() which is
>> the last function call of memcg_reparent_objcgs(), any read issued
>> in memcg_reparent_list_lru_node() will not be reordered before the
>> reparenting of objcgs.
>>
>> Add a !spin_is_locked()/smp_rmb()/!READ_ONCE(nlru->nr_items) check
>> to ensure that either the read of nr_items is valid or the racing
>> list_lru_add() will see the reparented objcg.
>>
>> ...
>>
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -395,10 +395,33 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>  	struct list_lru_one *src, *dst;
>>  
>>  	/*
>> -	 * If there is no lru entry in this nlru, we can skip it immediately.
>> +	 * With the lock/unlock pair used within the percpu_ref_kill()
>> +	 * which is the last function call of memcg_reparent_objcgs(), any
>> +	 * read issued here will not be reordered before the reparenting
>> +	 * of objcgs.
>> +	 *
>> +	 * Assuming a racing list_lru_add():
>> +	 * list_lru_add()
>> +	 *                           <- memcg_reparent_list_lru_node()
>> +	 *     spin_lock(&nlru->lock)
>> +	 *     l = list_lru_from_kmem(memcg)
>> +	 *     nlru->nr_items++
>> +	 *     spin_unlock(&nlru->lock)
>> +	 *                           <- memcg_reparent_list_lru_node()
>> +	 *
>> +	 * The !spin_is_locked(&nlru->lock) check being true means it is
>> +	 * either before the spin_lock() or after the spin_unlock(). In the
>> +	 * former case, list_lru_add() will see the reparented objcg and so
>> +	 * won't touch the lru to be reparented. In the latter case, it will
>> +	 * see the updated nr_items. So we can use the optimization that if
>> +	 * there is no lru entry in this nlru, skip it immediately.
>>  	 */
>> -	if (!READ_ONCE(nlru->nr_items))
>> -		return;
>> +	if (!spin_is_locked(&nlru->lock)) {
>
> ick.
>
>> +		/* nr_items read must be ordered after nlru->lock */
>> +		smp_rmb();
>> +		if (!READ_ONCE(nlru->nr_items))
>> +			return;
>> +	}
>
> include/linux/spinlock_up.h has
>
> #define arch_spin_is_locked(lock)	((void)(lock), 0)
>
> so this `if' will always be true on CONFIG_SMP=n. Will the kernel
> still work?

I guess yes, because this race is not possible on a !smp machine.

>
> At the very least let's have changelogging and commenting explaining
> that we've actually thought about this.
>
> Preferably, can we fix this hole properly and avoid this hack? There is
> a reason for this:
>
> hp2:/usr/src/25> grep spin_is_locked mm/*.c
> hp2:/usr/src/25>

But honestly, I'd drop the original optimization together with the fix,
if only because there is no _real world_ data on the problem and the
improvement.
It seems like it started as a nice simple improvement, but the race
makes it complex and probably not worth the added complexity and
fragility.
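
Just to illustrate the suggestion (an untested sketch, not a formal
patch): with the fast path gone, memcg_reparent_list_lru_node() would
simply take the lock unconditionally. The trailing parameters, the nlru
local and the irq-safe locking below are from my reading of mainline
rather than from the hunk above, and the splice body is elided:

static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
					 int src_idx, struct mem_cgroup *dst_memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *src, *dst;

	/*
	 * No nr_items fast path: always take nlru->lock, so a racing
	 * list_lru_add() either sees the reparented objcg before it
	 * takes the lock or we see its newly added entry once we hold
	 * the lock.
	 */
	spin_lock_irq(&nlru->lock);

	/* ... splice the entries from src to dst as before ... */

	spin_unlock_irq(&nlru->lock);
}

Thanks!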