On Wed, Jun 12, 2024 at 3:26, Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>

Does spin_lock() ensure that compiler optimizations do not remove memory
accesses to an external variable? I think we need to use READ_ONCE/WRITE_ONCE
for shared variable access even under a spinlock. For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
Isn't this a common use case of READ_ONCE?

```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void)
{
	for (;;) {
		spin_lock(&flag_lock);
		/* check external updates */
		if (READ_ONCE(shared_flag))
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```

Without READ_ONCE, the check could be hoisted out of the loop by compiler
optimization. In shrink_worker, zswap_next_shrink plays the role of
shared_flag: it can be updated by concurrent cleaner threads, so it must be
re-read every time we reacquire the lock. Am I badly misunderstanding
something?

> >         do {
> > +iternext:
> >                 spin_lock(&zswap_shrink_lock);
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > -               memcg = zswap_next_shrink;
> > +               next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > +               if (memcg != next_memcg) {
> > +                       /*
> > +                        * Ours was released by offlining.
> > +                        * Use the saved memcg reference.
> > +                        */
> > +                       memcg = next_memcg;
> > +               } else {
> > +                       /* advance cursor */
> > +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +                       WRITE_ONCE(zswap_next_shrink, memcg);
> > +               }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>

For where to restart the shrinking, as Yosry pointed out, my version restarts
from the last memcg (i.e. retrying the failed memcg, or evicting from it once
more). I now realize that skipping the memcg that follows an offlined memcg is
less likely to happen, so I am reverting it to restart from the memcg next to
zswap_next_shrink. Which one would be better?

> >
> >                 /*
> > -                * We need to retry if we have gone through a full round trip, or if we
> > -                * got an offline memcg (or else we risk undoing the effect of the
> > -                * zswap memcg offlining cleanup callback). This is not catastrophic
> > -                * per se, but it will keep the now offlined memcg hostage for a while.
> > -                *
> >                  * Note that if we got an online memcg, we will keep the extra
> >                  * reference in case the original reference obtained by mem_cgroup_iter
> >                  * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> >                 }
> >
> >                 if (!mem_cgroup_tryget_online(memcg)) {
> > -                       /* drop the reference from mem_cgroup_iter() */
> > -                       mem_cgroup_iter_break(NULL, memcg);
> > -                       zswap_next_shrink = NULL;
> > +                       /*
> > +                        * It is an offline memcg which we cannot shrink
> > +                        * until its pages are reparented.
> > +                        *
> > +                        * Since we cannot determine if the offline cleaner has
> > +                        * been already called or not, the offline memcg must be
> > +                        * put back unconditionally. We cannot abort the loop while
> > +                        * zswap_next_shrink has a reference of this offline memcg.
> > +                        */
> >                         spin_unlock(&zswap_shrink_lock);
> > -
> > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > -                               break;
> > -
> > -                       goto resched;
> > +                       goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after that zswap
> offlining callback for that memcg has been completed..
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restart from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only orders serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what's the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).

The locking mechanism in shrink_worker does not rely on what the next memcg
is; the sorting stability of mem_cgroup_iter does not matter here. The
expectation for the iterator is simply that it walks through all live memcgs.
I believe mem_cgroup_iter uses parent-to-leaf (pre-order) ordering of the
cgroup tree and ensures all live cgroups are walked at least once, regardless
of the onlineness of the cursor it starts from.
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368
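As background for that argument, this is my reading of the iterator contract,
shown as a minimal sketch of a full-tree walk. It should be equivalent to the
for_each_mem_cgroup() macro in mm/memcontrol.c; it is only an illustration,
not code from the patch:

```c
/*
 * Minimal sketch of a full-tree walk with mem_cgroup_iter() (my reading
 * of the iterator contract, not code from the patch).
 */
struct mem_cgroup *iter;

for (iter = mem_cgroup_iter(NULL, NULL, NULL);	/* start a round trip */
     iter != NULL;				/* NULL after a full round trip */
     iter = mem_cgroup_iter(NULL, iter, NULL))	/* pass prev back for refcounting */
	;	/* every live memcg in the hierarchy is visited here once */

/*
 * Stopping the walk early would leave the reference on the last returned
 * memcg held, so an early exit must be paired with mem_cgroup_iter_break().
 */
```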
Regarding the reference leak, I overlooked a scenario where a leak might occur
in the existing cleaner, although it should be rare. When the cleaner is
called on the memcg stored in zswap_next_shrink, the next memcg returned by
mem_cgroup_iter() can itself be an offline, already-cleaned memcg, so the
cleaner would leak a reference to that next memcg. We should perform the same
online check in the cleaner, like this:

```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	struct mem_cgroup *next;

	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);

	if (zswap_next_shrink == memcg) {
		next = zswap_next_shrink;
		do {
			next = mem_cgroup_iter(NULL, next, NULL);
			WRITE_ONCE(zswap_next_shrink, next);

			spin_unlock(&zswap_shrink_lock);
			/* zswap_next_shrink might be updated here */
			spin_lock(&zswap_shrink_lock);

			next = READ_ONCE(zswap_next_shrink);
			if (!next)
				break;
		} while (!mem_cgroup_online(next));
		/*
		 * We verified the next memcg is online under the lock.
		 * Even if the next memcg is being offlined here, another
		 * cleaner for the next memcg is waiting for our unlock just
		 * behind us. We can leave the next memcg reference.
		 */
	}
	spin_unlock(&zswap_shrink_lock);
}
```

Just as in shrink_worker, we must check under the lock that the next memcg is
online before leaving its ref in zswap_next_shrink. Otherwise,
zswap_next_shrink might hold a ref to an offlined and already-cleaned memcg.

Or, if you are concerned about temporarily storing an unchecked or offlined
memcg in zswap_next_shrink, it is safe because:

1. If no other cleaner is running for zswap_next_shrink, the ref saved in
   zswap_next_shrink ensures the memcg is still alive when the lock is
   reacquired.
2. Another cleaner thread may put the ref back and replace zswap_next_shrink
   with its own next memcg. We then check the onlineness of the new
   zswap_next_shrink under the reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently, its
   cleaner must wait for our unlock. We can leave that online memcg in place
   and rely on its respective cleaner.
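To make the flow easier to follow, here is how I picture the selection part of
shrink_worker() with the goto iternext retry, assembled from the hunks quoted
above. This is a simplified sketch, not the actual patch: the failure
counting, rescheduling, and the writeback call are elided or only hinted at in
comments.

```c
/* Simplified sketch assembled from the hunks quoted above, not the patch. */
static void shrink_worker_sketch(struct work_struct *w)
{
	struct mem_cgroup *memcg = NULL, *next_memcg;

	do {
iternext:
		spin_lock(&zswap_shrink_lock);

		next_memcg = READ_ONCE(zswap_next_shrink);
		if (memcg != next_memcg) {
			/* ours was released by offlining; use the saved reference */
			memcg = next_memcg;
		} else {
			/* advance the cursor */
			memcg = mem_cgroup_iter(NULL, memcg, NULL);
			WRITE_ONCE(zswap_next_shrink, memcg);
		}

		if (!memcg) {
			/* full round trip: count a failure and resched (elided) */
			spin_unlock(&zswap_shrink_lock);
			break;
		}

		if (!mem_cgroup_tryget_online(memcg)) {
			/*
			 * Offline memcg: we cannot tell whether its offline
			 * cleaner has already run, so retry the selection
			 * under the lock; the next pass puts this memcg back,
			 * either by advancing the cursor via mem_cgroup_iter()
			 * or by picking up the concurrent cleaner's update.
			 */
			spin_unlock(&zswap_shrink_lock);
			goto iternext;
		}
		spin_unlock(&zswap_shrink_lock);

		/* shrink_memcg(memcg), mem_cgroup_put(memcg), failure handling (elided) */
	} while (0); /* the real loop continues while zswap is over its threshold */
}
```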