> > [..]
> > > +/*********************************
> > > +* lru functions
> > > +**********************************/
> > > +static bool zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> > > +{
> > > +        struct mem_cgroup *memcg = get_mem_cgroup_from_entry(entry);
> > > +        int nid = entry_to_nid(entry);
> > > +        bool added = list_lru_add(list_lru, &entry->lru, nid, memcg);
> > > +
> > > +        mem_cgroup_put(memcg);
> >
> > Still not fond of the get/put pattern but okay..
>
> Actually, Johannes and I took another look to see if we can replace
> the memcg reference getting with just rcu_read_lock().
>
> It seems there might be a race between zswap LRU manipulation
> and memcg offlining - not just with the rcu_read_lock() idea, but also
> with our current implementation!
>
> I'll shoot another email with more details later when I'm sure of it
> one way or another...

Interesting, well at least something came out of my complaining :)

> [..]
> > > @@ -652,28 +679,37 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > >           */
> > >          swpoffset = swp_offset(entry->swpentry);
> > >          tree = zswap_trees[swp_type(entry->swpentry)];
> > > -        spin_unlock(&pool->lru_lock);
> > > +        list_lru_isolate(l, item);
> > > +        /*
> > > +         * It's safe to drop the lock here because we return either
> > > +         * LRU_REMOVED_RETRY or LRU_RETRY.
> > > +         */
> > > +        spin_unlock(lock);
> > >
> > >          /* Check for invalidate() race */
> > >          spin_lock(&tree->lock);
> > > -        if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) {
> > > -                ret = -EAGAIN;
> > > +        if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> > >                  goto unlock;
> > > -        }
> > > +
> > >          /* Hold a reference to prevent a free during writeback */
> > >          zswap_entry_get(entry);
> > >          spin_unlock(&tree->lock);
> > >
> > > -        ret = zswap_writeback_entry(entry, tree);
> > > +        writeback_result = zswap_writeback_entry(entry, tree);
> > >
> > >          spin_lock(&tree->lock);
> > > -        if (ret) {
> > > -                /* Writeback failed, put entry back on LRU */
> > > -                spin_lock(&pool->lru_lock);
> > > -                list_move(&entry->lru, &pool->lru);
> > > -                spin_unlock(&pool->lru_lock);
> > > +        if (writeback_result) {
> > > +                zswap_reject_reclaim_fail++;
> > > +                memcg = get_mem_cgroup_from_entry(entry);
> >
> > Can this return NULL? Seems like we don't check the return in most/all places.
>
> I believe so, but memcg experts should fact-check me on this.

If that's the case, there should be NULL checks, no?

> It's roughly the same pattern as zswap charging/uncharging:
>
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length)
> -> getting memcg (under rcu_read_lock())

> > > +                spin_lock(lock);
> > > +                /* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > +                list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> >
> > Perhaps we can move this call with the memcg get/put to a helper like
> > add/del? (e.g. zswap_lru_putback)
> >
> > We would need to move get_mem_cgroup_from_entry() into the lock but I
> > think that's okay.
>
> We probably could, but that sounds like extra code for not a lot of gains, no?

I don't feel strongly, just a fan of consistency. A rough sketch of
what I have in mind is below, after this hunk.

> > > +                spin_unlock(lock);
> > > +                mem_cgroup_put(memcg);
> > > +                ret = LRU_RETRY;
> > >                  goto put_unlock;
> > >          }
> > > +        zswap_written_back_pages++;
> > >
> > >          /*
> > >           * Writeback started successfully, the page now belongs to the

[..]
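To illustrate what I mean, here is an untested sketch of such a helper,
mirroring zswap_lru_add() above. It reuses get_mem_cgroup_from_entry()
and list_lru_putback() from your patch, and assumes the node lock can
be looked up as list_lru->node[nid].lock:

static void zswap_lru_putback(struct list_lru *list_lru,
                              struct zswap_entry *entry)
{
        int nid = entry_to_nid(entry);
        spinlock_t *lock = &list_lru->node[nid].lock;
        struct mem_cgroup *memcg = get_mem_cgroup_from_entry(entry);

        spin_lock(lock);
        /* putback rather than zswap_lru_add(), so the node's lru
         * count is not incremented again */
        list_lru_putback(list_lru, &entry->lru, nid, memcg);
        spin_unlock(lock);
        mem_cgroup_put(memcg);
}

That way the writeback failure path reads the same as the
zswap_lru_add()/zswap_lru_del() pair, and the memcg get/put lives in
one place.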
> > > @@ -696,15 +759,17 @@ static void shrink_worker(struct work_struct *w)
> > >                                                  shrink_work);
> > >          int ret, failures = 0;
> > >
> > > +        /* global reclaim will select cgroup in a round-robin fashion. */
> > >          do {
> > > -                ret = zswap_reclaim_entry(pool);
> > > -                if (ret) {
> > > -                        zswap_reject_reclaim_fail++;
> > > -                        if (ret != -EAGAIN)
> > > -                                break;
> > > -                        if (++failures == MAX_RECLAIM_RETRIES)
> > > -                                break;
> > > -                }
> > > +                pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
> >
> > I think this can be a problem. We hold a ref to a memcg here until the
> > next time we shrink, which can be a long time IIUC. This can cause the
> > memcg to linger as a zombie. I understand it is one memcg per zswap
> > pool, but I am still unsure about it.
> >
> > MGLRU maintains a memcg LRU for global reclaim that gets properly
> > cleaned up when a memcg is going away, so that's one option, although
> > a complicated one.
> >
> > A second option would be to hold a pointer to the objcg instead, which
> > should be less problematic (although we would still be holding that
> > objcg hostage indefinitely). The problem here is that if the objcg
> > gets reparented, next time we will start at the parent of the memcg we
> > stopped at last time, which tbh doesn't sound bad at all to me.
> >
> > A third option would be to flag the memcg such that when it is getting
> > offlined we can call into zswap to reset pool->next_shrink (or move it
> > to the parent) and drop the ref, although synchronization can get
> > hairy when racing with offlining.
> >
> > I am not sure what the right solution is, but I prefer we don't hold
> > any memcgs hostage indefinitely. I also think that if we end up using
> > mem_cgroup_iter(), then there should be a mem_cgroup_iter_break()
> > somewhere if/when we break the iteration.
>
> I'm not sure this is that big of a problem in the first place, but if
> it is, doing something similar to MGLRU is probably the cleanest
> solution: when the memcg is freed, trigger a zswap_exit_memcg()
> callback, which will loop through all the zswap pools and update
> pool->next_shrink where appropriate.
>
> Note that we only have one pool per (compression algorithm x allocator)
> combination, so there cannot be that many pools, correct?
>
> Johannes suggested this idea to me (my apologies if I butcher it)
> during one of our conversations.

That sounds relatively easy IIUC. Be careful that there will be a race
between memcg offlining and zswap's usage of pool->next_shrink. AFAICT
there is no lock to prevent offlining, so there will have to be some
sort of dance here to make sure everything works correctly.
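To make that "dance" concrete, here is a rough, untested sketch of what
I imagine the callback side could look like. zswap_memcg_offline_cleanup()
is a made-up name, the hook into memcg offlining itself is hand-waved,
and I am assuming the existing zswap_pools list and zswap_pools_lock can
double as the serialization point:

/* Hypothetical callback, invoked during memcg offlining (hook not
 * shown). Walk every pool and advance next_shrink past the dying
 * memcg, so no pool keeps a zombie memcg pinned. mem_cgroup_iter()
 * drops the reference on its prev argument for us.
 */
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
        struct zswap_pool *pool;

        spin_lock(&zswap_pools_lock);
        list_for_each_entry(pool, &zswap_pools, list) {
                if (pool->next_shrink == memcg)
                        pool->next_shrink =
                                mem_cgroup_iter(NULL, pool->next_shrink, NULL);
        }
        spin_unlock(&zswap_pools_lock);
}

shrink_worker() would then have to take zswap_pools_lock around its own
mem_cgroup_iter() call as well, so that advancing pool->next_shrink and
cleaning it up cannot race with each other.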