On Fri, Apr 19, 2024 at 7:22 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > On Thu, Apr 18, 2024 at 01:09:22PM -0700, Yosry Ahmed wrote: > > On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > > > > On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote: > > > > On 24/04/17 10:33AM, Johannes Weiner wrote: > > > > > > > > > > Christian, can you please test the below patch on top of current > > > > > upstream? > > > > > > > > > > > > > Hey Johannes, > > > > > > > > I have applied your patch on top of 6.9-rc4 and it did solve the crash for > > > > me, thanks for hacking together a fix so quickly! 🤗 > > > > > > > > Tested-By: Christian Heusel <christian@xxxxxxxxx> > > > > > > Thanks for confirming it, and sorry about the breakage. > > > > > > Andrew, can you please use the updated changelog below? > > > > > > --- > > > > > > From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001 > > > From: Johannes Weiner <hannes@xxxxxxxxxxx> > > > Date: Thu, 18 Apr 2024 08:26:28 -0400 > > > Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory > > > > > > Christian reports a NULL deref in zswap that he bisected down to the > > > zswap shrinker. The issue also cropped up in the bug trackers of > > > libguestfs [1] and the Red Hat bugzilla [2]. > > > > > > The problem is that when memcg is disabled with the boot time flag, > > > the zswap shrinker might get called with sc->memcg == NULL. This is > > > okay in many places, like the lruvec operations. But it crashes in > > > memcg_page_state() - which is only used due to the non-node accounting > > > of cgroup's the zswap memory to begin with. > > > > > > Nhat spotted that the memcg can be NULL in the memcg-disabled case, > > > and I was then able to reproduce the crash locally as well. > > > > > > [1] https://github.com/libguestfs/libguestfs/issues/139 > > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252 > > > > > > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure") > > > Cc: stable@xxxxxxxxxxxxxxx [v6.8] > > > Link: https://lkml.kernel.org/r/20240417143324.GA1055428@xxxxxxxxxxx > > > Reported-by: Christian Heusel <christian@xxxxxxxxx> > > > Debugged-by: Nhat Pham <nphamcs@xxxxxxxxx> > > > Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx> > > > Tested-By: Christian Heusel <christian@xxxxxxxxx> > > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> > > > > Thanks for fixing this. A couple of comments/questions below, but anyway LGTM: > > > > Acked-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > > > > > --- > > > mm/zswap.c | 25 ++++++++++++++++--------- > > > 1 file changed, 16 insertions(+), 9 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index caed028945b0..6f8850c44b61 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > > > if (!gfp_has_io_fs(sc->gfp_mask)) > > > return 0; > > > > > > -#ifdef CONFIG_MEMCG_KMEM > > > - mem_cgroup_flush_stats(memcg); > > > - nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > > > - nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > > > -#else > > > - /* use pool stats instead of memcg stats */ > > > - nr_backing = zswap_pool_total_size >> PAGE_SHIFT; > > > - nr_stored = atomic_read(&zswap_nr_stored); > > > -#endif > > > + /* > > > + * For memcg, use the cgroup-wide ZSWAP stats since we don't > > > + * have them per-node and thus per-lruvec. Careful if memcg is > > > + * runtime-disabled: we can get sc->memcg == NULL, which is ok > > > + * for the lruvec, but not for memcg_page_state(). > > > + * > > > + * Without memcg, use the zswap pool-wide metrics. > > > + */ > > > + if (!mem_cgroup_disabled()) { > > > > With the current shrinker code, it seems like we cannot get sc->memcg > > == NULL unless mem_cgroup_disabled() is true indeed. However, maybe > > it's better to check if sc->memcg is not NULL directly instead? > > > > This would be more resilient in case the shrinker code changes and > > passing sc->memcg == NULL becomes possible in other cases (e.g. during > > global shrinking). It seems like other shrinkers do this, for example > > see count_shadow_nodes() and deferred_split_count(). > > Eh, I'm not sure it's better or worse, so it's a bit hard to care. We > shouldn't get NULL here when memcg is enabled, and if somebody > introduces that bug it's better to catch it early than run into subtle > priority inversions when the kernel is deployed to millions of hosts. No strong opinion here, I just thought consistency with other shrinkers would be nice. > > > I am also wondering if we should also check !mem_cgroup_is_root() > > here? We can avoid the expensive global flush and use the global stats > > directly in this case. I could also send a follow up patch for this if > > that's preferred. > > I'd rather not proliferate more memcg internals in this code. If this > is a concern, optimizing it in the flush and stat functions would make > more sense. Reclaim already flushes the subtree before getting here, > so odds are good this is a no-op in most cases. I agree that with per-memcg thresholding, it is very unlikely that this flush does anything. FWIW, if it turns out to be a problem, optimizing it in the flush functions would not be as straightforward, because the memcg code is not aware that zswap has up-to-date global versions of the stats. Anyway, let's table this until someone actually complains, just seemed to me like a low hanging fruit.