Re: [BUG] ZSwap leaks memory upon being disabled

Konstantin Kharlamov <Hi-Angel@xxxxxxxxx> · Sun, 27 Oct 2024 03:29:05 +0300

On Sat, 2024-10-26 at 10:47 -0700, Yosry Ahmed wrote:
> On Sat, Oct 26, 2024 at 4:33 AM Konstantin Kharlamov
> <Hi-Angel@xxxxxxxxx> wrote:
> > 
> > On Fri, 2024-10-25 at 00:50 -0700, Yosry Ahmed wrote:
> > > On Thu, Oct 24, 2024 at 11:41 PM Konstantin Kharlamov
> > > <Hi-Angel@xxxxxxxxx> wrote:
> > > > 
> > > > On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> > > > > On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> > > > > <Hi-Angel@xxxxxxxxx> wrote:
> > > > > > 
> > > > > > When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo
> > > > > > are still non-zero.
> > > > > > IOW, ZSWAP doesn't free memory upon being disabled.
> > > > > > 
> > > > > > Stumbled upon this while trying to figure out where ≈4G of my
> > > > > > SWAP memory disappeared. Been seeing some unknown memory in
> > > > > > SWAP for years, now I suspect ZSWAP might be the culprit. But
> > > > > > no way to know for sure because of this bug.
> > > > > > 
> > > > > > # Steps to reproduce
> > > > > > 
> > > > > > 1. Enable ZSWAP
> > > > > > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > > > > > 3. Disable ZSWAP via `sudo sh -c "echo 0 > /sys/module/zswap/parameters/enabled"`
> > > > > > 4. Look at `grep Zswap /proc/meminfo`
> > > > > > 
> > > > > > ## Expected
> > > > > > 
> > > > > > The rows are zero because ZSWAP is disabled.
> > > > > 
> > > > > Not really, the expected behavior is that further swapouts will
> > > > > not go to zswap, but pages that are already compressed in zswap
> > > > > will not be written out to the backing swapfile or swapped back
> > > > > to memory. A swapoff would be required for the latter.
> > > > > 
> > > > > This is documented in:
> > > > > https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
> > > > 
> > > > Oh, I see, thank you, sorry for the noise.
> > > > 
> > > > Then, I'm curious, is it correct to assume that this
> > > > `Zswap`-prefixed memory mentioned in meminfo is never the one that
> > > > is in SWAP? I mean, Zswap being a buffer before data goes to swap
> > > > kind of implies that yes, the data is *either* in zswap or in swap.
> > > > But just wanted to hear that explicitly.
> > > 
> > > I know this makes sense, but unfortunately no. Zswap is currently
> > > transparent to the rest of the system. For all intents and purposes,
> > > pages in zswap are considered in swap. You cannot even use zswap
> > > without an actual swapfile. So the zswap stats should be a subset of
> > > the swap stats.
> > > 
> > > FWIW, Nhat is working on restructuring this to have zswap be its own
> > > entity, separate from any swapfiles.
> > > 
> > > > 
> > > > The background to my question is that I'm trying to find the
> > > > culprit behind some "phantom memory" eventually filling up my SWAP.
> > > > This memory is not one accounted to apps (as calculated via
> > > > `smem`), nor to tmpfs. So my next suspect was something related to
> > > > ZSwap.
> > > 
> > > As I mentioned, zswap should be transparent to the rest of the
> > > system, so it shouldn't make a difference in this case whether the
> > > pages are in zswap or in the swapfile.
> > > 
> > > You can use the memory.swap.current counter to find out which memory
> > > cgroup currently has swapped out pages (in zswap or in the swapfile).
> > > This should help find the application that has memory in swap. If you
> > > want to find the exact type of memory (e.g. anon vs tmpfs), that
> > > would be more tricky. Perhaps you can swapoff and see what counters
> > > increase in memory.stat of the relevant memory cgroup?
> > 
> > Thank you, so, I've waited till my SWAP got almost full again
> > (apparently my new workflow triggers that a lot). It is 7.5G out of 8
> > in total. 437M is taken by tmpfs'es, let's subtract for simplicity, so
> > I have 7G taken by something else.
> 
> If the tmpfs's are created and written to by processes in the user
> slice, they should show up in memory.swap.current as well.
> 
> > 
> > Now I'm looking at `/sys/fs/cgroup/user.slice/memory.swap.current` and
> > it's 4422422528 = 4.1G. That's a lot less than 7G. I'm certain this
> 
> Can you check the memory.swap.current value of other slices?

That was a good idea! The
`/sys/fs/cgroup/system.slice/memory.swap.current` seems to have the
missing half of the SWAP memory. From my understanding of the
`systemctl status` graph, the `system.slice` and `user.slice` groups do
not intersect, and by adding up `system.slice/…` + `user.slice/…` I get
around 8G.

However, I'm still unclear what this memory belongs to.
`system.slice/memory.swap.current` is 4.4G currently, that's a lot and
I'm not seeing anything that could take so much memory.
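
For reference, the per-slice sum above can be reproduced with something
like the sketch below. It assumes cgroup v2 mounted at `/sys/fs/cgroup`,
and the function name `swap_by_cgroup` is made up for illustration. As
far as I understand, a cgroup's counter already includes its
descendants, so only the direct children of the given directory are read
to avoid counting anything twice.

```shell
# Print "bytes name" for the memory.swap.current of each direct child of a
# cgroup v2 directory, largest first, followed by a "total" line.
# $1: cgroup directory (assumed default: /sys/fs/cgroup).
swap_by_cgroup() {
    root="${1:-/sys/fs/cgroup}"
    for f in "$root"/*/memory.swap.current; do
        # Skip the unexpanded glob and any unreadable counters.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(cat "$f")" "$(basename "$(dirname "$f")")"
    done | sort -rn | awk '
        { total += $1; print }
        END { printf "%.0f total\n", total }'
}
```

Calling `swap_by_cgroup` with no argument lists the top-level slices,
and `swap_by_cgroup /sys/fs/cgroup/system.slice` should break the 4.4G
down per unit.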

An even larger related mystery is why this memory does not show up in
the `smem` numbers for individual applications (`smem` calculates them
by going over `/proc/$pid/smaps` for every pid).
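
A rough equivalent of that per-process accounting, as a sketch only: it
adds up the `Swap:` fields of every `/proc/$pid/smaps`. The function
name `proc_swap_total_kb` is hypothetical, and it has to run as root to
see the processes of other users. Only memory that is mapped into some
process is visible this way, so swapped-out pages that no process maps
would not be part of this sum.

```shell
# Sum the "Swap:" fields of every <proc>/<pid>/smaps and print the total in kB.
# $1: procfs mount point (assumed default: /proc).
proc_swap_total_kb() {
    proc="${1:-/proc}"
    for f in "$proc"/[0-9]*/smaps; do
        [ -r "$f" ] || continue
        # A process may exit between the glob and the read; ignore that.
        cat "$f" 2>/dev/null
    done | awk '
        $1 == "Swap:" { kb += $2 }   # "SwapPss:" lines are not matched
        END { printf "%.0f\n", kb }'
}
```

Comparing this number with the sum of the `memory.swap.current` values
should show how much swap is not attributed to any process mapping.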

> The other possibility is that the pages are swapped out from the root
> cgroup, in which case they won't show up in memory.swap.current as
> they are basically unaccounted. Although typically user processes
> should not be running in the root cgroup.
> 
> > "phantom swap memory" is hidden in `user.slice`, because if I wait
> > till OOM-killer gets triggered and kills some app, my user-systemd
> > crashes for some reason, taking down the entire user session, and
> > afterwards SWAP is almost free.
> 
> Did you check the OOM logs? It is possible that the OOM killer kills
> some system process that has some memory in swap as well.

I did, logs are pretty uninteresting. OOM kills `electron` (of
element-desktop), but I tried closing it before the OOM, and that didn't
make much difference. Just an arbitrary victim. Then a few lines later a
`Process 560296 (systemd) of user 1000 terminated abnormally with signal
11/SEGV`. Wasn't able to get a stack trace for systemd with Archlinux's
debuginfo servers. And then everything goes down with systemd.

I just tried closing every application I have open and I still got 5.5G
in SWAP. Well, obviously there are services still running, Plasma,
i3wm… Not many suspects left though.