On Mon, Jun 1, 2020 at 12:44 PM Simo Sorce <simo@xxxxxxxxxx> wrote:
>
> On Mon, 2020-06-01 at 10:37 -0600, Chris Murphy wrote:
> > Thanks for the early feedback!
> >
> > On Mon, Jun 1, 2020 at 7:58 AM Stephen Gallagher <sgallagh@xxxxxxxxxx> wrote:
> > > * Reading through the Change, you write:
> > > "using a ZRAM to RAM ratio of 1:2, and capped† to 4GiB" and then you
> > > talk about examples which are using 50% of RAM as ZRAM. Which is it? A
> > > ratio of 1:2 implies using 33% of RAM as ZRAM.
> >
> > This ratio is just a fraction, part of a whole, where RAM is the whole.
> > This convention is used in the zram package.
> >
> > Note that /dev/zram0 is a virtual block device, similar to the
> > 'lvcreate -V' option for thin volumes: the size is a fantasy. And the ZRAM
> > device size is not a preallocation of memory. If a compression ratio of
> > 2:1 (i.e. 200%) holds, then a ZRAM device sized to 50% of RAM will not
> > use more than 25% of RAM.
>
> What happens if you can't compress memory at all?
> Will zram use more memory? Or will it simply become useless (but
> hopefully harmless) churn?

It is not a no-op. There is CPU and memory consumption in that case, and
it actually reduces the memory available to the workload. I haven't yet
seen it in practice and haven't come up with a synthetic test - maybe
something that just creates a bunch of anonymous pages from /dev/urandom?
Because the device is small by default, it ends up being a no-op
performance-wise: you either arrive at the same OOM you would have hit in
the no-swap-at-all case, or it just starts spilling over into swap-on-disk
if you still have one.

I have done quite a lot of testing of the WebKitGTK compile case, which
uses ncpus + 2 for the number of jobs by default, and which eventually
gets to the point of needing up to 1.5 GiB per job. Super memory hungry.

8 GiB RAM + no swap: this sometimes triggers the kernel oomkiller quickly,
but sometimes it just sits there for a really long time before triggering;
it does always eventually trigger. With earlyoom enabled (default on
Workstation) the OOM happens faster, usually within 5 minutes.

8 GiB RAM + 8 GiB swap-on-disk: this sometimes, but far less often,
results in the kernel oomkiller triggering; most often it sits in
pageout/pagein for 30+ minutes with a totally frozen GUI. With earlyoom
enabled, it is consistently killed inside of 10 minutes.

8 GiB RAM + 8 GiB swap-on-ZRAM: the exact reverse: it sometimes, but less
often, results in 30+ minute hangs with a frozen GUI, and usually results
in the kernel oom killer within 5 minutes. With earlyoom enabled it is
consistently killed inside of 5 minutes.

8 GiB RAM + 16 GiB swap-on-disk: consistently finishes the compile.

8 GiB RAM + 16 GiB swap-on-ZRAM: my log doesn't have this test; I thought
I had done it. But I think it's a risky default configuration, because if
you don't get 2:1 compression and the task really needs that much RAM,
it's not just IO churn like with a disk-based swap. It's memory and CPU,
and if it gets wedged in, it's a forced power off. That's basically where
we were with Workstation edition before earlyoom, which is not good, but
not as huge a problem as it is with servers, where you have to send
someone to go hit a power button. In these cases sshd often will not
respond before timeout, so no sysrq+b unless you have that command
pretyped and ready to hit enter.

The scenario where you just don't have the budget for the memory the
workload really needs, and you have to use swap contrary to the "in
defense of swap" article referenced in the change? I think that's maybe a
better use case for zswap. I don't have tests that conclusively prove that
zswap's LRU basis for eviction from the zswap memory pool to the disk swap
is better than how the kernel deals with two swap devices (the zram plus
disk case), but in theory the LRU basis is smarter.
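For anyone who wants to experiment with that route: zswap is a kernel
feature rather than a package, so (and this is just a sketch from memory,
not something in the change proposal) enabling it on a system that already
has a disk swap device looks something like this, assuming the zswap
module parameters are available on the running kernel:

    # turn zswap on for the current boot
    echo 1 > /sys/module/zswap/parameters/enabled
    # optionally pick a compressor and cap the compressed pool at 20% of RAM
    echo lzo > /sys/module/zswap/parameters/compressor
    echo 20 > /sys/module/zswap/parameters/max_pool_percent

    # or make it persistent by adding to the kernel command line:
    #   zswap.enabled=1 zswap.max_pool_percent=20

Pages headed for the existing disk swap get compressed into the pool, and
the least recently used ones get written out to disk when the pool fills,
which is the LRU behavior I'm talking about above.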
Making it easier for folks to experiment with this is, I think, maybe
undersold in the proposal. But the main idea is to convey that the
proposed defaults are safe. Later in the proposal I suggest they might be
too safe, with the 4 GiB cap. That might be refused in favor of 50% of RAM
across the board. But that could be a future enhancement if this proposal
is accepted.

>
> > I'll try to clear this up somehow; probably avoid using the term ratio
> > and just go with fraction/percentage. And also note the use of
> > 'zramctl' to see the actual compression ratio.
> >
> > > * This Change implies the de facto death of hibernation in Fedora.
> > > Good riddance, IMHO. It never worked safely.
> >
> > UEFI Secure Boot put us on this path. There's still no acceptable
> > authenticated encrypted hibernation image scheme, and the SUSE
> > developer working on it told me a few months ago that the status is
> > the same as last year and there's no ETA for when he gets the time to
> > revisit it.
>
> Most people do not use Secure Boot, so this is not really relevant?

Evidence? Windows 8 hardware certification requires that it be enabled by
default on the desktop, and Windows Server 2019 hardware certification
requires that it be enabled for servers. It's clearly the default scenario
with current hardware. Are most users intentionally going into firmware
settings and disabling it? That's esoteric knowledge, and it means they
are already in custom territory, not default territory. So they can just
do a custom install and add a swap partition if they really want one.

Also, Red Hat and Fedora have supported UEFI Secure Boot for 8 years, and
have invested a lot of resources to do so. I think it's difficult to
support UEFI Secure Boot out of the box, to say it's a good idea and a
worthwhile effort, and then to dismiss it by allowing loopholes like
hibernation images that aren't signed.

> Also having swap on an encrypted dm is not hard, it is just a matter of
> plumbing the reboot correctly to unlock the device that needs to be
> solved, assuming desktop people are willing to.

I'm not sure why encrypted swap is relevant here. It is inadequate for the
Secure Boot case, and there is nothing the desktop people can do about it:
hibernation is inhibited by kernel lockdown policy when Secure Boot is
enabled (there's a quick way to check this, sketched below). In any case,
without Secure Boot, resume from hibernation on an encrypted volume
doesn't involve the desktop at all. All the correct things should already
exist and be done in the kernel and initramfs.

Although, I'll say, I'm troubleshooting a new hibernation bug today, and
I'm actually slightly pissed off at the lack of kernel debug messages. The
kernel code shows multiple steps needed for hibernation entry and
exit/resume, and yet there are no goddamn kernel messages to mark each of
these steps being attempted, whether they succeeded or failed, with an
error message if appropriate. Just nothing. It's supremely badly designed
for mortal users to troubleshoot. They aren't crazy. Poor hapless users,
though, think this is their fault. It isn't.
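As an aside, checking whether lockdown is what's blocking hibernation on a
given machine is at least quick. None of this is in the proposal, and I'm
going from memory on the exact paths, but on a recent kernel it's roughly:

    # Secure Boot state (needs the mokutil package)
    mokutil --sb-state
    # kernel lockdown mode; anything stricter than "none" inhibits hibernation
    cat /sys/kernel/security/lockdown
    # what the kernel thinks of hibernation; "[disabled]" means it won't even try
    cat /sys/power/disk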
> Not saying I want to defend hibernate, it is only marginally useful,
> and generally fails due to broken drivers anyway, so it is safer to
> *not* offer it by default anyway. However it would be nice to make it
> straightforward to re-enable it should a user want to and they are on
> hw that handles it nicely.

It might get better due to the "hibernation in the cloud" work, and it
might get better with qemu-kvm because of that. I personally consider that
a prerequisite: if the idealized situation, where all the code is open
source and can be known about and supported, doesn't even work reliably,
how could it possibly work reliably with all the hardware out in the wild?
Well, it can't. And the reality is, it doesn't work reliably with qemu-kvm
right now. That might not be, and probably isn't, a power management bug;
it's probably a qemu or kvm bug.

But then the next thing is: if we aren't creating swap partitions, it's
non-trivial to give users directions for creating one after the fact. That
implies swapfiles, so we could give users some advice on how to do that.
All file systems support swapfiles, but there's been a bunch of iomap work
lately that might blow up hibernation images inside a swapfile. Another
Linuxism is stuffing the hibernation image into the swap file or
partition, so you have to have a way of figuring out the offset where the
hibernation image begins, and there's still no standard interface for that
(a rough sketch of today's manual procedure is further down). So that
might need work. ext4 just switched from bmap to iomap for some things, so
maybe it'll be possible to get a standard interface for this soon (a year
or two?), I'm not certain. And then this is arch specific; it's pretty
much only used on x86.

> > I expect in the "already has a swap partition" case, no one will
> > complain about getting a 2nd swap device that's on /dev/zram0, because
> > it'll just be faster and "spill over" to the bigger swap-on-disk.
>
> The problem though is that it will keep more ram busy, it may cause
> OOMs in situations where previously they did not happen.

The only scenario where I imagine OOM is more likely than without this
feature is if 100% of anonymous pages are 0% compressible. Even if the
compressed pages are 80% the size of the uncompressed ones, you're less
likely to see OOM. In fact, compared to the no-swap case that's more
common on cloud and servers, you'll see less OOM, and in particular less
abrupt OOM. It's also more efficient, because no swap means anonymous
pages are pinned, since they cannot be evicted. That means no-swap has its
own kind of "swap thrashing", which is really the reclaiming of file
pages, since that's the only way left to free memory. So in heavily
memory-demanding workloads you get this churn of reclaiming libraries and
executable code, only to have to re-read them over and over again. This
feature means inactive anonymous pages can be "evicted"; even though it's
not at a 100% rate like with disk, it's still a significant improvement
over no eviction at all.

> The problem here would be that it is hard to figure out how to deal
> with zram unless you already know about it. It would be easier to deal
> with it if disablement required simply removing an entry from fstab.

We could look into inserting a comment into /etc/fstab to point people to
the zram configuration; there is no fstab entry for this. That's also
consistent with where CoreOS is trending - they are moving toward no fstab
at all (maybe they're already there) and using only discoverable
partitions based on the GPT partition type GUID.
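To illustrate the offset problem mentioned above, here's roughly what the
manual swapfile-plus-hibernation setup looks like today on ext4. This is
only a sketch - the size and path are made up, and btrfs needs a different
tool to compute the offset:

    # create and activate an 8 GiB swapfile (example size and path)
    dd if=/dev/zero of=/swapfile bs=1M count=8192
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # the physical offset of the first extent is what goes into the
    # resume_offset= kernel parameter; resume= points at the filesystem's
    # block device
    filefrag -v /swapfile | head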
> > It's the use case where "I explicitly did not create a swap device
> > because I hate swap thrashing" that I suspect there may be complaints.
> > We can't detect that sentiment. All we could do is just decide that no
> > upgrades get the feature.
>
> or you could decide to get it only if swap is used, and not get it if
> no swap is used.
> After all if a system has no swap you are not going to make it worse by
> not providing zram swap.

Running with no swap is already a provably suboptimal configuration. It's
reasonable and fashionable only because of huge swap partitions intended
for hibernation (the 1:1 ratio) and because hard drives are slow.
Swap-on-ZRAM is so much faster that it really does seem like witchcraft.
And the default configuration is really conservative. In the case where
there are no anonymous pages to evict, the ZRAM device doesn't use memory.
It's not a preallocation. It's not a reserved portion of RAM. It's
dynamic.

> You can argue that you *could* make it better in those cases where a
> lot of compression can be achieved.

It's made better even if memory is only minimally compressible.

> But it is arguably more important to not break existing systems than
> improving them.

This is exactly the open question with upgrades, and hence the test day.
In a year of testing I haven't found such a case. But I am not a
scientific sample.

>
> > > Generally, we should probably assume (given existing defaults) that
> > > anyone who has no swap running chose that explicitly and to change it
> > > would lead to complaints.
> >
> > Perhaps.
> >
> > But I assert their decision is based on both bad information (wrong
> > assumptions) and prior bad experience. Even if in the end the
> > decision is to not apply the feature on upgrades, I think it's worth
> > some arguing to counter wrong assumptions and bad experiences.
>
> Given that, as you said, you cannot divine intention, you do not know
> what intention there was behind a choice.
>
> I run without swap; the reason is that I would rather have OOM kick
> misbehaving applications fast than have the system thrashing on slow
> media.

I estimate your workload is better off with swap-on-ZRAM at the proposed
default, with earlyoom enabled. The result will be more successful jobs,
and a fair likelihood that jobs which would have OOM'd anyway will OOM
even faster. And soon enough we'll likely get systemd-oomd as a substitute
for earlyoom, and also Server and Cloud leveraging systemd
session/slice/scope to better handle resource control via cgroups v2 -
work that is ongoing in GNOME and KDE with really good results so far. And
then maybe later we'll know more about how to dynamically change the zram
device size, adapting to the workload.

> Sure zram does not have the thrashing part, probably, because ram is
> fast, but it also won't really help a lot, because the problem I have
> is runaway processes, not just a bit of swap used sometimes. In my case
> zram swap will just delay minimally the OOM that is about to come.

Unlikely, unless you're already using earlyoom. The reason is that the
kernel oomkiller only cares about the kernel's survival: if it can keep
the kernel running, it will never kill jobs. Whereas earlyoom, overly
simplistic though it is and not using any PSI information, will kill these
same jobs sooner than the kernel oomkiller does. Arguably Fedora Server
could even consider oomd2 instead of earlyoom, but that implies some
better cgroups v2 isolation work in Server, for server and cloud use
cases, similar to what the GNOME/KDE folks have been working on. The
Facebook folks already have a lot of this work done, and that could
probably be looked at and even mimicked for the Server and Cloud editions.
The nice thing about earlyoom is that it doesn't require that prior
effort; it's drop-in.
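To be concrete about "drop-in" - this isn't part of the proposal, just
what enabling it looks like on current Fedora, assuming the package and
unit names I remember are right:

    sudo dnf install earlyoom
    sudo systemctl enable --now earlyoom.service
    # then watch what it decides to do
    journalctl -u earlyoom -f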
But the next logical step is better resource control across the board, and
then adding oomd2 on top of that, where it catches the rare thing that
still goes crazy sometimes.

>
> You can still argue that in your opinion I should use zram swap, but it
> is a matter of opinions and you have less data about my system and how
> I use it than I do. So, sure it is arguable in many cases, but you can't
> argue, so you should take the defensive approach and respect user
> choices. They can easily systemctl their way into enabling zram swap if
> they like.

The zram-generator uses the existence of both the generator and a properly
formatted configuration file to set up the swap-on-zram during early boot.
It's not using a systemd unit; the user-facing portion of this is only the
configuration file.

I agree the upgrade case for Cloud/Server is more difficult than
Workstation. But I'd prefer to see a Server/Cloud specific configuration
file with a smaller ZRAM device, rather than not creating one at all.
Maybe sized to 20% of RAM? i.e. it uses 10% of RAM if the swap is
completely full and the compression ratio is 2:1.

>
> > > * If you're going to do the Supplements:, you need to do `Supplements:
> > > fedora-release-common` or you won't get everyone. The `fedora-release`
> > > package is for non-Edition/Spin installs.
> >
> > Right. Fixed.
> >
> > I suggest these three lines in the configuration (I've updated the
> > change proposal's how-to-test section to include this):
> >
> > [zram0]
> > memory-limit = none
> > zram-fraction = 0.5
>
> Does this mean zram is going to use potentially half of the available
> system ram?

Yes, if the compression ratio is 1:1, i.e. no compression at all. This is
also the reason the proposal suggests a cap of 4 GiB for /dev/zram0.
Typically a 4 GiB /dev/zram0 device means 2 GiB of RAM consumption *if*
the swap is completely full.

--
Chris Murphy
_______________________________________________
cloud mailing list -- cloud@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to cloud-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/cloud@xxxxxxxxxxxxxxxxxxxxxxx