On Mon, Jun 1, 2020 at 12:44 PM Simo Sorce <simo@xxxxxxxxxx> wrote:
>
> On Mon, 2020-06-01 at 10:37 -0600, Chris Murphy wrote:
> > Thanks for the early feedback!
> >
> > On Mon, Jun 1, 2020 at 7:58 AM Stephen Gallagher <sgallagh@xxxxxxxxxx> wrote:
> > > * Reading through the Change, you write:
> > > "using a ZRAM to RAM ratio of 1:2, and capped† to 4GiB" and then you
> > > talk about examples which are using 50% of RAM as ZRAM. Which is it? A
> > > ratio of 1:2 implies using 33% of RAM as ZRAM.
> >
> > This ratio is just a fraction, part of a whole, where RAM is the whole.
> > This convention is used in the zram package.
> >
> > Note that /dev/zram0 is a virtual block device, similar to the
> > 'lvcreate -V' option for thin volumes: the size is a fantasy. And the ZRAM
> > device size is not a preallocation of memory. If a compression ratio of
> > 2:1 (i.e. 200%) holds, then a ZRAM device sized to 50% of RAM will not
> > use more than 25% of RAM.
>
> What happens if you can't compress memory at all?
> Will zram use more memory? Or will it simply become useless (but
> hopefully harmless) churn?

It is not a no-op. There is CPU and memory consumption in that case, and
it actually reduces the memory available to the workload. I haven't yet
seen it in practice and haven't come up with a synthetic test - maybe
something that just creates a bunch of anonymous pages from /dev/urandom?
Because the device is small by default, it ends up being a no-op
performance-wise: you either arrive at the same OOM you would have hit in
the no-swap-at-all case, or it just starts spilling over into swap-on-disk
if you still have one.

I have done quite a lot of testing of the WebKitGTK compile case, which
uses ncpus + 2 for the number of jobs by default, and which eventually
gets to the point of needing up to 1.5 GiB per job. Super memory hungry.

8 GiB RAM + no swap: this sometimes triggers the kernel oomkiller quickly,
but sometimes it just sits there for a really long time before triggering;
it does always eventually trigger. With earlyoom enabled (default on
Workstation) the OOM happens faster, usually within 5 minutes.

8 GiB RAM + 8 GiB swap-on-disk: this sometimes, but far less often,
results in the kernel oomkiller triggering; most often it sits in
pageout/pagein for 30+ minutes with a totally frozen GUI. With earlyoom
enabled, it is consistently killed inside of 10 minutes.

8 GiB RAM + 8 GiB swap-on-ZRAM: the exact reverse: it sometimes, but less
often, results in 30+ minute hangs with a frozen GUI, and usually results
in the kernel oom killer within 5 minutes. With earlyoom enabled it is
consistently killed inside of 5 minutes.

8 GiB RAM + 16 GiB swap-on-disk: consistently finishes the compile.

8 GiB RAM + 16 GiB swap-on-ZRAM: my log doesn't have this test; I thought
I had done it. But I think it's a risky default configuration, because if
you don't get 2:1 compression and the task really needs that much RAM,
it's not just IO churn like with a disk-based swap. It's memory and CPU,
and if it gets wedged in, it's a forced power off. That's basically where
we were with Workstation edition before earlyoom, which is not good, but
not as huge a problem as it is with servers, where you have to send
someone to go hit a power button. In these cases sshd often will not
respond before timeout, so no sysrq+b unless you have that command
pretyped and ready to hit enter.

The scenario where you just don't have the budget for the memory the
workload really needs, and you have to use swap contrary to the "in
defense of swap" article referenced in the change? I think that's maybe a
better use case for zswap. I don't have tests that conclusively prove that
zswap's LRU basis for eviction from the zswap memory pool to the disk swap
is better than how the kernel deals with two swap devices (the zram plus
disk case), but in theory the LRU basis is smarter.
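For anyone who wants to experiment with that route: zswap is a kernel
feature rather than a package, so (and this is just a sketch from memory,
not something in the change proposal) enabling it on a system that already
has a disk swap device looks something like this, assuming the zswap
module parameters are available on the running kernel:

    # turn zswap on for the current boot
    echo 1 > /sys/module/zswap/parameters/enabled
    # optionally pick a compressor and cap the compressed pool at 20% of RAM
    echo lzo > /sys/module/zswap/parameters/compressor
    echo 20 > /sys/module/zswap/parameters/max_pool_percent

    # or make it persistent by adding to the kernel command line:
    #   zswap.enabled=1 zswap.max_pool_percent=20

Pages headed for the existing disk swap get compressed into the pool, and
the least recently used ones get written out to disk when the pool fills,
which is the LRU behavior I'm talking about above.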
Making it easier for folks to experiment with this is, I think, maybe
undersold in the proposal. But the main idea is to convey that the
proposed defaults are safe. Later in the proposal I suggest they might be
too safe, with the 4 GiB cap. That might be refused in favor of 50% of RAM
across the board. But that could be a future enhancement if this proposal
is accepted.

>
> > I'll try to clear this up somehow; probably avoid using the term ratio
> > and just go with fraction/percentage. And also note the use of
> > 'zramctl' to see the actual compression ratio.
> >
> > > * This Change implies the de facto death of hibernation in Fedora.
> > > Good riddance, IMHO. It never worked safely.
> >
> > UEFI Secure Boot put us on this path. There's still no acceptable
> > authenticated encrypted hibernation image scheme, and the SUSE
> > developer working on it told me a few months ago that the status is
> > the same as last year and there's no ETA for when he gets the time to
> > revisit it.
>
> Most people do not use Secure Boot, so this is not really relevant?

Evidence? Windows 8 hardware certification requires that it be enabled by
default on the desktop, and Windows Server 2019 hardware certification
requires that it be enabled for servers. It's clearly the default scenario
with current hardware. Are most users intentionally going into firmware
settings and disabling it? That's esoteric knowledge, and it means they
are already in custom territory, not default territory. So they can just
do a custom install and add a swap partition if they really want one.

Also, Red Hat and Fedora have supported UEFI Secure Boot for 8 years, and
have invested a lot of resources to do so. I think it's difficult to
support UEFI Secure Boot out of the box, to say it's a good idea and a
worthwhile effort, and then to dismiss it by allowing loopholes like
hibernation images that aren't signed.

> Also having swap on an encrypted dm is not hard, it is just a matter of
> plumbing the reboot correctly to unlock the device that needs to be
> solved, assuming desktop people are willing to.

I'm not sure why encrypted swap is relevant here. It is inadequate for the
Secure Boot case, and there is nothing the desktop people can do about it:
hibernation is inhibited by kernel lockdown policy when Secure Boot is
enabled (there's a quick way to check this, sketched below). In any case,
without Secure Boot, resume from hibernation on an encrypted volume
doesn't involve the desktop at all. All the correct things should already
exist and be done in the kernel and initramfs.

Although, I'll say, I'm troubleshooting a new hibernation bug today, and
I'm actually slightly pissed off at the lack of kernel debug messages. The
kernel code shows multiple steps needed for hibernation entry and
exit/resume, and yet there are no goddamn kernel messages to mark each of
these steps being attempted, whether they succeeded or failed, with an
error message if appropriate. Just nothing. It's supremely badly designed
for mortal users to troubleshoot. They aren't crazy. Poor hapless users,
though, think this is their fault. It isn't.
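As an aside, checking whether lockdown is what's blocking hibernation on a
given machine is at least quick. None of this is in the proposal, and I'm
going from memory on the exact paths, but on a recent kernel it's roughly:

    # Secure Boot state (needs the mokutil package)
    mokutil --sb-state
    # kernel lockdown mode; anything stricter than "none" inhibits hibernation
    cat /sys/kernel/security/lockdown
    # what the kernel thinks of hibernation; "[disabled]" means it won't even try
    cat /sys/power/disk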
> Not saying I want to defend hibernate, it is only marginally useful,
> and generally fails due to broken drivers anyway, so it is safer to
> *not* offer it by default anyway. However it would be nice to make it
> straightforward to re-enable it should a user want to and they are on
> hw that handles it nicely.

It might get better due to the "hibernation in the cloud" work, and it
might get better with qemu-kvm because of that. I personally consider that
a prerequisite: if the idealized situation, where all the code is open
source and can be known about and supported, doesn't even work reliably,
how could it possibly work reliably with all the hardware out in the wild?
Well, it can't. And the reality is, it doesn't work reliably with qemu-kvm
right now. That might not be, and probably isn't, a power management bug;
it's probably a qemu or kvm bug.

But then the next thing is: if we aren't creating swap partitions, it's
non-trivial to give users directions for creating one after the fact. That
implies swapfiles, so we could give users some advice on how to do that.
All file systems support swapfiles, but there's been a bunch of iomap work
lately that might blow up hibernation images inside a swapfile. Another
Linuxism is stuffing the hibernation image into the swap file or
partition, so you have to have a way of figuring out the offset where the
hibernation image begins, and there's still no standard interface for that
(a rough sketch of today's manual procedure is further down). So that
might need work. ext4 just switched from bmap to iomap for some things, so
maybe it'll be possible to get a standard interface for this soon (a year
or two?), I'm not certain. And then this is arch specific; it's pretty
much only used on x86.

> > I expect in the "already has a swap partition" case, no one will
> > complain about getting a 2nd swap device that's on /dev/zram0, because
> > it'll just be faster and "spill over" to the bigger swap-on-disk.
>
> The problem though is that it will keep more ram busy, it may cause
> OOMs in situations where previously they did not happen.

The only scenario where I imagine OOM is more likely than without this
feature is if 100% of anonymous pages are 0% compressible. Even if the
compressed pages are 80% the size of the uncompressed ones, you're less
likely to see OOM. In fact, compared to the no-swap case that's more
common on cloud and servers, you'll see less OOM, and in particular less
abrupt OOM. It's also more efficient, because no swap means anonymous
pages are pinned, since they cannot be evicted. That means no-swap has its
own kind of "swap thrashing", which is really the reclaiming of file
pages, since that's the only way left to free memory. So in heavily
memory-demanding workloads you get this churn of reclaiming libraries and
executable code, only to have to re-read them over and over again. This
feature means inactive anonymous pages can be "evicted"; even though it's
not at a 100% rate like with disk, it's still a significant improvement
over no eviction at all.

> The problem here would be that it is hard to figure out how to deal
> with zram unless you already know about it. It would be easier to deal
> with it if disablement required simply removing an entry from fstab.

We could look into inserting a comment into /etc/fstab to point people to
the zram configuration; there is no fstab entry for this. That's also
consistent with where CoreOS is trending - they are moving toward no fstab
at all (maybe they're already there) and using only discoverable
partitions based on the GPT partition type GUID.
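To illustrate the offset problem mentioned above, here's roughly what the
manual swapfile-plus-hibernation setup looks like today on ext4. This is
only a sketch - the size and path are made up, and btrfs needs a different
tool to compute the offset:

    # create and activate an 8 GiB swapfile (example size and path)
    dd if=/dev/zero of=/swapfile bs=1M count=8192
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # the physical offset of the first extent is what goes into the
    # resume_offset= kernel parameter; resume= points at the filesystem's
    # block device
    filefrag -v /swapfile | head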
> > It's the use case where "I explicitly did not create a swap device
> > because I hate swap thrashing" that I suspect there may be complaints.
> > We can't detect that sentiment. All we could do is just decide that no
> > upgrades get the feature.
>
> or you could decide to get it only if swap is used, and not get it if
> no swap is used.
> After all if a system has no swap you are not going to make it worse by
> not providing zram swap.

Running with no swap is already a provably suboptimal configuration. It's
reasonable and fashionable only because of huge swap partitions intended
for hibernation (the 1:1 ratio) and because hard drives are slow.
Swap-on-ZRAM is so much faster that it really does seem like witchcraft.
And the default configuration is really conservative. In the case where
there are no anonymous pages to evict, the ZRAM device doesn't use memory.
It's not a preallocation. It's not a reserved portion of RAM. It's
dynamic.

> You can argue that you *could* make it better in those cases where a
> lot of compression can be achieved.

It's made better even if memory is only minimally compressible.

> But it is arguably more important to not break existing systems than
> improving them.

This is exactly the open question with upgrades, and hence the test day.
In a year of testing I haven't found such a case. But I am not a
scientific sample.

>
> > > Generally, we should probably assume (given existing defaults) that
> > > anyone who has no swap running chose that explicitly and to change it
> > > would lead to complaints.
> >
> > Perhaps.
> >
> > But I assert their decision is based on both bad information (wrong
> > assumptions) and prior bad experience. Even if in the end the
> > decision is to not apply the feature on upgrades, I think it's worth
> > some arguing to counter wrong assumptions and bad experiences.
>
> Given that, as you said, you cannot divine intention, you do not know
> what intention there was behind a choice.
>
> I run without swap; the reason is that I would rather have OOM kick
> misbehaving applications fast than have the system thrashing on slow
> media.

I estimate your workload is better off with swap-on-ZRAM at the proposed
default, with earlyoom enabled. The result will be more successful jobs,
and a fair likelihood that jobs which would have OOM'd anyway will OOM
even faster. And soon enough we'll likely get systemd-oomd as a substitute
for earlyoom, and also Server and Cloud leveraging systemd
session/slice/scope to better handle resource control via cgroups v2 -
work that is ongoing in GNOME and KDE with really good results so far. And
then maybe later we'll know more about how to dynamically change the zram
device size, adapting to the workload.

> Sure zram does not have the thrashing part, probably, because ram is
> fast, but it also won't really help a lot, because the problem I have
> is runaway processes, not just a bit of swap used sometimes. In my case
> zram swap will just delay minimally the OOM that is about to come.

Unlikely, unless you're already using earlyoom. The reason is that the
kernel oomkiller only cares about the kernel's survival: if it can keep
the kernel running, it will never kill jobs. Whereas earlyoom, overly
simplistic though it is and not using any PSI information, will kill these
same jobs sooner than the kernel oomkiller does. Arguably Fedora Server
could even consider oomd2 instead of earlyoom, but that implies some
better cgroups v2 isolation work in Server, for server and cloud use
cases, similar to what the GNOME/KDE folks have been working on. The
Facebook folks already have a lot of this work done, and that could
probably be looked at and even mimicked for the Server and Cloud editions.
The nice thing about earlyoom is that it doesn't require that prior
effort; it's drop-in.
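To be concrete about "drop-in" - this isn't part of the proposal, just
what enabling it looks like on current Fedora, assuming the package and
unit names I remember are right:

    sudo dnf install earlyoom
    sudo systemctl enable --now earlyoom.service
    # then watch what it decides to do
    journalctl -u earlyoom -f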
But the next logical step is better resource control across the board, and
then adding oomd2 on top of that, where it catches the rare thing that
still goes crazy sometimes.

>
> You can still argue that in your opinion I should use zram swap, but it
> is a matter of opinions and you have less data about my system and how
> I use it than I do. So, sure it is arguable in many cases, but you can't
> argue, so you should take the defensive approach and respect user
> choices. They can easily systemctl their way into enabling zram swap if
> they like.

The zram-generator uses the existence of both the generator and a properly
formatted configuration file to set up the swap-on-zram during early boot.
It's not using a systemd unit; the user-facing portion of this is only the
configuration file.

I agree the upgrade case for Cloud/Server is more difficult than
Workstation. But I'd prefer to see a Server/Cloud specific configuration
file with a smaller ZRAM device, rather than not creating one at all.
Maybe sized to 20% of RAM? i.e. it uses 10% of RAM if the swap is
completely full and the compression ratio is 2:1.

>
> > > * If you're going to do the Supplements:, you need to do `Supplements:
> > > fedora-release-common` or you won't get everyone. The `fedora-release`
> > > package is for non-Edition/Spin installs.
> >
> > Right. Fixed.
> >
> > I suggest these three lines in the configuration (I've updated the
> > change proposal's how-to-test section to include this):
> >
> > [zram0]
> > memory-limit = none
> > zram-fraction = 0.5
>
> Does this mean zram is going to use potentially half of the available
> system ram?

Yes, if the compression ratio is 1:1, i.e. no compression at all. This is
also the reason the proposal suggests a cap of 4 GiB for /dev/zram0.
Typically a 4 GiB /dev/zram0 device means 2 GiB of RAM consumption *if*
the swap is completely full.

--
Chris Murphy
_______________________________________________
cloud mailing list -- cloud@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to cloud-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/cloud@xxxxxxxxxxxxxxxxxxxxxxx