Re: Fedora 33 System-Wide Change proposal: swap on zram

On Sat, Jun 06, 2020 at 05:36:15PM -0600, Chris Murphy wrote:
> To me this sounds like too much dependency on swap.

That's not what I meant; I wanted to emphasize the different value of
disk storage vs. RAM. As I said in another email, it doesn't matter at
all whether disk swap usage sits at 0% or 90%, while RAM usage can be
quite essential. (This is for the case where swapped-out stuff stays
swapped out.)

> What people hate is slow swap.

This is not generally true; it only holds once RAM gets so tight that
applications start competing for swap.
This is why I've proposed test cases covering exactly that, as for the
case of persistent swap I'd expect the outcome to be a clear win for
disk swap. (Although this can in some cases also be seen as a bug, as
it would mean applications not really using the memory they allocated.)

> For sure there is an impact on CPU. This exchanges IO bound work, for
> CPU and memory bound work. But it's pretty lightweight compression.
> 
> And again, whatever is defined as "too much" CPU hit for the workload,
> necessarily translates into either "suffer the cost of IO bound
> swap-on-drive, or suffer the cost of more memory." There is no free
> lunch. It really isn't magic.

Yes, that seems obvious to me. What would be interesting is the point
where one becomes significantly slower than the other.
The theoretical test case is writing data to memory and reading it
again. For this case I'm assuming 8G of total RAM.

Up to about 95% memory usage I'd expect the disk swap case to win, as
it should behave the same as having no swap at all (with matching
swappiness values).

At 150% memory usage (12G of data) and assuming a 2:1 compression
ratio this would mean:
- disk swap:
  has to write 4G to disk initially, and for reading the data back
  swap another 4G out and 4G in
  (12G total traffic: 4G initial, 4G swapping out and 4G swapping in)
- zram, assuming a zram device occupying 4G of RAM (which holds 8G of
  pages at 2:1, leaving 4G for applications):
  has to write 8G to zram initially, and for reading the data back
  swap another 8G out and 8G in
  (24G total traffic: 8G initial, 8G swapping out and 8G swapping in)
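
As a back-of-the-envelope check, here's a small Python model of those
traffic numbers (the 8G RAM, 2:1 ratio and 4G-of-RAM-for-zram figures
are my assumptions from above, nothing measured):

# Back-of-the-envelope swap traffic model for the 150% scenario above.
# All figures in GiB. Assumptions: 8 GiB RAM, 12 GiB working set,
# 2:1 compression, zram device allowed to occupy 4 GiB of RAM.

RAM = 8
DATA = 12            # 150% of RAM
RATIO = 2            # 2:1 compression

# Disk swap: the full 8 GiB of RAM stays usable for anonymous pages.
disk_swapped = DATA - RAM          # 4 GiB spill to disk
disk_traffic = disk_swapped * 3    # initial out + out + in on re-read
print(f"disk swap traffic: {disk_traffic} GiB")   # -> 12 GiB

# zram: 4 GiB of RAM holds 8 GiB of pages at 2:1, leaving 4 GiB free.
zram_ram = 4
zram_capacity = zram_ram * RATIO   # 8 GiB of uncompressed pages
app_ram = RAM - zram_ram           # 4 GiB left for uncompressed data
zram_swapped = DATA - app_ram      # 8 GiB pass through zram
assert zram_swapped <= zram_capacity
zram_traffic = zram_swapped * 3    # initial out + out + in on re-read
print(f"zram swap traffic: {zram_traffic} GiB")   # -> 24 GiB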

It would be good to see actual numbers for this; so far I've only seen
praise for how good the compression ratio is (plus anecdotal reports
from a few people).
But this should also be tested with actual CPUs and disks. zram is
obviously faster per operation, but at which point does the overhead
from compression, the reduced amount of unswapped memory and the
doubled number of swapping operations become smaller than the overhead
from SSD read/write speed? Is that the case almost immediately, or
only shortly before being OOM anyway?
The "too much CPU" limit would be the actual wallclock time testprograms
take without hitting OOM. If a program using 120% memory takes 90
seconds to complete its run with swap, and 60 seconds with zram swap,
that would be an improvement. If it's 120 seconds the most likely issue
is "too much CPU used for compression or swapping".

> One thing this might alter the calculus of is swappiness. Because the
> zram device is so much faster, page out page in is a lower penalty
> than file reclaim reads. So now instead of folks thinking swappiness
> should be 1 (or even 0), it's the opposite. It probably should be 100.
> 
> See the swappiness section in the Chris Down article referenced in the proposal:
> https://chrisdown.name/2018/01/02/in-defence-of-swap.html

This article states that setting swappiness to 100 may already work
fine on SSDs. But yes, swappiness definitely has an influence here. I
assume testing the edge cases (something around 2 and something around
100) should be enough.
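
Switching that knob between test runs is trivial; for completeness, a
tiny helper (needs root; the candidate values are just my suggested
edge cases, not anything from the proposal):

#!/usr/bin/env python3
# Set vm.swappiness between benchmark runs (needs root).
# Equivalent to: sysctl vm.swappiness=<value>
import sys

CANDIDATES = (2, 100)   # the two edge cases suggested above

def set_swappiness(value: int) -> None:
    with open("/proc/sys/vm/swappiness", "w") as f:
        f.write(str(value))

if __name__ == "__main__":
    value = int(sys.argv[1]) if len(sys.argv) > 1 else CANDIDATES[0]
    set_swappiness(value)
    print(f"vm.swappiness set to {value}")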

> There are worse things than OOM. Stuck with a totally unresponsive
> system and no OOM on the way. Hence earlyoom. And on-going resource
> control work with cgroupsv2 isolation.

This is true for boxes where the offending processes are not under
manual control; there it's better that any exploding program is
terminated as soon as possible.

It's exactly the other way round for manually controlled processes, as
a slowdown before reaching OOM is sometimes enough to decide what to
free up or terminate before the OOM killer just goes in brute-force.
That doesn't work too well nowadays, as the swap on disk quite often
fills too fast on SSDs, before I've had time to kill something.

Use case: I've got two browsers running, and I'm working on something
memory-intensive in one of them. When experiencing slowdown I'd like
to kill the other one, and not terminate the more memory-hungry one
I'm currently working in.

All the best,
David
