Re: Fedora 33 System-Wide Change proposal: swap on zram

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 7 Jun 2020 17:25:15 -0600

On Sun, Jun 7, 2020 at 2:48 PM David Kaufmann <astra@xxxxxxxx> wrote:
>
> On Sat, Jun 06, 2020 at 05:36:15PM -0600, Chris Murphy wrote:
> > To me this sounds like too much dependency on swap.
>
> That's not what I meant, I wanted to emphasize the different values of
> disk storage vs. RAM. As said in another email it doesn't matter at all
> if there is 0% or 90% of disk swap usage, while RAM usage can be quite
> essential. (This is in case swapped out stuff stays swapped out.)

Inactive pages that stay evicted long term are a workload that I think
would benefit from zswap instead. In that case you get the benefit of
the memory cache for recently used anonymous pages, which would
otherwise result in "swap thrashing", and the "least recently used"
pages are moved to disk based swap.

The inherent difficulty with optimizations is trying to find a
generic approach that helps most use cases. Is this a 100% winner? I
doubt it. Is it an 80% winner across all of Fedora? I think it's at
least that but sure, I can't prove it empirically. There's quite a lot
of evidence it's sane considering all the use cases it's already been
used in.

>
> > What people hate is slow swap.
>
> This is not generally true, only if RAM gets so tight that applications
> start competing for swap.
> This is why I've proposed test cases testing exactly that, as for
> the case of persistent swap I'd expect the outcome to be a clear win for
> disk swap. (Although this can in some cases also be seen as bug, as this
> would be applications not really using the allocated space)

I don't follow this. Where are the proposed test cases? And also in
what case are you saying disk swap is a clear win? Because I would
consider such an example an optimization for that specific edge case,
rather than a generic solution. We've had that as a generic solution
for a while and it's causing some grief for folks where there is
memory competition among applications and those pages need to be
evicted and then not long after paged in - which causes the swap
thrashing effect.

Arguably they need more memory for their workload. But that's in
effect what the feature does. It gives them more bandwidth for
frequently used anonymous pages being paged in and out via compressed
memory rather than significantly slower disk swap.

Is this free? Well, it's free in that it's not out of pocket cost for
more RAM. Instead it exchanges some CPU to make extra room in existing
memory and not have the explosively high latency of disk swap give
them a bad experience.

>
> > For sure there is an impact on CPU. This exchanges IO bound work, for
> > CPU and memory bound work. But it's pretty lightweight compression.
> >
> > And again, whatever is defined as "too much" CPU hit for the workload,
> > necessarily translates into either "suffer the cost of IO bound
> > swap-on-drive, or suffer the cost of more memory." There is no free
> > lunch. It really isn't magic.
>
> Yes, that seems obvious to me. What would be interesting is the point,
> where one is significantly slower than the other one.
> The theoretical testcase is writing data to memory and reading it again.
> For this case I'm assuming 8G RAM as total memory.
>
> Until about 95% mem usage I'd expect the disk swap case to win, as it
> should behave the same as no swap (with matching swappiness values)

Why would disk based swap win? In this example, where there's been no
page outs, the zram device isn't using any memory. Again, it is not a
preallocation.

> At 150% memory usage assuming a 2:1 compression ratio this would mean:
> - disk swap:
>   has to write 4G to disk initially, and for reading swap another 4G
>   (12G total traffic - 4G initial, 4G swapping out and 4G swapping in)
> - zram, assuming 4G zram swap:
>   has to write 8G to zram initially, and for reading the data swap 16G
>   (24G total traffic - 8G initial, 8G swapping out and 8G swapping in)

Swap contains anonymous pages, so I'm not sure what you mean by
initial. Whether these pages come from the internet, are typed in, or
come from persistent storage, it's a wash between disk and zram swap,
so it can be ignored.

Also I don't understand your math, in particular how you start with a
4G zram swap but end up with 8G. I think you're confused. The cap of
4GiB is the device size. The actual amount of RAM it uses will be less
due to compression. The zram device size is not the amount of memory
used. And in no case is memory preallocated; it is consumed only as the
zram device is used. It is easy to get confused, by the way. That was
my default state for days upon first stumbling on this.
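
To make the device size versus memory used distinction concrete, here
is a minimal Python sketch that reads the first three counters of a
zram mm_stat line. It assumes the device is zram0 and that the stats
are exposed at /sys/block/zram0/mm_stat; the sample line and the helper
names are made up for illustration, not taken from a real system.

```python
# Hypothetical helpers: interpret the first three fields of a zram
# mm_stat line (assumed path: /sys/block/zram0/mm_stat).
MM_STAT_FIELDS = (
    "orig_data_size",   # bytes of uncompressed data stored in the device
    "compr_data_size",  # bytes of that data after compression
    "mem_used_total",   # bytes of RAM actually consumed, incl. overhead
)


def parse_mm_stat(line):
    """Return the first three mm_stat counters as a dict of ints."""
    values = [int(v) for v in line.split()]
    # zip() stops at the shortest input, so later fields are ignored.
    return dict(zip(MM_STAT_FIELDS, values))


def compression_ratio(stats):
    """Uncompressed size divided by compressed size (0.0 if unused)."""
    if stats["compr_data_size"] == 0:
        return 0.0
    return stats["orig_data_size"] / stats["compr_data_size"]


# Made-up sample: 2 GiB of pages swapped out, stored in about 1 GiB of RAM.
sample = "2147483648 1073741824 1107296256 0 1107296256 0 0 0"
stats = parse_mm_stat(sample)
print(f"stored:   {stats['orig_data_size'] / 2**30:.2f} GiB")
print(f"RAM used: {stats['mem_used_total'] / 2**30:.2f} GiB")
print(f"ratio:    {compression_ratio(stats):.2f}:1")
```

An unused device reports zeros here regardless of its configured size,
which is the "not a preallocation" point.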

> It would be good to see actual numbers for this, so far I've only
> seen praises on how well the compression ratio is. (Plus the anecdotal
> references from a few people)

There are a lot of use cases using this in the real world: Chrome,
Android, Fedora IoT, Fedora ARM spins, and nearly all of the openQA VMs
doing Anaconda installation tests take advantage of it.

> But this should also be tested with actual CPUs and disks.

I've been doing it for a year on four separate systems. I am not a
scientific sample. But this is what I'm able to do.

> zram is
> obviously faster, but at which point is the overhead from compression,
> the reduced unswapped memory and the doubled number of swapping operations
> starting to be smaller than the overhead from SSD read/write speed?

I have definitely seen behavior that sounds like this. That's the case of:

8G RAM + 8G swap on zram (i.e. sized to 100% of RAM)
versus
8G RAM + 8G swap on SSD

And then compile webkitgtk using ninja's default job count, which on my
system is 10 jobs. The second one (swap on SSD) always becomes
completely unresponsive and I do a forced power off at 30 minutes. (I
have a few cases of 4+ hour compiles, none finished, some OOM. I have
many, as in over 100, cases of forced power off varying from 1-30
minutes.)

The first one, with zram, more often than not, ends with OOM inside of
30 minutes. I'd have to dig up hand written logs to see if there's any
pattern how long it takes, I think it's around 10 minutes but human
memory isn't exactly reliable so take this with a grain of salt. A
smaller number of times, the system is in a "CPU+memory" based swap
thrash. Approximately as you describe, it's probably just wedged in
making very slow progress because perhaps up to 1/3 or 1/2 of RAM is
being used for the zram device. And the compile flat out wants more
memory than is available.
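
As a rough illustration of that trade-off, the sketch below splits RAM
into the part holding compressed pages and the part left for
uncompressed pages. The 2:1 ratio is the assumption from the quoted
text, the function name is hypothetical, and allocator and metadata
overhead are ignored, so treat the numbers as back-of-envelope only.

```python
def memory_split(ram_gib, zram_fraction, compression_ratio):
    """Back-of-envelope split of RAM when zram holds compressed pages.

    Returns (uncompressed_gib, stored_in_zram_gib, total_capacity_gib).
    Overhead is ignored; this is an illustration, not a model of the
    kernel's behavior.
    """
    zram_ram = ram_gib * zram_fraction      # RAM consumed by zram itself
    uncompressed = ram_gib - zram_ram       # RAM left for active pages
    stored = zram_ram * compression_ratio   # data held inside zram
    return uncompressed, stored, uncompressed + stored


# 8 GiB RAM, half of it consumed by zram, assumed 2:1 compression.
print(memory_split(8, 0.5, 2.0))   # (4.0, 8.0, 12.0)
# Same machine with a quarter of RAM consumed by zram.
print(memory_split(8, 0.25, 2.0))  # (6.0, 4.0, 10.0)
```

Total capacity grows as zram takes more RAM, but the uncompressed
working space shrinks, which is where the CPU and memory bound thrash
would come from.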

This task only succeeds with ~12+G of disk based swap. Which is just
not realistic. It's a clearly overcommitted and thus contrived test.
But I love it and hate it at the same time. More realistic is to not
use defaults, and set the number of jobs manually to 6. And in this
case, zram based swap consistently beats disk based swap. Which makes
sense because pretty much all of the inactive pages are going to be
needed at some point by the compile or they are dropped. Following the
compile there aren't a lot of inactive pages left, and I'm not sure
they're even related to the compile at all.

> Is this almost immediately the case or is this only closely before being
> OOM anyway?
> The "too much CPU" limit would be the actual wallclock time testprograms
> take without hitting OOM. If a program using 120% memory takes 90
> seconds to complete its run with swap, and 60 seconds with zram swap,
> that would be an improvement. If it's 120 seconds the most likely issue
> is "too much CPU used for compression or swapping".

Sure and I think any person is going to notice this kind of latency
without even wall clock timing it. But anyway I time my compiles using
the time command.
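
For anyone who wants to script the comparison rather than use the time
command by hand, a minimal Python sketch of wall clock timing follows.
The helper name is hypothetical, and the command being timed here is a
trivial placeholder rather than a real build.

```python
import subprocess
import sys
import time


def wall_clock_seconds(argv):
    """Run a command and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    return time.perf_counter() - start, completed.returncode


# Placeholder workload; substitute the real build command when testing.
elapsed, rc = wall_clock_seconds([sys.executable, "-c", "pass"])
print(f"exit={rc} elapsed={elapsed:.2f}s")
```

Running the same workload once with disk swap and once with zram swap,
and comparing the two elapsed values, gives the kind of number asked
for above.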

> > There are worse things than OOM. Stuck with a totally unresponsive
> > system and no OOM on the way. Hence earlyoom. And on-going resource
> > control work with cgroupsv2 isolation.
>
> This is true for boxes where the offending processes are not under manual
> control, where it's better that any exploding program is being
> terminated as soon as possible.

Even under manual control we've got examples of the GUI becoming
completely stuck. There are long threads on devel@ based on this
Workstation working group issue, with the same name. So just search the
archives for "interactivity". Or maybe "webkitgtk".

#98 Better interactivity in low-memory situations
https://pagure.io/fedora-workstation/issue/98

> It's exactly the other way round for manual controlled processes, as a
> slowdown before getting to OOM is sometimes enough to be able to decide
> what to free up/terminate, before OOM-Killer just goes in brute-force.
> That doesn't work too well nowadays, as quite often the swap on disk
> fills too fast on SSDs before I've got time to kill something.

earlyoom will kill in such a case even if you can't. It's configurable
and intentionally simplistic, based on memory and swap free
percentage.
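
The idea can be sketched in a few lines of Python. This is not
earlyoom's actual code: the function names are hypothetical, the 10%
thresholds are placeholders, and treating a system with no swap as
"swap exhausted" is an assumption of this sketch. The input is text in
the format of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def should_kill(info, mem_min_pct=10.0, swap_min_pct=10.0):
    """True when both available memory and free swap are at or below
    their thresholds (placeholder values, not earlyoom's defaults)."""
    mem_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    # Assumption: with no swap configured, count swap as exhausted.
    swap_pct = 100.0 * info.get("SwapFree", 0) / swap_total if swap_total else 0.0
    return mem_pct <= mem_min_pct and swap_pct <= swap_min_pct


# Made-up sample: 5% memory available and 5% swap free.
tight = """MemTotal:       8000000 kB
MemAvailable:    400000 kB
SwapTotal:      8000000 kB
SwapFree:        400000 kB
"""
# Same memory pressure, but half of swap is still free.
roomy = tight.replace("SwapFree:        400000", "SwapFree:       4000000")

print(should_kill(parse_meminfo(tight)))  # True
print(should_kill(parse_meminfo(roomy)))  # False
```

Because the check looks only at two percentages, it does not need to
know whether swap lives on disk or on zram.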

-- 
Chris Murphy