On Fri, Jun 5, 2020 at 8:33 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 5, 2020 at 8:12 PM Samuel Sieb <samuel@xxxxxxxx> wrote:
> >
> > >> # swapon
> > >> NAME      TYPE      SIZE USED  PRIO
> > >> /dev/sda3 partition  16G 1.9G    -2
> > >> /zram0    partition   4G   4G 32767
> > >>
> > >> This looks like I'm using all 4G of allocated space in the zram swap, but:
> > >>
> > >> # zramctl
> > >> NAME       ALGORITHM DISKSIZE DATA  COMPR  TOTAL STREAMS MOUNTPOINT
> > >> /dev/zram0 lz4             4G 1.8G 658.5M 679.6M       4
> > >>
> > >> This suggests that it's only using 1.8G. Can you explain what this means?
> > >
> > > Yeah, that's confusing. zramctl just gets its info from sysfs, but you
> > > could double-check it with
> > >
> > > cat /sys/block/zram0/mm_stat
> > >
> > > The first value should match the "DATA" column in zramctl (which
> > > reports in MiB).
> > >
> > > While the kernel has long had support for using up to 32 swap devices
> > > at the same time, this is seldom used in practice, so it could be an
> > > artifact of that: the swap subsystem reports all of this swap as "in
> > > use", while zramctl is telling you the truth about what the zram
> > > kernel module is actually using. Is it a cosmetic reporting bug or
> > > intentional? Good question. I'll try to reproduce it and report it
> > > upstream and see what they say. But if you beat me to it that would
> > > be great, and then I can just write the email for linux-mm and cite
> > > your bug report. :D
> >
> > Part of my concern is that if it's not actually full, then why is it
> > using so much of the disk swap?

OK, I can't explain what you're seeing because I'm not certain of the
workload. Here's what I've got going on: F32, 5.7.0-fc33 kernel, 8G RAM,
4G swap-on-zram, 10G swap-on-drive, with swap-on-zram at the higher
priority. Before doing any swapping:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7864        1395        4251          60        2216        6122
Swap:         14559           0       14559
$ swapon
NAME       TYPE       SIZE USED PRIO
/dev/sda5  partition 10.4G   0B   -2
/dev/zram0 partition  3.9G   0B    3
$ zramctl
NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       3.9G   4K   74B   12K       8 [SWAP]

Then I build webkitgtk using 'ninja -j10'.

What happens? Only swap-on-zram0 is used for a while. It fills up, and
then swap-on-sda5 starts being used. But also, the used value on
swap-on-zram0 goes down and up, and down and up. During this time,
swap-on-sda5 mostly doesn't change. Sometimes it goes down. But it only
goes back up again if swap-on-zram0 is already full.

I think this is working as I expect, because once anonymous pages are
in either swap, they don't migrate between the swaps. But anonymous
pages can always be deallocated from either swap at any time, which
leads to the appearance that zram0 isn't being used to the max - well,
because it's not.
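To watch that churn live, rather than from point-in-time snapshots,
something like this is enough (a minimal sketch; the one second
interval is an arbitrary choice):

$ watch -n1 'swapon; echo; zramctl'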
Here is an example.

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7864        5613        1945          63         305        1929
Swap:         14559        4745        9814
$ swapon
NAME       TYPE       SIZE USED PRIO
/dev/sda5  partition 10.4G 1.9G   -2
/dev/zram0 partition  3.9G 2.6G    3
$ zramctl
NAME       ALGORITHM DISKSIZE DATA  COMPR  TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       3.9G 2.4G 582.1M 877.6M       8 [SWAP]
$

The gap between COMPR and TOTAL might seem big. And in fact it might be
fragmentation rather than metadata overhead, as was suggested earlier.
But it changes a lot, and fast, with this workload. Just a couple of
minutes later:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7864        7602         125           1         136          56
Swap:         14559        7342        7216
$ swapon
NAME       TYPE       SIZE USED PRIO
/dev/sda5  partition 10.4G 3.4G   -2
/dev/zram0 partition  3.9G 3.9G    3
$ zramctl
NAME       ALGORITHM DISKSIZE DATA  COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       3.9G 3.9G 913.1M  954M       8 [SWAP]
$

I'm getting a 4:1 compression ratio with this workload, by the way. So
far it's not using more than 1GiB of RAM to save me 4G, for a net
savings of 3G of regular memory. But more important than the
compression? The fact that 4GiB did not need to page out to, and back
in from, the SSD. And in fact, as the workload progresses, it's saving
quite a lot more than 4G of pageouts to disk - I just don't have a
cumulative value.

Also? The workload spends a lot of time in CPU wait states when it's IO
bound, waiting on the drive, even though it's an SSD.

The zram-based swap is only as smart as the workload is at properly
deallocating things it no longer needs. If the workload is holding onto
anonymous pages and they're in the swap-on-zram device, then that's it:
back to disk swapping only.

I haven't yet seen swapon claim the swap was full while zramctl said it
wasn't. But this changes fast, depending on the workload. And it might
also be true that there's some latency in the reporting between them.

--
Chris Murphy
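P.S. For anyone who wants the compression ratio as a number instead of
eyeballing zramctl: if I'm reading the kernel's zram documentation
right, the first two mm_stat columns are orig_data_size and
compr_data_size, in bytes, so a one-liner like this should print it (a
sketch, with a guard against dividing by zero on an empty device):

$ awk '$2 > 0 { printf "%.1f:1\n", $1 / $2 }' /sys/block/zram0/mm_stat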