Re: Fedora 34 Change: Enable btrfs transparent zstd compression by default (System-Wide Change proposal)

Hi,

On 2/11/21 11:05 PM, Chris Murphy wrote:
> On Thu, Feb 11, 2021 at 9:58 AM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
>
>> Hi,
>>
>> On 1/1/21 8:59 PM, Chris Murphy wrote:
>>
>>> Anyway, compress=zstd:1 is a good default. Everyone benefits, and I'm
>>> not even sure someone with a very fast NVMe drive will notice a
>>> slowdown, because the compression/decompression is threaded.
>>
>> I disagree that everyone benefits. Any read-latency-sensitive workload
>> will be slower, due to the application latency being the drive
>> latency plus the decompression latency. And as the kernel benchmarks
>> indicate, very few systems are going to get anywhere near the performance
>> of even baseline NVMe drives when it comes to throughput.

> It's possible some workloads on NVMe might have faster reads or writes
> without compression.
>
> https://github.com/facebook/zstd
>
> btrfs compress=zstd:1 translates into zstd -1 right now; there are
> some ideas to remap btrfs zstd:1 to one of the newer zstd --fast
> options, but it's just an idea. And in any case the default for btrfs
> and zstd will remain 3 and -3 respectively, which is what
> 'compress=zstd' maps to, making it identical to 'compress=zstd:3'.
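
To make that mapping concrete, the two settings look like this side by side (device, mount point, and sample file below are placeholders, nothing from this thread):

  # zstd:1 vs. the zstd default level, plus the userspace tool's own benchmark mode.
  $ sudo mount -o compress=zstd:1 /dev/nvme0n1p3 /mnt   # btrfs asks zstd for level 1
  $ sudo mount -o remount,compress=zstd /mnt            # no level given: level 3, same as zstd:3
  $ zstd -b1 -e3 /mnt/some-large-file                   # in-memory benchmark of levels 1 through 3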

> I have a laptop with NVMe and haven't come across such a workload so
> far, but this is obviously not a scientific sample. I think you'd need
> a process that's producing read/write rates that the storage can meet,
> but that the compression algorithm limits. Btrfs is threaded, as is
> the compression.

> What's typical is no change in performance, and sometimes a small
> increase in performance. It essentially trades some CPU cycles
> in exchange for less IO. That includes less time reading and writing,
> but also less latency, meaning the gain on rotational media is
> greater.

>> Worse, if the workload is very parallel and already at max CPU, the
>> compression overhead will only make that situation worse. (I suspect
>> you could test this just by building some packages that have good
>> parallelism during the build.)

> This is compiling the kernel on a 4/8-core CPU (circa 2011) using make
> -j8; the kernel running is 5.11-rc7.
>
> no compression
>
> real    55m32.769s
> user    369m32.823s
> sys     35m59.948s
>
> ------
>
> compress=zstd:1
>
> real    53m44.543s
> user    368m17.614s
> sys     36m2.505s
>
> That's a one-time test, and it's a ~3% improvement. *shrug* We don't
> really care too much these days about 1-3% differences when doing
> encryption, so I think this is probably in that ballpark, even if it
> turns out another compile is 3% slower. This is not a significantly
> read- or write-centric workload, it's mostly CPU, so this 3% difference
> may not even be related to the compression.

Did you drop caches/etc. between runs? I git cloned mainline, copied the Fedora kernel config from /boot, and, on a fairly recent laptop (12 threads) with a software-encrypted NVMe, dropped caches and ran a timed make against a compressed directory and an uncompressed one, in both a memory-constrained (4G) setup and a 32G RAM setup (compressed swapping disabled, because the machine has an encrypted swap for hibernation and crashdumps).

compressed (32G RAM):
real    22m40.129s
user    221m9.816s
sys     23m37.038s

uncompressed (32G RAM):
real    21m53.366s
user    221m56.714s
sys     23m39.988s

uncompressed (4G RAM):
real    28m48.964s
user    288m47.569s
sys     30m43.957s

compressed (4G RAM):
real    29m54.061s
user    281m7.120s
sys     29m50.613s

And that is not an IO-constrained workload; it's generally CPU constrained. And since the caches are warm due to the software encryption, the decompression times should be much faster than on machines that aren't already stashing that data in cache.
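
Roughly, that kind of A/B run looks like the sketch below (directory names, the forced-compression setup, and the -j value are placeholders, not the exact commands used):

  # Cold-cache A/B kernel build sketch.  linux-zstd/ lives under a directory with
  # forced compression (e.g. set up with 'btrfs property set <dir> compression zstd');
  # linux-none/ is left uncompressed.
  $ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches   # start each run cold
  $ (cd linux-none && time make -j12)
  $ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
  $ (cd linux-zstd && time make -j12)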

The machine above can actually peg all 6 cores until it hits thermal limits simply doing cp's with btrfs/zstd compression, all the while losing about 800MB/sec of IO bandwidth versus the raw disk. Turning an IO-bound problem into a CPU-bound one isn't ideal.

Compressed disks only work in situations where the CPUs can compress/decompress faster than the disk, or where the workload significantly reduces IO because the working set is streaming rather than random. Any workload that has a random-read component and tends toward page-sized reads/writes is going to get hurt, and god help you if it's an RMW cycle. Similarly for parallelized compression, which only scales if the IO sizes are large enough that it's worth the IPI overhead of bringing additional cores online and the resulting chunks are still big enough to be dealt with individually.
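
For anyone who wants to poke at the random-read case specifically, an fio run along these lines against a compressed directory, then repeated against an uncompressed one, would show it (paths, size, and runtime are placeholders):

  # 4k random reads; repeat with --directory pointed at an uncompressed path.
  # --buffer_compress_percentage makes the laid-out data compressible, otherwise
  # btrfs just stores the fio files uncompressed and the comparison is meaningless.
  $ fio --name=randread --directory=/mnt/zstd1 --rw=randread --bs=4k --size=2G \
        --numjobs=4 --ioengine=libaio --buffer_compress_percentage=50 \
        --time_based --runtime=60 --group_reporting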





>> Plus, the write amplification comment isn't even universal, as there
>> continue to be controllers where the flash translation layer is
>> compressing the data.

> At least consumer SSDs tend to just do concurrent write dedup. File
> system compression isn't limited to Btrfs; there's also F2FS,
> contributed by Samsung, which implements compression these days as
> well, although they commit to it at mkfs time, whereas on Btrfs it's
> a mount option. Mixing and matching compressed extents is routine on
> Btrfs anyway, so there's no concern with users mixing things up. They
> can change the compression level and even the algorithm with impunity,
> just by tacking it onto a remount command. It's not even necessary to
> reboot.
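
For example, something like the following (mount point and paths are placeholders; compsize is the separate btrfs-compsize tool):

  # Change the level or algorithm on a live filesystem; only new writes are
  # affected, existing extents keep whatever they were written with.
  $ sudo mount -o remount,compress=zstd:1 /home
  $ sudo mount -o remount,compress=lzo /home                  # switch algorithm, no reboot
  $ sudo btrfs filesystem defragment -r -czstd /home/somedir  # optionally rewrite old data compressed
  $ compsize /home/somedir                                    # show on-disk usage per algorithm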


>> OTOH, it makes a lot more sense on a lot of these ARM/SBC boards
>> using MMC, because the disks are so slow. Maybe if something like
>> this were made the default, the machine should run a quick CPU
>> compress/decompress vs. IO speed test and only enable compression if
>> the compress/decompress speed is at least the IO rate.
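
A rough sketch of what such a probe could look like, assuming the standard zstd and hdparm tools (device name and sample file are placeholders, and the comparison/threshold logic is left out):

  # Compare zstd level-1 throughput against the raw device's read rate and only
  # turn compression on when the CPU side is at least as fast as the storage.
  $ zstd -b1 /var/tmp/sample.dat   # prints compress/decompress MB/s for level 1
  $ sudo hdparm -t /dev/mmcblk0    # rough uncached sequential read rate of the MMC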

> It's not that simple, because neither the user-space writers nor the
> kworkers are single threaded. You'd need a particularly fast NVMe
> matched with a not-so-fast CPU and a workload that somehow dumps a
> lot of data in a way that the compression acts as a bottleneck.
>
> It could exist, but it's not a problem I've seen per se. If you
> propose a test, I can do A/B testing.
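
One possible shape for such a test, purely as a suggestion (mount points and size are placeholders): time a big streaming write of compressible data into a compress=zstd:1 mount and into an uncompressed one, and see whether the compressed side falls behind.

  # Streaming-write comparison; trivially compressible input, so it leans in
  # compression's favor on purpose.
  $ time sh -c 'yes abcdefghijklmnopqrstuvwxyz | head -c 8G > /mnt/zstd1/big; sync'
  $ time sh -c 'yes abcdefghijklmnopqrstuvwxyz | head -c 8G > /mnt/nocomp/big; sync'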


> --
> Chris Murphy




