Hi,
On 2/11/21 11:05 PM, Chris Murphy wrote:
On Thu, Feb 11, 2021 at 9:58 AM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
Hi,
On 1/1/21 8:59 PM, Chris Murphy wrote:
Anyway, compress=zstd:1 is a good default. Everyone benefits, and I'm
not even sure someone with a very fast NVMe drive will notice a
slowdown, because the compression/decompression is threaded.
I disagree that everyone benefits. Any read-latency-sensitive workload
will be slower, because the application latency becomes the drive
latency plus the decompression latency. And as the kernel benchmarks
indicate, very few systems are going to get anywhere near the
performance of even baseline NVMe drives when it comes to throughput.
It's possible some workloads on NVMe might have faster reads or writes
without compression.
https://github.com/facebook/zstd
btrfs compress=zstd:1 translates into zstd -1 right now; there are
some ideas to remap btrfs zstd:1 to one of the newer zstd --fast
options, but it's just an idea. And in any case the default for btrfs
and zstd will remain as 3 and -3 respectively, which is what
'compress=zstd' maps to, making it identical to 'compress=zstd:3'.
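If you want a rough feel for what the levels cost, zstd's own
benchmark mode can be pointed at a representative file; a minimal
sketch (sample.tar is just a placeholder for whatever data you care
about):

  # compare ratio and speed for zstd levels 1 through 3
  zstd -b1 -e3 sample.tar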
I have a laptop with NVMe and haven't come across such a workload so
far, but this is obviously not a scientific sample. I think you'd need
a process that's producing read/write rates that the storage can meet,
but that the compression algorithm limits. Btrfs is threaded, as is
the compression.
What's typical is no change in performance, and sometimes a small
increase. It essentially trades some CPU cycles in exchange for less
IO. That includes less time reading and writing, but also less
latency, meaning the gain on rotational media is greater.
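To see how much IO the compression is actually saving on an existing
filesystem, the compsize tool (packaged separately) reports on-disk
vs. uncompressed sizes; something like:

  # report compressed vs. uncompressed footprint under /usr
  sudo compsize -x /usr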
Worse, if the workload is very parallel and already at max CPU, the
compression overhead will only make that situation worse. (I suspect
you could test this just by building some packages that parallelize
well during the build.)
This is compiling the kernel on a 4/8-core CPU (circa 2011) using
make -j8; the kernel running is 5.11-rc7.
no compression
real 55m32.769s
user 369m32.823s
sys 35m59.948s
------
compress=zstd:1
real 53m44.543s
user 368m17.614s
sys 36m2.505s
That's a one-time test, and it's a ~3% improvement. *shrug* We don't
really care too much these days about 1-3% differences when doing
encryption, so I think this is probably in that ballpark, even if it
turns out another compile is 3% slower. This is not a significantly
read- or write-centric workload; it's mostly CPU, so the 3% difference
may not even be related to the compression.
Did you drop caches/etc. between runs? Because I git cloned mainline,
copied the Fedora kernel config from /boot, and on a fairly recent
laptop (12 threads) with a software-encrypted NVMe, dropped caches and
did a timed make against a compressed directory and an uncompressed
one, in both a semi-constrained (4G) setup and a 32G RAM setup
(compressed swapping disabled, because the machine has an encrypted
swap for hibernation and crashdumps).
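In rough terms the sequence looks something like this (the directory
name, the per-directory compression property, and the config handling
are illustrative, not the exact commands):

  # build directory whose new files inherit zstd compression
  mkdir linux-zstd
  btrfs property set linux-zstd compression zstd
  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-zstd/linux
  cp /boot/config-$(uname -r) linux-zstd/linux/.config
  cd linux-zstd/linux && make olddefconfig
  # cold-cache timed build
  sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
  time make -j12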
compressed:
real 22m40.129s
user 221m9.816s
sys 23m37.038s
uncompressed:
real 21m53.366s
user 221m56.714s
sys 23m39.988s
uncompressed 4G ram:
real 28m48.964s
user 288m47.569s
sys 30m43.957s
compressed 4G
real 29m54.061s
user 281m7.120s
sys 29m50.613s
And that is not an IO-constrained workload; it's generally CPU
constrained, and since the caches are warm due to the software
encryption, the decompress times should be much faster than on
machines that aren't cache stashing.
The machine above can actually peg all 6 cores until it hits thermal
limits simply doing cp's with btrfs/zstd compression, all the while
losing about 800MB/sec of IO bandwidth compared to the raw disk.
Turning an IO-bound problem into a CPU-bound one isn't ideal.
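That's easy enough to reproduce: stream a large file into a
compress=zstd mount while watching the device and the CPUs. A sketch,
with placeholder paths:

  # device throughput in one terminal
  iostat -x 1
  # in another, copy something large into the compressed filesystem
  cp /path/to/large-file /mnt/compressed/ && sync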
Compressed disks only work in situations where the CPUs can
compress/decompress faster than the disk, or where the workload
manages to significantly reduce IO because the working set is
streaming rather than random. Any workload that has a random read
component and tends toward page-sized reads/writes is going to get
hurt, and god help you if it's a RMW cycle. Similarly for parallelized
compression, which only scales if the IO sizes are large enough that
it's worth the IPI overhead of bringing additional cores online and
the resulting chunks are still large enough to be dealt with
individually.
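If someone wants to quantify the random read case, a buffered 4k
random-read fio run against a file in a compressed directory versus an
uncompressed one should show it (file name and size are placeholders):

  fio --name=randread --filename=/mnt/compressed/testfile \
      --rw=randread --bs=4k --size=4G --runtime=60 --time_based \
      --ioengine=psync --direct=0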
Plus, the write amplification comment isn't even universal, as there
continue to be controllers where the flash translation layer is
compressing the data.
At least consumer SSDs tend to just do concurrent write dedup.
Filesystem compression isn't limited to Btrfs; there's also F2FS,
contributed by Samsung, which implements compression these days as
well, although they commit to it at mkfs time, whereas on Btrfs it's a
mount option. Mixing and matching compressed extents is routine on
Btrfs anyway, so there's no concern with users mixing things up. They
can change the compression level and even the algorithm with impunity,
just by tacking it onto a remount command. It's not even necessary to
reboot.
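For example, something along these lines, which only affects data
written after the remount (existing extents keep whatever they were
written with):

  # switch the mounted filesystem to zstd level 2 on the fly
  sudo mount -o remount,compress=zstd:2 /
  # or switch algorithms entirely
  sudo mount -o remount,compress=lzo /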
OTOH, it makes a lot more sense on a lot of these arm/SBC boards
using MMC, because the disks are so slow. Maybe if something like this
were made the default, the machine should run a quick CPU
compress/decompress vs. IO speed test and only enable compression if
the compress/decompress speed is at least the IO rate.
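A crude version of that check could just compare zstd's benchmark
throughput against a raw sequential read of the device (the device and
sample file are placeholders, and a real heuristic would need to
account for threading):

  # zstd level 1 compress/decompress throughput on sample data
  zstd -b1 sample.tar
  # raw sequential read speed from the block device
  sudo dd if=/dev/mmcblk0 of=/dev/null bs=1M count=2048 iflag=direct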
It's not that simple, because neither the user space writers nor the
kworkers are single threaded. You'd need a particularly fast NVMe
matched with a not-so-fast CPU and a workload that somehow dumps a lot
of data in a way that makes the compression act as a bottleneck. It
could exist, but it's not a problem I've seen per se. If you propose a
test, though, I can do A/B testing.
--
Chris Murphy