Re: Fedora 34 Change: Enable btrfs transparent zstd compression by default (System-Wide Change proposal)

Hi,

On 2/14/21 2:20 PM, Chris Murphy wrote:
On Sat, Feb 13, 2021 at 9:45 PM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:

Hi,

On 2/11/21 11:05 PM, Chris Murphy wrote:
On Thu, Feb 11, 2021 at 9:58 AM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:

Hi,

On 1/1/21 8:59 PM, Chris Murphy wrote:

Anyway, compress=zstd:1 is a good default. Everyone benefits, and I'm
not even sure someone with a very fast NVMe drive will notice a
slowdown, because the compression/decompression is threaded.

I disagree that everyone benefits. Any read-latency-sensitive workload
will be slower, because the application latency becomes the drive
latency plus the decompression latency. And as the kernel benchmarks
indicate, very few systems are going to get anywhere near the throughput
of even baseline NVMe drives.

It's possible some workloads on NVMe might have faster reads or writes
without compression.

https://github.com/facebook/zstd

btrfs compress=zstd:1 translates into zstd -1 right now; there are
some ideas to remap btrfs zstd:1 to one of the newer zstd --fast
options, but it's just an idea. And in any case the default for btrfs
and zstd will remain as 3 and -3 respectively, which is what
'compress=zstd' maps to, making it identical to 'compress=zstd:3'.
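
For reference, trying zstd:1 on an existing filesystem is just a remount
away; something like the following should work (the mountpoint and UUID
are placeholders, not taken from any particular system):

# remount an existing btrfs filesystem with level 1 zstd compression
mount -o remount,compress=zstd:1 /

# or make it persistent in /etc/fstab (UUID is a placeholder)
UUID=xxxx-xxxx  /  btrfs  compress=zstd:1  0 0

Only newly written data is affected; existing files stay as they were
until rewritten.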

I have a laptop with NVMe and haven't come across such a workload so
far, but this is obviously not a scientific sample. I think you'd need
a process that's producing read/write rates that the storage can meet,
but that the compression algorithm limits. Btrfs is threaded, as is
the compression.

What's typical is no change in performance, and sometimes a small
increase in performance. It essentially trades some CPU cycles
in exchange for less IO. That includes less time reading and writing,
but also less latency, meaning the gain on rotational media is
greater.

Worse, if the workload is very parallel and already at max CPU,
the compression overhead will only make that situation worse as well. (I
suspect you could test this just by building some packages that have
good parallelism during the build.)

This is compiling the kernel on a 4/8-core CPU (circa 2011) using make
-j8, the kernel running is 5.11-rc7.

no compression

real    55m32.769s
user    369m32.823s
sys     35m59.948s

------

compress=zstd:1

real    53m44.543s
user    368m17.614s
sys     36m2.505s

That's a one-time test, and it's a ~3% improvement. *shrug* We don't
really care too much these days about 1-3% differences when doing
encryption, so I think this is probably in that ballpark, even if it
turns out another compile is 3% slower. This is not a significantly
read- or write-centric workload; it's mostly CPU. So this 3% difference
may not even be related to the compression.

Did you drop caches/etc between runs?

Yes. And also did the test with uncompressed source files when
compiling without the compress mount option. And compressed source
files when compiling with the compress mount option. While it's
possible to mix those around (there's four combinations), I kept them
the same since those are the most common.
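
Roughly, each run looked like this (the source path is just an example;
the only change between runs was the compress mount option and whether
the tree had been written with it active):

sync; echo 3 > /proc/sys/vm/drop_caches   # flush dirty data, then drop the page cache
cd linux                                   # kernel source tree, path is an example
time make -j8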



I git cloned mainline, copied the Fedora kernel config from /boot, and
on a fairly recent laptop (12 threads) with a software-encrypted NVMe,
dropped caches and did a timed make against a compressed directory and
an uncompressed one, with both a semi-constrained (4G) setup and a 32G
RAM setup (compressed swapping disabled, because the machine has an
encrypted swap for hibernation and crashdumps).

compressed:
real    22m40.129s
user    221m9.816s
sys     23m37.038s

uncompressed:
real    21m53.366s
user    221m56.714s
sys     23m39.988s

uncompressed 4G ram:
real    28m48.964s
user    288m47.569s
sys     30m43.957s

compressed 4G ram:
real    29m54.061s
user    281m7.120s
sys     29m50.613s


While the feature page doesn't claim it always increases performance,
it also doesn't say it can reduce performance. In the CPU intensive
workloads, it stands to reason there's going to be some competition.
The above results strongly suggest that's what's going on, even if I
couldn't reproduce it. But performance gain/loss isn't the only factor
for consideration.


And that is not an IO-constrained workload, it's generally CPU
constrained; and since the caches are warm due to the software
encryption, the decompress times should be much faster than on machines
that aren't cache stashing.

I don't know, so I can't confirm or deny any of that.


The machine above can actually peg all 6 cores until it hits thermal
limits simply doing cp's with btrfs/zstd compression, all the while
losing about 800MB/sec of IO bandwidth compared to the raw disk. Turning
an IO-bound problem into a CPU-bound one isn't ideal.
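
That's easy to see with something like the following (file name and
mountpoint are only examples):

# copy a large file onto the compressed filesystem
cp large-dataset.img /mnt/btrfs-zstd/ &

# in another terminal: per-device throughput (iostat) and per-core CPU (top/htop)
iostat -xm 1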

It's a set of tradeoffs. And there isn't a governor that can assess
when an IO bound bottleneck becomes a CPU bound one.

Compressed disks only help in the situation where the CPUs can
compress/decompress faster than the disk, or where the workload manages to
significantly reduce IO because the working set is streaming rather than
random.
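
zstd's built-in benchmark mode gives a quick sanity check of which side
of that line a given machine falls on (the file name is an example; use
something representative of the real data):

# benchmark zstd levels 1-3 and compare the reported compress/decompress
# MB/s against the drive's raw sequential throughput
zstd -b1 -e3 representative-file.bin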

This isn't sufficiently qualified. It does work to reduce space
consumption and write amplification. It's just that there's a tradeoff
that you dislike, which is IO reduction. And it's completely
reasonable to have a subjective position on this tradeoff. But no
matter what there is a consequence to the choice.

IO reduction in some cases (see below), in exchange for additional read latency and an increase in CPU utilization.

For a desktop workload the added read latency is likely the larger problem. But as we all know, sluggishness is a hard thing to measure on a desktop. QD1 pointer chasing on disk is a good approximation, though; sometimes boot times are too.
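
fio can approximate that kind of QD1 pointer chasing; a sketch along
these lines should do it (file path and size are examples, and
buffer_compress_percentage is there so the test data is partially
compressible and compression actually engages):

sync; echo 3 > /proc/sys/vm/drop_caches    # so reads actually hit the device
fio --name=qd1-randread --filename=/mnt/btrfs-zstd/fio-test --size=4G \
    --rw=randread --bs=4k --iodepth=1 --ioengine=psync \
    --buffer_compress_percentage=50 --runtime=60 --time_based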



Any workload which has a random read component and tends toward
page-sized reads/writes is going to get hurt, and god help you if it's
an RMW cycle.

Why?

Note that not everything gets compressed. There is an estimation of
whether compression is worth it, because it's not worth compressing
a small amount of data unless it means a reduction in
used blocks. I.e. if 6KiB can be compressed such that it involves
writing to one 4KiB sector instead of two, then the compression
happens. If it's 4KiB of data that can't be compressed, it isn't
attempted and is written to a 4KiB sector. Since the minimum block
size is 4KiB (on x86), the only way it would be compressed is if the
compression means it'll be written as an inline extent in the metadata
leaf along with its inode, i.e. 2048 bytes or less.

A larger file might have a mix of compressed and non-compressed
extents, based on this "is it worth it" estimate. This is the
difference between the compress and compress-force options, where
force drops this estimator and depends on the compression algorithm to
do that work. I sometimes call that estimator the "early bailout"
check.
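
If anyone wants to see how that estimate plays out on real data, the
compsize tool reports how much of a tree actually ended up compressed,
per algorithm, versus stored as-is (the path is just an example):

# breakdown of compressed vs. uncompressed extents for everything under /usr
compsize /usr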

Compression estimation is its own ugly ball of wax. But ignoring that for the moment, consider what happens if you have a bunch of 2G database files with a reasonable compression ratio. Let's assume for a moment the database attempts to update records in the middle of the files. What happens when the compression ratio gets slightly worse? (It's likely you already have nodatacow.) The usual outcome is that what should be a mostly sequential file starts to become fragmented. If it's fragmenting on every update, what seems like an edge case suddenly becomes a serious problem. That is an entirely different form of write amplification. The alternative is to attempt to recompress/rewrite fairly large parts of the stream. I'm not sure if pgbench is "compression aware" or if its data is realistic enough, but it might be entertaining to see what happens between nodatacow and compress. Although this becomes a case of seeing whether the "compression estimation" logic is smart enough to detect that it's causing poor IO patterns (while still having a reasonably good compression ratio).
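
A rough way to poke at that (directory and file names are only examples,
and I haven't verified pgbench's data is compressible enough to be
representative) would be something like:

# the usual database setup: nodatacow, no compression
mkdir dbdir-nocow && chattr +C dbdir-nocow

# per-directory zstd compression, inherited by files created inside it
mkdir dbdir-zstd && btrfs property set dbdir-zstd compression zstd

# point the database data directory at each in turn, run the same
# update-heavy workload (e.g. pgbench), then compare extent counts
filefrag dbdir-nocow/datafile dbdir-zstd/datafile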

In a past life, I spent a not inconsequential part of a decade engineering compressed ram+storage systems (similar to what has been getting merged to mainline over the past few years). It's really hard to make one that is performant across a wide range of workloads. What you get are areas where it can help, but if you average those cases with the ones where it hurts, the overwhelming conclusion is that you shouldn't be compressing unless you want the capacity. The worst part is that most synthetic file IO benchmarks tend to be on the "it helps" side of the equation, and the real applications on the other.


IMHO if Fedora wanted to take a hit on the IO perf side, a much better place to focus would be flipping encryption on. The perf profile is flatter (AES-NI and the Arm crypto extensions are common) with fewer evil edge cases. Or a more controlled approach might be picking a couple of fairly self-contained directories and enabling compression there (say /usr), as sketched below.
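
Per-directory compression doesn't even need a mount option; something
along these lines should do it (not tested here):

# enable zstd for new files under /usr, without touching mount options
btrfs property set /usr compression zstd
btrfs property get /usr compression

# recompress what's already there
btrfs filesystem defragment -r -czstd /usr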




Similarly for parallelized compression, which
is only scalable if the IO sizes are large enough that it's worth the IPI
overhead of bringing additional cores online and the resulting chunks
are still large enough to be dealt with individually.

I don't know if it's possible for PSI (CPU pressure, and separately
context switches) to be used to decide to inhibit compression entirely,
including the estimation of whether it's worth it. That'd be a
question for upstream, so I'll ask. But right now I'd say if you
estimate that enabling this feature isn't worth it, turn it
off. If the further argument is that no one should have it enabled by
default, then I think there's a somewhat heavier burden to make a
compelling argument.
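
For reference, the CPU pressure information PSI already exposes looks
like this (the numbers are illustrative):

cat /proc/pressure/cpu
some avg10=12.34 avg60=8.21 avg300=3.05 total=123456789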


_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure



