Hi,
On 2/14/21 2:20 PM, Chris Murphy wrote:
On Sat, Feb 13, 2021 at 9:45 PM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
Hi,
On 2/11/21 11:05 PM, Chris Murphy wrote:
On Thu, Feb 11, 2021 at 9:58 AM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
Hi,
On 1/1/21 8:59 PM, Chris Murphy wrote:
Anyway, compress=zstd:1 is a good default. Everyone benefits, and I'm
not even sure someone with a very fast NVMe drive will notice a
slowdown, because the compression/decompression is threaded.
I disagree that everyone benefits. Any read-latency-sensitive workload
will be slower, because the application-visible latency becomes the drive
latency plus the decompression latency. And as the kernel benchmarks
indicate, very few systems are going to get anywhere near the performance
of even baseline NVMe drives when it comes to throughput.
It's possible some workloads on NVMe might have faster reads or writes
without compression.
https://github.com/facebook/zstd
btrfs compress=zstd:1 translates into zstd -1 right now; there are
some ideas to remap btrfs zstd:1 to one of the newer zstd --fast
options, but for now it's just an idea. And in any case the defaults for
btrfs and zstd will remain 3 and -3 respectively, which is what
'compress=zstd' maps to, making it identical to 'compress=zstd:3'.
I have a laptop with NVMe and haven't come across such a workload so
far, but this is obviously not a scientific sample. I think you'd need
a process that's producing read/write rates that the storage can meet,
but that the compression algorithm limits. Btrfs is threaded, as is
the compression.
What's typical is no change in performance and sometimes a small
increase in performance. It essentially trades some CPU cycles
in exchange for less IO. That includes less time reading and writing,
but also less latency, meaning the gain on rotational media is
greater.
Worse, if the workload is very parallel and already at max CPU,
the compression overhead will only make that situation worse. (I
suspect you could test this just by building some packages that have
good parallelism during the build.)
This is compiling the kernel on a 4/8-core CPU (circa 2011) using make
-j8; the kernel running is 5.11-rc7.
no compression
real 55m32.769s
user 369m32.823s
sys 35m59.948s
------
compress=zstd:1
real 53m44.543s
user 368m17.614s
sys 36m2.505s
That's a one-time test, and it's a ~3% improvement. *shrug* We don't
really care too much these days about 1-3% differences when doing
encryption, so I think this is probably in that ballpark, even if it
turns out another compile is 3% slower. This is not a significantly
read- or write-centric workload; it's mostly CPU. So this 3% difference
may not even be related to the compression.
Did you drop caches/etc between runs?
Yes. And I also did the test with uncompressed source files when
compiling without the compress mount option, and compressed source
files when compiling with the compress mount option. While it's
possible to mix those around (there are four combinations), I kept them
the same since those are the most common.
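Roughly, each cold-cache run looks like this (the tree location and job
count are just what fits the machine):

  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page/dentry/inode caches
  cd ~/linux && time make -j8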
Because I git cloned mainline, copied the Fedora kernel config from
/boot, and then, on a fairly recent laptop (12 threads) with a
software-encrypted NVMe, dropped caches and did a timed make against a
compressed directory and an uncompressed one, with both a
semi-constrained (4G) setup and a 32G RAM setup (compressed swapping
disabled, because the machine has an encrypted swap for hibernation and
crashdumps).
compressed:
real 22m40.129s
user 221m9.816s
sys 23m37.038s
uncompressed:
real 21m53.366s
user 221m56.714s
sys 23m39.988s
uncompressed 4G ram:
real 28m48.964s
user 288m47.569s
sys 30m43.957s
compressed 4G ram:
real 29m54.061s
user 281m7.120s
sys 29m50.613s
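For what it's worth, how much of the tree actually ended up compressed in
the compressed runs can be checked with the compsize tool (the path is
just wherever the build tree lives):

  sudo compsize ~/linux     # reports on-disk vs uncompressed sizes per algorithm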
While the feature page doesn't claim it always increases performance,
it also doesn't say it can reduce performance. In CPU-intensive
workloads, it stands to reason there's going to be some competition.
The above results strongly suggest that's what's going on, even if I
couldn't reproduce it. But performance gain/loss isn't the only factor
for consideration.
And that is not an IO-constrained workload; it's generally CPU
constrained. And since the caches are warm due to the software
encryption, the decompression times should be much faster than on
machines that aren't doing that kind of cache stashing.
I don't know, so I can't confirm or deny any of that.
The machine above can actually peg all 6 cores until it hits thermal
limits simply doing cp's with btrfs/zstd compression, all the while
losing about 800MB/sec of IO bandwidth relative to the raw disk. Turning
an IO-bound problem into a CPU-bound one isn't ideal.
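If anyone wants to reproduce that, something along these lines shows it;
the mount points are placeholders, and watching iostat/top alongside is
the interesting part:

  sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
  time cp -a /usr /mnt/compressed/usr-copy; sync
  sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
  time cp -a /usr /mnt/uncompressed/usr-copy; sync
  # run "iostat -xm 1" in another terminal for throughput, "top" for CPU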
It's a set of tradeoffs. And there isn't a governor that can assess
when an IO-bound bottleneck becomes a CPU-bound one.
Compressed disks only work in the situation where the CPUs can
compress/decompress faster than the disk, or the workload is managing to
significantly reduce IO because the working set is streaming rather than
random.
This isn't sufficiently qualified. It does work to reduce space
consumption and write amplification. It's just that there's a tradeoff
you dislike: CPU cycles in exchange for IO reduction. And it's completely
reasonable to have a subjective position on this tradeoff. But no
matter what, there is a consequence to the choice.
IO reduction in some cases (see below), in exchange for additional read
latency and an increase in CPU utilization.
For a desktop workload the former is likely the larger problem. But as we
all know, sluggishness is a hard thing to measure on a desktop. QD1
pointer chasing on disk is a good approximation, though; sometimes boot
times are too.
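Something like a QD1 4KiB random-read fio job is a reasonable stand-in
for that kind of pointer chasing (the file path and size are arbitrary):

  sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
  fio --name=qd1 --filename=/mnt/testfile --size=4G --rw=randread \
      --bs=4k --iodepth=1 --runtime=60 --time_based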
Any workload which has a random read component to it and tends toward
page-sized reads/writes is going to get hurt, and god help you if it's
a RMW cycle.
Why?
Note that not everything gets compressed. There is an estimation of
whether compression is worth it, because it's not worth doing
compression on a small amount of data unless it means a reduction in
used blocks. i.e. if 6KiB can be compressed such that it involves
writing to one 4KiB sector instead of two, then the compression
happens. If it's 4KiB of data that can't be compressed, it isn't
attempted and is written in a 4KiB sector. Since the minimum block
size is 4KiB (on x86), the only way it would be compressed is if the
compression means it'll be written as an inline extent in the metadata
leaf along with its inode, i.e. 2048 bytes or less.
A larger file might have a mix of compressed and non-compressed
extents, based on this "is it worth it" estimate. This is the
difference between the compress and compress-force options, where
force drops this estimator and depends on the compression algorithm to
do that work. I sometimes call that estimator the "early bailout"
check.
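A quick way to see that behavior is to write a compressible and an
incompressible file on a compress-mounted filesystem and look at the
extent flags; the paths are just examples, and compressed extents show
up with the "encoded" flag in filefrag:

  yes | head -c 6144 > /mnt/compressible      # 6KiB of trivially compressible data
  head -c 6144 /dev/urandom > /mnt/random     # 6KiB that won't compress
  sync
  filefrag -v /mnt/compressible /mnt/random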
Compression estimation is its own ugly ball of wax. But ignoring that
for the moment, consider what happens if you have a bunch of 2G database
files with a reasonable compression ratio. Let's assume for a moment the
database attempts to update records in the middle of the files. What
happens when the compression ratio gets slightly worse? (It's likely you
already have nodatacow.) The usual result is that what should be a
mostly sequential file starts to become fragmented. If it's fragmenting
on every update, what seems like an edge case suddenly becomes a serious
problem. That is an entirely different form of write amplification.
The alternative is to attempt to recompress/rewrite fairly large parts of
the stream. I'm not sure if pgbench is "compression aware" or if its
data is realistic enough, but it might be entertaining to see what
happens between nodatacow and compress. Although this becomes a case of
seeing if the "compression estimation" logic is smart enough to detect
that it's causing poor IO patterns (while still having a reasonably good
compression ratio).
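If someone wants to poke at that, a rough sketch (the scale, client
counts, and data directory are just placeholders, and note that
chattr +C only applies to files created after it's set):

  chattr +C /var/lib/pgsql                 # nodatacow for new files; also disables compression
  createdb bench
  pgbench -i -s 100 bench                  # initialize the test tables
  pgbench -c 8 -j 4 -T 300 bench           # run for 5 minutes
  # then compare extent counts against a run on a compress-mounted data directory
  filefrag /var/lib/pgsql/data/base/*/* | sort -t: -k2 -n | tail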
In a past life, I spent a not inconsequential part of a decade
engineering compressed RAM+storage systems (similar to what has been
getting merged to mainline over the past few years). It's really hard to
make one that is performant across a wide range of workloads. What you
get are areas where it can help, but if you average those cases with the
ones where it hurts, the overwhelming conclusion is that you shouldn't be
compressing unless you want the capacity. The worst part is that most
synthetic file IO benchmarks tend to be on the "it helps" side of the
equation and the real applications on the other.
IMHO if Fedora wanted to take a hit on the IO perf side, a much better
place to focus would be flipping encryption on. The perf profile is
flatter (AES-NI & the Arm crypto extensions are common) with fewer evil
edge cases. Or a more controlled method might be to pick a couple of
fairly atomic directories and enable compression there (say /usr); a
rough sketch of that is below.
Similarly for parallelized compression, which only scales if the IO sizes
are large enough that it's worth the IPI overhead of bringing additional
cores online, and the resulting chunks are still large enough to be dealt
with individually.
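For the per-directory idea, something like this is what I have in mind
(the directory is just an example, and the property only affects newly
written data, so existing files need a defragment pass to actually be
recompressed):

  sudo btrfs property set /usr compression zstd
  sudo btrfs filesystem defragment -r -czstd /usr   # rewrite existing files compressed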
I don't know if it's possible for PSI (CPU pressure, and separately
context switches) to be used to decide to inhibit compression entirely,
including the estimation of whether it's worth it. That'd be a
question for upstream, so I'll ask. But right now I'd say that if you
estimate that enabling this feature isn't worth it, turn it
off. If the further argument is that no one should have it enabled by
default, then I think there's a somewhat heavier burden to make a
compelling argument.
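For reference, the CPU pressure numbers PSI already exposes look like
this; avg10/avg60/avg300 are the share of time over those windows in
which runnable tasks were stalled waiting for a CPU:

  cat /proc/pressure/cpu
  # some avg10=1.23 avg60=0.87 avg300=0.50 total=123456789   (example output format)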