Re: Fedora 34 Change: Enable btrfs transparent zstd compression by default (System-Wide Change proposal)

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 14 Feb 2021 13:20:44 -0700

On Sat, Feb 13, 2021 at 9:45 PM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
>
> Hi,
>
> On 2/11/21 11:05 PM, Chris Murphy wrote:
> > On Thu, Feb 11, 2021 at 9:58 AM Jeremy Linton <jeremy.linton@xxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> On 1/1/21 8:59 PM, Chris Murphy wrote:
> >
> >>> Anyway, compress=zstd:1 is a good default. Everyone benefits, and I'm
> >>> not even sure someone with a very fast NVMe drive will notice a slow
> >>> down because the compression/decompression is threaded.
> >>
> >> I disagree that everyone benefits. Any read latency sensitive workload
> >> will be slower due to the application latency being both the drive
> >> latency plus the decompression latency. And as the kernel benchmarks
> >> indicate very few systems are going to get anywhere near the performance
> >> of even baseline NVMe drives when it comes to throughput.
> >
> > It's possible some workloads on NVMe might have faster reads or writes
> > without compression.
> >
> > https://github.com/facebook/zstd
> >
> > btrfs compress=zstd:1 translates into zstd -1 right now; there are
> > some ideas to remap btrfs zstd:1 to one of the newer zstd --fast
> > options, but it's just an idea. And in any case the default for btrfs
> > and zstd will remain as 3 and -3 respectively, which is what
> > 'compress=zstd' maps to, making it identical to 'compress=zstd:3'.
> >
> > I have a laptop with NVMe and haven't come across such a workload so
> > far, but this is obviously not a scientific sample. I think you'd need
> > a process that's producing read/write rates that the storage can meet,
> > but that the compression algorithm limits. Btrfs is threaded, as is
> > the compression.
> >
> > What's typical is no change in performance and sometimes a
> > small increase in performance. It essentially trades some CPU cycles
> > in exchange for less IO. That includes less time reading and writing,
> > but also less latency, meaning the gain on rotational media is
> > greater.
> >
> >> Worse, if the workload is very parallel, and at max CPU already
> >> the compression overhead will only make that situation worse as well. (I
> >> suspect you could test this just by building some packages that have
> >> good parallelism during the build).
> >
> > This is compiling the kernel on a 4/8-core CPU (circa 2011) using make
> > -j8, the kernel running is 5.11-rc7.
> >
> > no compression
> >
> > real    55m32.769s
> > user    369m32.823s
> > sys     35m59.948s
> >
> > ------
> >
> > compress=zstd:1
> >
> > real    53m44.543s
> > user    368m17.614s
> > sys     36m2.505s
> >
> > That's a one time test, and it's a ~3% improvement. *shrug* We don't
> > really care too much these days about 1-3% differences when doing
> > encryption, so I think this is probably in that ballpark, even if it
> > turns out another compile is 3% slower. This is not a significantly
> > read or write centric workload, it's mostly CPU. So this 3% difference
> > may not even be related to the compression.
>
> Did you drop caches/etc between runs?

Yes. I also matched the source files to the mount option: uncompressed
source files when compiling without the compress mount option, and
compressed source files when compiling with it. It's possible to mix
those around (there are four combinations), but I kept them matched
since those are the most common cases.

> Because I git cloned mainline,
> copied the fedora kernel config from /boot and on a fairly recent laptop
> (12 threads) with a software encrypted NVMe. Dropped caches and did a
> time make against a compressed directory and an uncompressed one with
> both a semi constrained (4G) setup and 32G ram setup (compressed
> swapping disabled, because the machine has an encrypted swap for
> hibernation and crashdumps).
>
> compressed:
> real    22m40.129s
> user    221m9.816s
> sys     23m37.038s
>
> uncompressed:
> real    21m53.366s
> user    221m56.714s
> sys     23m39.988s
>
> uncompressed 4G ram:
> real    28m48.964s
> user    288m47.569s
> sys     30m43.957s
>
> compressed 4G
> real    29m54.061s
> user    281m7.120s
> sys     29m50.613s
>

While the feature page doesn't claim it always increases performance,
it also doesn't say it can reduce performance. In CPU intensive
workloads, it stands to reason the compression threads are going to
compete with the workload for CPU time. The above results suggest
that's what's going on, even if I couldn't reproduce it. But
performance gain/loss isn't the only factor for consideration.
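
To put the numbers quoted in this thread side by side, here is a minimal
Python sketch that converts the "real" values from each `time` run into
seconds and reports the relative change when compression is enabled. The
helper names (parse_time, pct_change) and the run labels are my own, and
I'm assuming the first compressed/uncompressed pair above is the 32G
setup. These are single runs, so treat the output as rough arithmetic,
not a benchmark.

```python
import re

def parse_time(stamp: str) -> float:
    """Convert a shell `time` value such as '55m32.769s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def pct_change(uncompressed: str, compressed: str) -> float:
    """Wall-clock change in percent; negative means compression was faster."""
    base = parse_time(uncompressed)
    return (parse_time(compressed) - base) / base * 100.0

# (uncompressed, compressed) "real" times copied from the thread.
# The labels are descriptive only; the 32G label is an assumption.
RUNS = {
    "2011 4-core/8-thread, make -j8": ("55m32.769s", "53m44.543s"),
    "12-thread laptop, 32G (assumed)": ("21m53.366s", "22m40.129s"),
    "12-thread laptop, 4G": ("28m48.964s", "29m54.061s"),
}

if __name__ == "__main__":
    for label, (uncompressed, compressed) in RUNS.items():
        print(f"{label}: {pct_change(uncompressed, compressed):+.2f}%")
```

That works out to roughly 3% faster with compression on the older
machine and between 3% and 4% slower on the laptop in both memory
configurations, i.e. the differences are of similar size in both
directions.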

> and that is not an IO constrained workload, it's generally CPU
> constrained, and since the caches are warm due to the software
> encryption the decompress times should be much faster than machines that
> aren't cache stashing.

I don't know, so I can't confirm or deny any of that.

> The machine above, can actually peg all 6 cores until it hits thermal
> limits simply doing cp's with btrfs/zstd compression, all the while
> losing about 800MB/sec of IO bandwidth over the raw disk. Turning an IO
> bound problem into a CPU bound one isn't ideal.

It's a set of tradeoffs. And there isn't a governor that can assess
when an IO bound bottleneck becomes a CPU bound one.

> Compressed disks only work in the situation where the CPUs can
> compress/decompress faster than the disk, or the workload is managing to
> significantly reduce IO because the working set is streaming rather than
> random.

This isn't sufficiently qualified. Compression does work to reduce
space consumption and write amplification regardless. It's just that
you dislike the tradeoff, which is spending CPU cycles in exchange for
IO reduction. And it's completely reasonable to have a subjective
position on this tradeoff. But no matter what, there is a consequence
to the choice.

> Any workload which has a random read component to it and is
> tending closer to page sized read/writes is going to get hurt, and god
> help if it's an RMW cycle.

Why?

Note that not everything gets compressed. There is an estimation of
whether compression is worth it, because it's not worth doing
compression on a small amount of data unless it means a reduction in
used blocks. For example, if 6KiB can be compressed such that it
involves writing to one 4KiB sector instead of two, then the
compression happens. If it's 4KiB of data that can't be compressed,
compression isn't attempted and it is written in a 4KiB sector. Since
the minimum block size is 4KiB (on x86), the only way it would be
compressed is if the compression means it'll be written as an inline
extent in the metadata leaf along with its inode, i.e. 2048 bytes or
less.
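
As a rough illustration of the block arithmetic above, here is a small
Python sketch. It is not the kernel's estimator: the function name and
the idea of passing in an already known compressed size are assumptions
for the example, whereas a real estimator has to decide before doing
the compression. The 4KiB sector and 2048 byte inline figures come from
the paragraph above.

```python
import math

SECTOR = 4096        # minimum block size on x86, as stated above
INLINE_LIMIT = 2048  # inline extent threshold, as stated above

def sectors(nbytes: int) -> int:
    """Number of 4KiB sectors needed to hold nbytes."""
    return math.ceil(nbytes / SECTOR)

def compression_pays_off(raw_len: int, compressed_len: int) -> bool:
    """Hypothetical check: does compressing reduce the blocks written?"""
    if compressed_len >= raw_len:
        return False  # nothing gained at all
    if raw_len <= SECTOR:
        # A single sector can't shrink to fewer sectors; the only win is
        # fitting inline in the metadata leaf next to the inode.
        return compressed_len <= INLINE_LIMIT
    return sectors(compressed_len) < sectors(raw_len)

if __name__ == "__main__":
    print(compression_pays_off(6144, 4000))  # 6KiB into one sector
    print(compression_pays_off(6144, 5000))  # still two sectors
    print(compression_pays_off(4096, 3000))  # one sector either way
    print(compression_pays_off(4096, 2048))  # small enough to go inline
```

The 6KiB case only pays off when the result fits in one sector, and the
4KiB case only pays off when the result is small enough to be stored
inline.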

A larger file might have a mix of compressed and non-compressed
extents, based on this "is it worth it" estimate. This is the
difference between the compress and compress-force options, where
force drops this estimator and depends on the compression algorithm to
do that work. I sometimes call that estimator the "early bailout"
check.

> Similarly for parallelized compression, which
> is only scalable if the IO sizes are large enough that it's worth the IPI
> overhead of bringing additional cores online and the resulting chunks
> are still large enough to be dealt with individually.

I don't know if it's possible to use PSI (CPU pressure, and separately
context switches) to decide to inhibit compression entirely, including
the estimation of whether it's worth it. That'd be a question for
upstream, so I'll ask. But right now I'd say if you estimate that
enabling this feature isn't worth it, turn it off. If the further
argument is that no one should have it enabled by default, then I
think there's a somewhat heavier burden to make a compelling argument.
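
To make the PSI idea a little more concrete, here is a userspace-only
Python sketch that parses the kernel's /proc/pressure/cpu format and
applies a threshold. To be clear, nothing like this exists in Btrfs as
far as I know; the function names and the threshold value are made up
for illustration, and whether the kernel could or should do something
similar is exactly the open question for upstream.

```python
from pathlib import Path

def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text, e.g. 'some avg10=1.50 avg60=0.75 avg300=0.10 total=12345'."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result

def should_inhibit_compression(psi: dict[str, dict[str, float]],
                               threshold: float = 50.0) -> bool:
    """Hypothetical policy: skip compression when the 10s CPU pressure
    average is at or above the threshold (an arbitrary example value)."""
    return psi.get("some", {}).get("avg10", 0.0) >= threshold

def read_cpu_pressure(path: str = "/proc/pressure/cpu") -> dict[str, dict[str, float]]:
    """Read live CPU pressure; requires a kernel with PSI enabled."""
    return parse_psi(Path(path).read_text())

if __name__ == "__main__":
    sample = "some avg10=62.10 avg60=40.00 avg300=12.50 total=987654321\n"
    print(should_inhibit_compression(parse_psi(sample)))
```

The sample line is invented; on a real system read_cpu_pressure()
would supply the input instead.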

-- 
Chris Murphy