Re: Transparent compression with ext4 - especially with zstd

MySQL, MariaDB and PostgreSQL do their own schema- and page-size-aware compression.  Why not let the databases do this?  They are in a better position to do it and to trade off the costs where and when it matters to them.
-- 
Oleg Kiselev 




On 1/21/25, 11:35, "Theodore Ts'o" <tytso@xxxxxxx> wrote:


On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
> We are talking in some scenarios about some factors of diskspace. E.g. in
> my database scenario with PostgreSQL around 85% of disk space can be saved
> (e.g. around factor 7).


So the problem with using compression with databases is that they
need to be able to do random writes into the middle of a file. That
means you need to use tricks such as writing into compression
clusters, typically 32k or 64k, which in turn means that a single 4k
random write gets amplified into a 32k or 64k write.
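
To make that concrete, here is a minimal sketch of the
read-decompress-modify-recompress-rewrite cycle that a compression
cluster forces on a small random write. It is purely illustrative
(plain zlib, invented cluster and page sizes), not how ext4 or any
real implementation would do it:

import zlib

CLUSTER_SIZE = 64 * 1024   # assumed compression cluster size
PAGE_SIZE = 4 * 1024       # one database page / one random write

def rewrite_page_in_cluster(compressed_cluster, page_index, new_page):
    # Even though only 4k of logical data changed, the whole cluster
    # has to be read, decompressed, patched, recompressed and written
    # back -- that is the write amplification described above.
    data = bytearray(zlib.decompress(compressed_cluster))
    off = page_index * PAGE_SIZE
    data[off:off + PAGE_SIZE] = new_page
    return zlib.compress(bytes(data))

cluster = zlib.compress(bytes(CLUSTER_SIZE))
cluster = rewrite_page_in_cluster(cluster, 3, b"\x42" * PAGE_SIZE)
print(f"logical write: {PAGE_SIZE} bytes, "
      f"physical rewrite: a whole {CLUSTER_SIZE} byte cluster")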


> In cloud usage scenarios you can easily reduce that amount of allocated
> diskspace by around a factor 7 and reduce cost therefore.


If you are running this on a cloud platform, where you are limited
(on GCE) or charged (on AWS) by IOPS and throughput, this can be a
performance bottleneck (or cost you extra). At a minimum, the extra
I/O throughput will very likely show up on various performance
benchmarks.
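
Some rough back-of-the-envelope arithmetic (every number below is an
assumption, made up purely for illustration) shows how quickly the
amplification eats into a provisioned throughput budget:

# Back-of-the-envelope only; all of the numbers are assumptions.
random_writes_per_sec = 5_000          # hypothetical OLTP write rate
page = 4 * 1024                        # database page size
cluster = 64 * 1024                    # compression cluster size

plain_mb_s = random_writes_per_sec * page / 1e6
clustered_mb_s = random_writes_per_sec * cluster / 1e6
print(f"uncompressed: ~{plain_mb_s:.0f} MB/s of writes")
print(f"64k clusters: ~{clustered_mb_s:.0f} MB/s, i.e. "
      f"{cluster // page}x the throughput you are billed or throttled for")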


Worse, using transparent compression breaks the ACID properties of
the database. If you crash or have a power failure while rewriting
the 64k compression cluster, all or part of that 64k compression
cluster can be corrupted. And if your customers care about (their)
data integrity, the fact that you cheaped out on disk space might not
be something that would impress them terribly.
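
Here is a tiny sketch of why a torn cluster rewrite is so nasty
(again just zlib and made-up sizes, simulating a power cut that
leaves half of the new compressed image and half of the old one on
disk):

import zlib

CLUSTER = 64 * 1024
old = zlib.compress(b"A" * CLUSTER)
new = zlib.compress(b"B" * CLUSTER)

# Power fails halfway through the in-place rewrite: the front of the
# new compressed image made it to disk, the tail is still the old one.
torn = new[:len(new) // 2] + old[len(new) // 2:]

try:
    zlib.decompress(torn)
    print("decompressed, but the contents mix old and new pages")
except zlib.error as e:
    print("cluster is now unreadable:", e)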


The short version is that transparent compression is not free, even
if you ignore the SWE development costs of implementing such a
feature and then getting it fit for use in an enterprise use case.
No matter what file system you might want to use, I *strongly*
suggest that you get a power-fail rack, put the whole stack on it,
and try dropping power while running a stress test --- over, and
over, and over again. What you find might surprise you.
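
For what it's worth, the writer half of such a stress test can be as
simple as the sketch below: append fixed-size records carrying a
sequence number and a CRC, fsync each one, drop power, and then have
a checker (omitted here) replay the file and report what actually
survived. The path is just a placeholder for whatever mount point is
under test:

import os, struct, zlib

REC = 4096
path = "/mnt/test/powerfail.dat"   # placeholder: the mount under test

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
seq = 0
while True:                        # run until the power is dropped
    payload = os.urandom(REC - 8)
    rec = struct.pack("<I", seq) + payload
    rec += struct.pack("<I", zlib.crc32(rec))
    os.write(fd, rec)
    os.fsync(fd)                   # everything fsync'ed should survive intact
    seq += 1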


> The technical topic is that IMHO no stable and practical usable Linux
> filesystem which is included in the default kernel exists.
> - ZFS works but is not included in the default kernel
> - BTRFS has stability and repair issues (see mailing lists) and bugs with
> compression (does not compress on the fly in some scenarios)
> - bcachefs is experimental


When I started work at Google 15 years ago to deploy ext4 into
production, we did precisely this, as well as deploying to a small
percentage of Google's test fleet to do A:B comparisons before we
deployed to the entire production fleet.


Whether or not it is "practical" and "usable" depends on your
definition, I guess, but from my perspective "stable" and "not losing
users' data" is job #1.


But hey, if it's worth so much to you, I suggest you work out what it
would actually cost to implement the features you want so much, or
what it would cost to make the more complex file systems stable
enough for production use. You might decide that paying the extra
storage costs is way cheaper than the software engineering investment
involved. At Google, and when I was at IBM before that, we were
always super disciplined about trying to figure out the ROI of a
particular project and not just doing it because it was "cool".


There's a famous story about how the engineers working on ZFS didn't
ask for management's permission or input from the sales team before
they started. Sounds great, and there was some cool technology in
ZFS --- but note that Sun ended up having to put itself up for sale
because it was losing money...


Cheers,


- Ted


P.S. Note: using a compression cluster is the only real way to
support transparent compression if you are using an update-in-place
file system like ext4 or xfs. (And that is what was covered by the
Stac patents that I mentioned.)


If you are using a log-structured file system, such as ZFS, then you
can simply rewrite the compression cluster *and* update the file
system metadata to point at the new compression cluster --- but then
the garbage collection costs, and the file system metadata update
costs for each database commit, are *huge*, and the I/O throughput
hit is even higher. So much so that ZFS recommends that you turn off
the log-structured write and do update-in-place if you want to use a
database on ZFS. But I'm pretty sure that doing so disables
transparent compression. TNSTAAFL.
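
A toy sketch of where those extra costs come from (the block map,
free list and counts are all invented; this is not how ZFS lays
anything out, just the shape of the trade-off):

def update_in_place(disk, block_map, cluster_id, new_data):
    # One data write; the logical-to-physical mapping is untouched.
    disk[block_map[cluster_id]] = new_data
    return 1   # I/Os: data only

def copy_on_write(disk, block_map, free_list, garbage, cluster_id,
                  new_data):
    # Write the new copy somewhere else, repoint the metadata, and
    # leave the old copy behind for the garbage collector to reclaim.
    new_block = free_list.pop()
    disk[new_block] = new_data
    garbage.append(block_map[cluster_id])   # dead space until GC runs
    block_map[cluster_id] = new_block       # metadata must also be written
    return 2   # I/Os: data plus (at least) one metadata block

disk, block_map = {0: b"old"}, {42: 0}
print(copy_on_write(disk, block_map, free_list=[1], garbage=[],
                    cluster_id=42, new_data=b"new"))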








