On 21.01.2025 22:26, Dave Chinner wrote:
On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
On 21.01.2025 05:01, Theodore Ts'o wrote:
On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
Are there any plans to include transparent compression with ext4 (especially
with zstd)?
I'm not aware of anyone in the ext4 development community working on
something like this. Fully transparent compression is challenging,
since supporting random writes into a compressed file is tricky.
There are solutions (for example, the Stac patent, which resulted in
Microsoft paying $120 million), but even ignoring the
intellectual property issues, they tend to compromise the efficiency
of the compression.
More to the point, given how cheap byte storage tends to be (dollars
per IOPS tend to be far more of a constraint than dollars per GB),
it's unclear what the business case would be for any company to fund
development work in this area, when the cost of a slightly larger HDD
or SSD is going to be far cheaper than the necessary software
engineering investment needed, even for a hyperscaler cloud company
(and even there, it's unclear that transparent compression is really
needed).
What is the business and/or technical problem which you are trying to
solve?
Regarding necessity:
In some scenarios we are talking about saving disk space by a substantial
factor. E.g. in my database scenario with PostgreSQL around 85% of disk
space can be saved (roughly a factor of 7).
So use a database that has built-in data compression capabilities.
E.g. MySQL has transparent table compression functionality.
This requires sparse files and FALLOC_FL_PUNCH_HOLE support in the
filesystem, but there is no need for any special filesystem side
support for data compression to get space gains of up to 75% on
compressible data sets with the default database (16kB record size)
and filesystem configs (4kB block size).
The argument that "application level compression is hard, so we want
the filesystem to do it for us" ignores the fact that it is -much
harder- to do efficient compression in the filesystem than at the
application level.
The OS and filesystem don't have the freedom to control
application level data access patterns nor tailor the compression
algorithms to match how the application manages data, so everything
the filesystem implements is a compromise. It will never be optimal
for any given workload, because we have to make sure that it is
not complete garbage for any given workload...
MySQL/MariaDB isn't an option for me, but I will look into this.
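For my own reference, the application-side pattern looks roughly like
this: compress a record, write the smaller result, then punch out the
blocks it no longer needs with fallocate(2). This is only a minimal
sketch; the file path, offsets and record size are made up for
illustration.

/* Minimal sketch: release the unused blocks behind a 16kB database
 * record whose compressed form only needs the first 4kB. Assumes a
 * 4kB filesystem block size; path, offsets and sizes are illustrative
 * only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/lib/db/table.dat", O_RDWR); /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t record_off  = 16 * 4096; /* start of the 16kB record               */
    off_t compressed  = 4096;      /* bytes actually used after compression  */
    off_t record_size = 16384;     /* on-disk record size                    */

    /* Punch out the blocks the compressed record no longer needs; the
     * file size stays the same, but the blocks go back to the fs. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  record_off + compressed, record_size - compressed) < 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}

That is essentially what the "16kB record size on 4kB blocks" example
above relies on: the holes are what turn into the up-to-75% space gain.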
In cloud usage scenarios you can easily reduce the amount of allocated
disk space by around a factor of 7 and therefore reduce cost.
Same argument: cloud applications should be managing their data
sets appropriately and efficiently, not relying on the cloud storage
infrastructure to magically do stuff to "reduce costs" for them.
Remember: there's a massive conflict of interest on the vendor side
here - the less efficient the application (be it CPU, RAM or storage
capacity), the more money the cloud vendor makes from users running
that application. Hence they have little motivation to provide
infrastructure or application functionality that costs them money to
implement and has the impact of reducing their overall revenue
stream...
Right, therefore we want to make the storage usage as small as possible,
either at the application level or the filesystem level.
You might also get a performance boost by using caching mechanisms more
efficiently (e.g. using less RAM).
Not true. Linux caches uncompressed data in the page cache - caching
compressed data will significantly increase the memory footprint and
CPU consumption as it has to be constantly decompressed and
recompressed as the data changes. This is not a viable caching
strategy for a general purpose OS.
AFAIK ZFS caches compressed data in the ARC cache. zstd really has a
very low overhead on decompression with a very good compression ratio
(even better than gz and bz2).
Also with precompressed files (e.g. photos, videos) you can save around 5-10%.
Video and photos do not compress sufficiently to be a viable runtime
compression target for filesystem based compression. It's a massive
waste of resources to attempt compression of internally compressed
data formats for anything but cold data storage. And even then, if
it's cold storage then the data should be compressed and checksummed
by the cold storage application before it is written to the
filesystem.
With zstd, ZFS uses the LZ4 "early abort" feature, which detects with very
low CPU usage that compression is not worthwhile, aborts the
compression and stores the block uncompressed. If LZ4 doesn't abort early,
zstd compression is used. So there are solutions for low resource usage.
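Roughly sketched with the lz4 and zstd C libraries (this is only an
illustration of the idea, not the actual OpenZFS code path; the abort
threshold and the zstd level 3 are assumptions for the sketch):

/* Simplified "early abort" sketch: use a cheap LZ4 pass to decide
 * whether a block is compressible at all, and only then spend CPU on
 * zstd. Illustration only, not ZFS's actual heuristic. */
#include <lz4.h>
#include <zstd.h>
#include <stdlib.h>

/* Returns the compressed size stored in *dst, or 0 if the block should
 * be stored uncompressed. Caller frees *dst on success. */
size_t compress_block(const char *src, size_t src_size, char **dst)
{
    /* Cheap probe: if LZ4 can't shave at least 1/8th off the block
     * (arbitrary threshold for this sketch), assume the data is
     * incompressible and abort early. */
    int bound = LZ4_compressBound((int)src_size);
    char *probe = malloc(bound);
    if (!probe)
        return 0;

    int lz4_size = LZ4_compress_default(src, probe, (int)src_size, bound);
    free(probe);
    if (lz4_size <= 0 || (size_t)lz4_size > src_size - src_size / 8)
        return 0;                   /* early abort: store uncompressed */

    /* Data looks compressible: do the real compression with zstd-3. */
    size_t zbound = ZSTD_compressBound(src_size);
    char *out = malloc(zbound);
    if (!out)
        return 0;

    size_t zsize = ZSTD_compress(out, zbound, src, src_size, 3);
    if (ZSTD_isError(zsize) || zsize >= src_size) {
        free(out);                  /* no gain: store uncompressed */
        return 0;
    }
    *dst = out;
    return zsize;
}

The point of the design is that incompressible data (photos, videos)
only ever pays the cheap LZ4 probe, not the full zstd cost.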
Regarding ratios: in my case 3%:
zfs list -o name,compressratio,compression big/shares/fotovideo
NAME                  RATIO  COMPRESS
big/shares/fotovideo  1.03x  zstd-3
The technical topic is that IMHO no stable and practically usable Linux
filesystem with transparent compression is included in the default kernel:
- ZFS works but is not included in the default kernel
- BTRFS has stability and repair issues (see mailing lists) and bugs with
  compression (it does not compress on the fly in some scenarios)
I hear this sort of generic "btrfs is not stable/has bugs" complaint
as a reason for not using btrfs all the time.
That's my practical experience. I tried BTRFS several times and it failed
in testing and in production. I had a storage incident where several
thousand 4k blocks were damaged, with several VMs running on top.
All other filesystems (XFS, ext4, ZFS, UFS2, ...) except BTRFS and bcachefs
(which is experimental) were repairable to a consistent state (of course
with some blocks lost).
You can repair BTRFS "forever" without getting it into a consistent state.
A friend of mine also had the experience that it was not mountable and
crashed immediately after a reboot ...
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
I hear just as many, if not more, generic "XFS is unstable and loses
data" claims as a reason for not using XFS, too.
That has not been my experience. But I primarily try to use ext4, as it is
best for "repair" scenarios.
Anecdotal claims are not proof of fact, and I don't see any real
evidence that btrfs is unstable. e.g. Fedora has been using btrfs
as the root filesystem (and has for quite a while now) and there has
been no noticeable increase in bug reports (either for fs
functionality or data loss) compared to when ext4 or XFS was used as
the default filesystem type...
Those are not anecdotal claims; it is my practical experience that BTRFS
is not stable and not repairable to a consistent state. It is reproducible,
you can try it for yourself.
I have been using Fedora since FC1 for all production systems.
IOWs, I redirect generic "btrfs is unstable" complaints to /dev/null
these days, just like I do with generic "XFS is unstable"
complaints.
Try it and you will see that it is not repairable. You can find the
details and a testcase (a simulation of what happened to me, by
overwriting random blocks) in the link.
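The simulation is roughly of this shape (a sketch only; the device path
and block count are assumptions here, the exact testcase is in the
linked thread): overwrite a number of randomly chosen 4k blocks of a
test device with random data, then try to repair the filesystem.

/* Rough sketch of the damage simulation: overwrite randomly chosen 4kB
 * blocks of a (test!) block device or image with random data. Device
 * path and block count are assumed for illustration only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 4000          /* "several thousand 4k blocks" */

int main(void)
{
    const char *dev = "/dev/vgtest/btrfs-test"; /* hypothetical test device */
    int fd = open(dev, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t dev_size = lseek(fd, 0, SEEK_END);
    off_t total_blocks = dev_size / BLOCK_SIZE;
    if (total_blocks <= 0) {
        close(fd);
        return 1;
    }

    unsigned char buf[BLOCK_SIZE];
    srand((unsigned)time(NULL));
    for (int i = 0; i < NUM_BLOCKS; i++) {
        for (int j = 0; j < BLOCK_SIZE; j++)
            buf[j] = (unsigned char)rand();

        /* Note: rand() limits coverage on very large devices; good
         * enough for a sketch. */
        off_t block = (off_t)(rand() % total_blocks);
        if (pwrite(fd, buf, BLOCK_SIZE, block * BLOCK_SIZE) != BLOCK_SIZE)
            perror("pwrite");
    }

    close(fd);
    return 0;
}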
As with Fedora, I'm using the latest "fresh" stable kernel versions as
well as filesystem utilities. I still have that "unrepairable" original
BTRFS filesystem and try to repair it to a consistent state from time to
time. So far without success.
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
So you shouldn't redirect the complaints to /dev/null, to help make BTRFS better :-)
Thnx.
Ciao,
Gerhard