Re: agenda for today's QA meeting

On Tue, Jul 21, 2020 at 3:36 PM pmkellly@xxxxxxxxxxxx
<pmkellly@xxxxxxxxxxxx> wrote:
>
>
>
> On 7/21/20 13:11, Chris Murphy wrote:
> >
> > Yeah, lossy algorithms are common in imaging. There are many kinds
> > that unquestionably do not produce identical encoding to the original
> > once decompressed. The algorithms being used by Btrfs are all lossless
> > compression, and in fact those are also commonly used in imaging: LZO
> > and ZLIB (ZIP, i.e. deflate) - and in that case you can compress and
> > decompress images unlimited times and always get back identical RGB
> > encodings to the original. Short of memory or other hardware error.
> >
>
> At the risk of sounding skeptical, I've heard the word "lossless"
> applied to lots of algorithms and devices where I didn't think it was
> an appropriate usage. As an approximate example, when we were doing
> that testing we were hoping to find something in the neighborhood of
> a 10^-6 probability of a single byte error in a file of a certain
> structure and size when exercised a certain number of times. Sorry
> for being so vague. Is there any statistical data on these algorithms
> that is publicly available? The only ones I've ever seen (not a large
> population, since I've been a compression avoider) that approach
> lossless don't compress much and only take out strings of the same
> byte value.

A very simple example is run length encoding.
https://en.wikipedia.org/wiki/Run-length_encoding

That varies depending on the source, but quite a lot of human-produced
material has a metric F ton of zeros in it, so it turns out we get a
lot of compressibility. It's used by the current zram default
algorithm, together with lzo, which handles the more complex data. In
my testing this is typically a 3:1, upwards of 4:1, compression ratio,
with a conservative 2:1 stated in the proposal.
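
A minimal sketch of the idea in Python (my own toy code, nothing to do
with the actual lzo-rle implementation in the kernel):

import itertools

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    # Collapse each run of identical bytes into a (value, count) pair.
    return [(v, len(list(run))) for v, run in itertools.groupby(data)]

def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    # Expand each (value, count) pair back into the original run.
    return b"".join(bytes([v]) * count for v, count in pairs)

sample = b"\x00" * 4000 + b"header" + b"\x00" * 4000
encoded = rle_encode(sample)
assert rle_decode(encoded) == sample   # lossless: identical bytes back out
print(len(sample), "bytes collapse into", len(encoded), "runs")

Long runs of zeros collapse into a single (0, count) pair, which is
why zero-heavy data compresses so well.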

zstd is substantially more complex than lzo or zlib, and produces
compression ratios similar to zlib's but at a fraction of the CPU
cost. You can compress and decompress things all day long, for weeks
and months and years, and 100% of the time get back identical data,
bit for bit. That's the point of them. I can't really explain the
math, but zstd is free open source software, so it is possible to
inspect it.

https://github.com/facebook/zstd
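
If you want to convince yourself of the round-trip guarantee, it's
easy to test. zstd bindings for Python are a third-party package, so
this little sketch uses zlib from the standard library instead (the
same deflate family as Btrfs's zlib option), but the property is the
same for any lossless codec:

import os, zlib

original = os.urandom(1 << 20) + bytes(1 << 20)  # 1 MiB random + 1 MiB zeros
data = original
for _ in range(100):
    data = zlib.decompress(zlib.compress(data, 6))
assert data == original   # bit for bit identical after 100 round trips
print("compressed:", len(zlib.compress(original, 6)), "of", len(original), "bytes")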

JPEG compression, on the other hand, is intentionally lossy. It is
guaranteed that you do not get back the original data. It can still be
used in high-end imaging (and it is), but this is predicated on
limiting the number of times the image goes through JPEG compression -
otherwise you end up with obvious artifacts that degrade the image.
All of the lossy algorithms involve, by design, a kind of data loss.
That's the point of them: there's so much extraneous information in
imaging that quite a lot of it can just be tossed. But this also
assumes the final destination is some kind of limited-fidelity output:
displays, printers, printing presses, that sort of thing. So the loss
isn't actually perceptible, if you do it correctly anyway. Trouble is,
quite a lot of people take a JPEG, modify it, and then JPEG it again.
Which is known as "doing it wrong" - you need to go back to the
original image to make the modification, and then JPEG it. If you
don't have the original, well, then you're making other bad choices :)
Or maybe someone else is.
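
You can watch the generational loss happen. A rough sketch, assuming
the third-party Pillow library is installed (the 64x64 random image
and the quality setting are arbitrary choices of mine):

import io, os
from PIL import Image   # third-party Pillow, assumed installed

# A random RGB "original", so there's plenty of high-frequency detail.
original = Image.frombytes("RGB", (64, 64), os.urandom(64 * 64 * 3))

img = original
for _ in range(50):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75)   # lossy on every pass
    buf.seek(0)
    img = Image.open(buf)
    img.load()

changed = sum(a != b for a, b in zip(original.tobytes(), img.tobytes()))
print(changed, "of", len(original.tobytes()), "bytes differ after 50 generations")

Re-encoding the decoded result over and over is exactly the "doing it
wrong" workflow above, and the artifacts compound each time.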

I'm not offhand aware of any lossy compression algorithms that claim
to be lossless. The original JPEG is lossy. JPEG 2000 has both lossy
and lossless variants, but while it's produced by the same
organization, the encoding is entirely different. Anyway, short of
hardware defects, you can compress and decompress data or images with
lossless compression billions of times, until the heat death of the
universe, and get identical bits out. It's the same as 2+2=4 and
4=2+2: exactly the same information on both sides of the encoding.
Anything else is a hardware error, sunspots, cosmic rays, someone
making a mistake in testing, etc.


> Sorry, I have no knowledge of the history of btrfs; so please forgive me
> when I say or ask silly things.

I don't think asking questions is silly or a problem at all. It's the
jumping to conclusions that gave me the frowny face. :-)


> What's considered the metadata? Path to file, file name, file header,
> file footer, data layout?

In a file system context, the metadata is the file system itself. The
data is the "payload" of the file, the file contents, the stuff you
actually care about. I mean, you might also care about some of the
metadata (file name, creation/modification date), but that's probably
incidental to the data. The metadata includes the size of the data,
whether or not it's compressed, its checksum, the inode, owner, group,
POSIX permissions, SELinux label, etc.
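
You can see the split from userspace. A quick Python illustration (the
path is just an arbitrary example): stat() returns metadata the file
system keeps about the file, reading the file returns the data, and
the SELinux label rides along as an extended attribute:

import os, stat

path = "/etc/hostname"              # any readable file will do

st = os.stat(path)                  # metadata, kept in file system structures
print("inode:", st.st_ino)
print("size :", st.st_size, "bytes")
print("owner:", st.st_uid, "group:", st.st_gid)
print("mode :", stat.filemode(st.st_mode))   # POSIX permissions

try:
    print("label:", os.getxattr(path, "security.selinux"))
except OSError:
    pass                            # no SELinux label on this file/system

data = open(path, "rb").read()      # the data: the payload you care about
print("data :", data)

The compression state and per-extent checksums aren't visible through
stat(); those live in Btrfs's own metadata trees.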

> Oh I just noticed crc32c. That's acceptable.

This is the default. It's adequate for detecting incidental sources of
corruption. Since kernel 5.5 there's also xxhash64, which is about as
fast as crc32c, sometimes faster on some hardware. And for
cryptographic hashing, Btrfs offers blake2b (based on the SHA-3
runner-up BLAKE) and sha256. These are mkfs-time options.
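
Just to get a rough feel for the relative cost, here's a toy
comparison using Python's standard library. These are userspace
implementations, not the kernel code Btrfs actually uses; zlib's crc32
is plain crc32 rather than the crc32c variant, and xxhash64 is only
available as a third-party package, so it's left out:

import hashlib, time, zlib

payload = bytes(64 * 1024 * 1024)   # 64 MiB of zeros, purely for a rough feel

for name, fn in [
    ("crc32",   lambda d: zlib.crc32(d)),
    ("sha256",  lambda d: hashlib.sha256(d).digest()),
    ("blake2b", lambda d: hashlib.blake2b(d).digest()),
]:
    start = time.perf_counter()
    fn(payload)
    print(f"{name:8s} {time.perf_counter() - start:.3f} s")

The checksum is picked when the file system is created; the mkfs.btrfs
man page documents the switch (I believe it's --csum).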


-- 
Chris Murphy
_______________________________________________
test mailing list -- test@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to test-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/test@xxxxxxxxxxxxxxxxxxxxxxx



