Re: agenda for today's QA meeting

On 7/21/20 18:06, Chris Murphy wrote:
On Tue, Jul 21, 2020 at 3:36 PM pmkellly@xxxxxxxxxxxx
<pmkellly@xxxxxxxxxxxx> wrote:
The only ones I've ever seen (not a large population since
I've been a compression avoider) that approach lossless don't compress
much and only take out strings of the same byte value.

A very simple example is run length encoding.
https://en.wikipedia.org/wiki/Run-length_encoding


That's what I meant by "take out strings of the same byte value". I had just forgotten the name.
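
Just to make sure I have the idea straight again, here's a toy sketch of run-length encoding in Python. It's purely my own illustration of the principle, not what zram or any real compressor does:

# Toy run-length encoding: collapse runs of the same byte into (count, value) pairs.
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        runs.append((j - i, data[i]))  # (run length, byte value)
        i = j
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    return b"".join(bytes([value]) * count for count, value in runs)

sample = b"\x00" * 50 + b"ABC" + b"\x00" * 20
assert rle_decode(rle_encode(sample)) == sample  # lossless round trip

The long runs of zeros each collapse to a single pair, while the "ABC" part costs three pairs, which matches what I said about it only helping on strings of the same byte value.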

That is variable depending on the source, but quite a lot of human
produced material has a metric F ton of zeros in it, so it turns out
we get a lot of compressibility. This is used by the current zram
default algorithm, as well as lzo which handles the more complex data.
This is typically a 3 to 1 upwards of 4 to 1 compression ratio in my
testing, with a conservative 2 to 1 stated in the proposal.


Wow... I would never have guessed that the ratios had gotten so high, but I think I know why. Last holiday season I gave a show-and-tell to a group, and part of it was titled "File Bloat." I started with a simple text file that was 475 bytes. Then I showed them the very same text saved as various other file types with very simple formatting; in the worst case (.odt) the file was 22 KB with the exact same text. I had shown them printed and on-screen examples of each, and they were amazed at how little they got at the expense of their storage space. It didn't occur to me at the time, but I'm guessing that file bloat is a major source of the compression ratios you mentioned above.
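
Out of curiosity I measured something like that myself with Python's built-in zlib. This is only an illustration with made-up data, not the lzo or zstd paths zram actually uses, so the exact ratio doesn't mean much:

import zlib

# Crude stand-in for bloated, human-produced data: a little text padded with zeros.
payload = (b"Some actual text content." + b"\x00" * 200) * 1000

compressed = zlib.compress(payload)
print(len(payload), "->", len(compressed), "bytes, about",
      round(len(payload) / len(compressed), 1), ": 1")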

zstd is substantially more complex than lzo, or zlib, and produces
similar compression ratios to zlib but at a fraction of the CPU
requirement. You can compress and decompress things all day long for
weeks and months and years and 100% of the time get back identical
data bit for bit. That's the point of them. I can't really explain the
math but zstd is free open source software, so it is possible to
inspect it.

https://github.com/facebook/zstd


I've never written a compression algorithm, so I don't know anything about their innards, but I think I'll take a peek.


JPEG compression, on the other hand, is intentionally lossy.

I'll try to restrain myself from saying much about that particular blight on the land. I've talked to lots of folks about it, even walked some through it with examples. Most just don't care. It doesn't seem to matter how bad the picture gets; as long as they can recognize at all what the picture is of, it's okay. The main concern seems to be being able to save LOTS of pictures.

I am happy that my still camera has the ability to save pictures as RAW or uncompressed TIFF. I thought it was a sad day when they added compression to TIFF.

I'm off hand not aware of any lossy compression algorithms that claim

Back when I was working on those compression analyses, there were some very hot debates going on about lossless claims. I don't recall which ones, or whether they went to court, but I do recall reading some articles about it. As I recall, it was in the late '80s.

Anyway, short of
hardware defects, you can compress and decompress data or images using
lossless compression billions of times until the heat death of the
universe and get identical bits out. It's the same as 2+2=4 and 4=2+2.
Same exact information on both sides of the encoding. Anything else is
a hardware error, sunspots, cosmic rays, someone made a mistake in
testing, etc.

I really like your description. I see now how, if the compression is just removing same-value byte strings, it really can be truly lossless. As someone who has had to deal with it, I'll say that the extra-intense radiation from sunspots really does matter, and cosmic rays are ignored only at one's peril.
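
To convince myself about the bit-for-bit claim before peeking at the source, I ran a quick round trip with the third-party python-zstandard bindings (pip install zstandard). It's only my own sanity check, nothing to do with how Btrfs calls zstd:

import os
import zstandard

data = os.urandom(1 << 20) + b"\x00" * (1 << 20)  # random bytes plus a long zero run

cctx = zstandard.ZstdCompressor()
dctx = zstandard.ZstdDecompressor()

# Compress and decompress the same buffer repeatedly; it must come back identical.
for _ in range(100):
    assert dctx.decompress(cctx.compress(data)) == data
print("100 round trips, identical bits every time")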


I don't think asking questions is silly or a problem at all. It's the
jumping to conclusions that gave me the frowny face. :-)


I try not to, but I have a lot of "buy-in" to Fedora. When I read that there's a change coming and it's on a topic I've had some bad experience with... I apologize for jumping.

What's considered the meta data. Path to file, file name, file header,
file footer, data layout?

In a file system context, the metadata is the file system itself. The
data is the "payload" of the file, the file contents, the stuff you
actually care about. I mean, you might also care about some of the
metadata: file name, creation/modification date, but that's  probably
incidental to the data. The metadata includes the size of the data,
whether or not it's compressed, its checksum, the inode, owner, group,
posix permissions, selinux label, etc.


Thanks, I appreciate the help. Though I've written lots of software, this is my first time being involved with the innards of an OS. In my prior experience there either wasn't an OS or even an IDE, just my hex code, or else I got to take the OS for granted and wrote my code in an IDE.

Some years back I worked on an AI project for a few years. We had lots of discussions about data and metadata: what the terms should mean and what items belong in each category. One sort of profound conclusion we reached is that AI won't happen (in the Star Trek sense) until object-oriented programming really is object oriented, and to get there we must give up the von Neumann model for computers. One of the main things that means is that we must stop addressing things by where they are and start addressing them by what they are. I think I read once that a prototype had been built in hardware someplace and they were just getting started with testing it. Probably abandoned by now; first research machines are always very expensive and no one ever wants to invest for the long term.
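
Coming back to data versus metadata, here's how I picture the split now, just poking at an ordinary file with Python's standard os.stat (the file name is made up, and the Btrfs-specific metadata like checksums and compression flags isn't visible this way):

import os
import stat

path = "example.txt"            # hypothetical file
with open(path, "w") as f:      # the data: the payload we actually care about
    f.write("the file contents\n")

st = os.stat(path)              # some of the metadata the file system keeps about it
print("size:    ", st.st_size)                  # size of the data
print("inode:   ", st.st_ino)                   # inode number
print("owner:   ", st.st_uid, "/", st.st_gid)   # owner and group
print("mode:    ", stat.filemode(st.st_mode))   # posix permissions
print("modified:", st.st_mtime)                 # modification time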

Oh I just noticed crc32c. That's acceptable.

This is the default. It's acceptable for detecting incidental sources
of corruption. Since kernel 5.5 there's also xxhash64, which is about
as fast as crc32c, sometimes faster on some hardware. And for
cryptographic hashing Btrfs offers blake2b (SHA3 runner up) and
sha256. These are mkfs time options.


My experience with CRC32 was in a hardware implementation. We needed good data integrity, and the memory controller chip we chose had CRC32 built in. I forget how many bits we added to each word to hold the check bits, but I think it was four. I can imagine the uproarious laughter that would result if someone at a gathering of PC folks suggested that new memory modules should include extra bits to support hardware CRC.
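
For what it's worth, the software version of what that memory controller was doing is easy to play with; Python's standard library has crc32 and the cryptographic hashes you mentioned. This is just an illustration of checksumming in general (and zlib.crc32 is plain CRC32, not the crc32c variant), not how Btrfs computes or stores its checksums:

import zlib
import hashlib

block = b"some file system block contents " * 64

crc = zlib.crc32(block)                      # fast check for incidental corruption
sha = hashlib.sha256(block).hexdigest()      # cryptographic hash
b2  = hashlib.blake2b(block).hexdigest()     # blake2b, also in the standard library

# Flip a single bit and every check changes, so the corruption is detectable.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
assert zlib.crc32(corrupted) != crc
assert hashlib.sha256(corrupted).hexdigest() != sha
assert hashlib.blake2b(corrupted).hexdigest() != b2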

	Have a Great Day!

	Pat	(tablepc)
_______________________________________________
test mailing list -- test@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to test-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/test@xxxxxxxxxxxxxxxxxxxxxxx



