Re: agenda for today's QA meeting

On 7/21/20 18:06, Chris Murphy wrote:
On Tue, Jul 21, 2020 at 3:36 PM pmkellly@xxxxxxxxxxxx
<pmkellly@xxxxxxxxxxxx> wrote:
The only ones I've ever seen (not a large population since
I've been a compression avoider) that approach lossless don't compress
much and only take out strings of the same byte value.

A very simple example is run length encoding.
https://en.wikipedia.org/wiki/Run-length_encoding


That's what I meant by "take out strings of the same byte value". I had just forgotten the name.
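
Just to make sure I have the idea straight again, here's a toy sketch of run-length encoding in Python. It's purely my own illustration of the principle, not what zram or any real compressor does:

# Toy run-length encoding: collapse runs of the same byte into (count, value) pairs.
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        runs.append((j - i, data[i]))  # (run length, byte value)
        i = j
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    return b"".join(bytes([value]) * count for count, value in runs)

sample = b"\x00" * 50 + b"ABC" + b"\x00" * 20
assert rle_decode(rle_encode(sample)) == sample  # lossless round trip

The long runs of zeros each collapse to a single pair, while the "ABC" part costs three pairs, which matches what I said about it only helping on strings of the same byte value.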

That is variable depending on the source, but quite a lot of human
produced material has a metric F ton of zeros in it, so it turns out
we get a lot of compressibility. This is used by the current zram
default algorithm, as well as lzo which handles the more complex data.
This is typically a 3 to 1 upwards of 4 to 1 compression ratio in my
testing, with a conservative 2 to 1 stated in the proposal.


Wow... I would never have guessed that the ratios had gotten so high, but I think I know why. Last holiday season I gave a show-and-tell to a group, and part of it was titled "File Bloat." I started with a simple text file that was 475 bytes. Then I showed them the very same text saved as various other file types with very simple formatting; in the worst case (.odt) the file was 22 KB with the exact same text. I had shown them printed and on-screen examples of each, and they were amazed at how little they got at the expense of their storage space. It didn't occur to me at the time, but I'm guessing that file bloat is a major source of the compression ratios you mentioned above.
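
Out of curiosity I measured something like that myself with Python's built-in zlib. This is only an illustration with made-up data, not the lzo or zstd paths zram actually uses, so the exact ratio doesn't mean much:

import zlib

# Crude stand-in for bloated, human-produced data: a little text padded with zeros.
payload = (b"Some actual text content." + b"\x00" * 200) * 1000

compressed = zlib.compress(payload)
print(len(payload), "->", len(compressed), "bytes, about",
      round(len(payload) / len(compressed), 1), ": 1")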

zstd is substantially more complex than lzo, or zlib, and produces
similar compression ratios to zlib but at a fraction of the CPU
requirement. You can compress and decompress things all day long for
weeks and months and years and 100% of the time get back identical
data bit for bit. That's the point of them. I can't really explain the
math but zstd is free open source software, so it is possible to
inspect it.

https://github.com/facebook/zstd


I've never written a compression algorithm, so I don't know anything about their innards, but I think I'll take a peek.


JPEG compression, on the other hand, is intentionally lossy.

I'll try to restrain myself from saying much about that particular blight on the land. I've talked to lots of folks about it, even walked some through it with examples. Most just don't care. It doesn't seem to matter how bad the picture gets; as long as they can recognize at all what the picture is of, it's okay. The main concern seems to be being able to save LOTS of pictures.

I am happy that my still camera has the ability to save pictures as RAW or uncompressed TIFF. I thought it was a sad day when they added compression to TIFF.

I'm off hand not aware of any lossy compression algorithms that claim

Back when I was working on those compression analyses, there were some very hot debates going on about lossless claims. I don't recall which ones, or whether they went to court, but I do recall reading some articles about it. As I recall, it was in the late '80s.

Anyway, short of
hardware defects, you can compress and decompress data or images using
lossless compression billions of times until the heat death of the
universe and get identical bits out. It's the same as 2+2=4 and 4=2+2.
Same exact information on both sides of the encoding. Anything else is
a hardware error, sunspots, cosmic rays, someone made a mistake in
testing, etc.

I really like your description. I see now how, if the compression is just removing same-value byte strings, it really can be truly lossless. As someone who has had to deal with it, I'll say that the extra-intense radiation from sunspots really does matter, and cosmic rays are ignored only at one's peril.
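
To convince myself about the bit-for-bit claim before peeking at the source, I ran a quick round trip with the third-party python-zstandard bindings (pip install zstandard). It's only my own sanity check, nothing to do with how Btrfs calls zstd:

import os
import zstandard

data = os.urandom(1 << 20) + b"\x00" * (1 << 20)  # random bytes plus a long zero run

cctx = zstandard.ZstdCompressor()
dctx = zstandard.ZstdDecompressor()

# Compress and decompress the same buffer repeatedly; it must come back identical.
for _ in range(100):
    assert dctx.decompress(cctx.compress(data)) == data
print("100 round trips, identical bits every time")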


I don't think asking questions is silly or a problem at all. It's the
jumping to conclusions that gave me the frowny face. :-)


I try not to, but I have a lot of "buy-in" to Fedora. When I read that there's a change coming and it's on a topic I've had some bad experience with... I apologize for jumping.

What's considered the meta data. Path to file, file name, file header,
file footer, data layout?

In a file system context, the metadata is the file system itself. The
data is the "payload" of the file, the file contents, the stuff you
actually care about. I mean, you might also care about some of the
metadata: file name, creation/modification date, but that's  probably
incidental to the data. The metadata includes the size of the data,
whether or not it's compressed, its checksum, the inode, owner, group,
posix permissions, selinux label, etc.


Thanks, I appreciate the help. Though I've written lots of software, this is my first time being involved with the innards of an OS. In my prior experience there either wasn't an OS or even an IDE, just my hex code, or else I got to take the OS for granted and wrote my code in an IDE.

Some years back I worked on an AI project for a few years. We had lots of discussions about data and metadata: what the terms should mean and what items belong in each category. One sort of profound conclusion we reached is that AI won't happen (in the Star Trek sense) until object-oriented programming really is object oriented, and to get there we must give up the von Neumann model for computers. One of the main things that means is that we must stop addressing things by where they are and start addressing them by what they are. I think I read once that a prototype had been built in hardware someplace and they were just getting started with testing it. Probably abandoned by now; first research machines are always very expensive and no one ever wants to invest for the long term.
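
Coming back to data versus metadata, here's how I picture the split now, just poking at an ordinary file with Python's standard os.stat (the file name is made up, and the Btrfs-specific metadata like checksums and compression flags isn't visible this way):

import os
import stat

path = "example.txt"            # hypothetical file
with open(path, "w") as f:      # the data: the payload we actually care about
    f.write("the file contents\n")

st = os.stat(path)              # some of the metadata the file system keeps about it
print("size:    ", st.st_size)                  # size of the data
print("inode:   ", st.st_ino)                   # inode number
print("owner:   ", st.st_uid, "/", st.st_gid)   # owner and group
print("mode:    ", stat.filemode(st.st_mode))   # posix permissions
print("modified:", st.st_mtime)                 # modification time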

Oh I just noticed crc32c. That's acceptable.

This is the default. It's acceptable for detecting incidental sources
of corruption. Since kernel 5.5 there's also xxhash64, which is about
as fast as crc32c, sometimes faster on some hardware. And for
cryptographic hashing Btrfs offers blake2b (SHA3 runner up) and
sha256. These are mkfs time options.


My experience with CRC32 was in a hardware implementation. We needed good data integrity, and the memory controller chip we chose had CRC32 built in. I forget how many bits we added to each word to hold the check bits, but I think it was four. I can imagine the uproarious laughter that would result if someone at a gathering of PC folks suggested that new memory modules should include extra bits to support hardware CRC.
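
For what it's worth, the software version of what that memory controller was doing is easy to play with; Python's standard library has crc32 and the cryptographic hashes you mentioned. This is just an illustration of checksumming in general (and zlib.crc32 is plain CRC32, not the crc32c variant), not how Btrfs computes or stores its checksums:

import zlib
import hashlib

block = b"some file system block contents " * 64

crc = zlib.crc32(block)                      # fast check for incidental corruption
sha = hashlib.sha256(block).hexdigest()      # cryptographic hash
b2  = hashlib.blake2b(block).hexdigest()     # blake2b, also in the standard library

# Flip a single bit and every check changes, so the corruption is detectable.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
assert zlib.crc32(corrupted) != crc
assert hashlib.sha256(corrupted).hexdigest() != sha
assert hashlib.blake2b(corrupted).hexdigest() != b2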

	Have a Great Day!

	Pat	(tablepc)
_______________________________________________
test mailing list -- test@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to test-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/test@xxxxxxxxxxxxxxxxxxxxxxx



