On 7/21/20 18:06, Chris Murphy wrote:
On Tue, Jul 21, 2020 at 3:36 PM pmkellly@xxxxxxxxxxxx
<pmkellly@xxxxxxxxxxxx> wrote:
The only ones I've ever seen (not a large population, since I've been a
compression avoider) that approach lossless don't compress much and only
take out strings of the same byte value.
A very simple example is run length encoding.
https://en.wikipedia.org/wiki/Run-length_encoding
That's what I meant by "take out strings of the same byte value". I had
just forgotten the name.
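For anyone else following along, the whole idea fits in a few lines of
Python. This is only a toy sketch of run-length encoding, nothing like
what zram actually runs, but it shows the lossless round trip:

# Toy run-length encoding: collapse runs of the same byte into
# (count, value) pairs, then expand them back. Real compressors are
# far more elaborate, but getting identical bytes back is the point.
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    runs = []
    for b in data:
        if runs and runs[-1][1] == b and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, b)
        else:
            runs.append((1, b))
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    return b"".join(bytes([value]) * count for count, value in runs)

original = b"AAAABBB\x00\x00\x00\x00\x00CD"
assert rle_decode(rle_encode(original)) == original  # identical bytes back
print(rle_encode(original))  # [(4, 65), (3, 66), (5, 0), (1, 67), (1, 68)]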
That is variable depending on the source, but quite a lot of human-produced
material has a metric F ton of zeros in it, so it turns out
we get a lot of compressibility. This is used by the current zram
default algorithm, as well as lzo, which handles the more complex data.
This typically comes out to a 3-to-1, upwards of 4-to-1, compression ratio
in my testing, with a conservative 2-to-1 stated in the proposal.
Wow... I would never have guessed that the ratios had gotten so high. I
think I know why. Last holiday season I gave a show and tell to a group,
and part of it was titled File Bloat. I started with a simple text file
that had a size of 475 bytes. Then I showed them the very same text
saved as various other file types with very simple formatting, and in the
worst case (.odt) the file was 22 KB with the exact same text. I
showed them printed and on-screen examples of each, and they were amazed
at how little they got at the expense of their storage space. It didn't
occur to me at the time, but I'm guessing that file bloat is a major
source of the compression ratios you mentioned above.
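For anyone who wants to see the effect, a rough check only takes a few
lines with Python's standard zlib. That's not the lzo/zstd path zram
actually uses, and a toy buffer like this compresses far better than real
memory pages would, so treat the number as illustrative only:

# Repetitive, zero-heavy data compresses extremely well; real memory
# contents sit somewhere between this toy buffer and random data.
import zlib

sparse = b"\x00" * 3072 + b"some actual text content here " * 32
compressed = zlib.compress(sparse)
print(len(sparse), "->", len(compressed),
      "ratio about", round(len(sparse) / len(compressed), 1), "to 1")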
zstd is substantially more complex than lzo or zlib, and produces
similar compression ratios to zlib but at a fraction of the CPU
requirement. You can compress and decompress things all day long for
weeks and months and years and 100% of the time get back identical
data bit for bit. That's the point of them. I can't really explain the
math but zstd is free open source software, so it is possible to
inspect it.
https://github.com/facebook/zstd
I've never written a compression algorithm, so I don't know anything
about their innards, but I think I'll take a peek.
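In the meantime, the round-trip guarantee itself is easy to check with
whatever codec is handy. This sketch uses zlib from the standard library;
the separate python-zstandard package makes the same promise for zstd
itself:

# Lossless round trip: whatever goes in comes back bit for bit.
import os
import zlib

payload = os.urandom(4096) + b"\x00" * 4096   # random half, zero half
for _ in range(1000):
    assert zlib.decompress(zlib.compress(payload)) == payload
print("1000 round trips, bit-for-bit identical")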
JPEG compression, on the other hand, is intentionally lossy.
I'll try to restrain myself from saying much about that particular
blight on the land. I've talked to lots of folks about it; even walked
some through it with examples. Most just don't care. It doesn't seem
to matter how bad the picture gets; as long as they can recognize at all
what the picture is of, it's okay. The main concern seems to be being
able to save LOTS of pictures.
I am happy that my still camera has the ability to save pictures as RAW
or uncompressed TIF(F). I thought it was a sad day when they added
compression to TIFF.
I'm offhand not aware of any lossy compression algorithms that claim to be lossless.
Back when I was working on those compression analyses there were some
very hot debates going on about lossless claims. I don't recall which
ones or if they went to court, but I do recall reading some articles
about it. As I recall it was in the late '80s.
Anyway, short of
hardware defects, you can compress and decompress data or images using
lossless compression billions of times until the heat death of the
universe and get identical bits out. It's the same as 2+2=4 and 4=2+2.
Same exact information on both sides of the encoding. Anything else is
a hardware error, sunspots, cosmic rays, someone made a mistake in
testing, etc.
I really like your description. I see now how, if the compression is
just removing same-value byte strings, it really can be truly
lossless. As someone who has had to deal with it, I'll say that the extra
intense radiation from sunspots really does matter, and cosmic rays are
ignored only at one's peril.
I don't think asking questions is silly or a problem at all. It's the
jumping to conclusions that gave me the frowny face. :-)
I try not to, but I have a lot of "buy-in" to Fedora. When I read that
there's a change coming and it's on a topic I've had some bad experience
with... I apologize for jumping.
What's considered the metadata? Path to file, file name, file header,
file footer, data layout?
In a file system context, the metadata is the file system itself. The
data is the "payload" of the file, the file contents, the stuff you
actually care about. I mean, you might also care about some of the
metadata: file name, creation/modification date, but that's probably
incidental to the data. The metadata includes the size of the data,
whether or not it's compressed, its checksum, the inode, owner, group,
POSIX permissions, SELinux label, etc.
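A quick way to see part of that split is to stat a file: the contents are
the data, and what stat() reports is a slice of the metadata the file
system keeps about it (SELinux labels and Btrfs checksums live in the file
system too, they just aren't part of plain stat output). For example, in
Python:

# Everything printed here is metadata; none of it is the file's contents.
import os
import stat

info = os.stat("/etc/hostname")        # any existing file will do
print("size: ", info.st_size)          # size of the data
print("inode:", info.st_ino)
print("owner:", info.st_uid, info.st_gid)
print("mode: ", stat.filemode(info.st_mode))   # POSIX permissions
print("mtime:", info.st_mtime)         # last modification time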
Thanks, I appreciate the help. Though I've written lots of software, this
is my first time being involved with the innards of an OS. In my prior
experience there either wasn't an OS or even an IDE, just my hex code,
or I just got to take the OS for granted and wrote my code in an IDE.
Some years back I worked on an AI project for a few years. We had lots
of discussions about data and metadata, like what the terms should mean
and what items belong in each category. One sort of profound conclusion
we reached is that AI won't happen (in the Star Trek sense) until object-oriented
programming really is object oriented. And to get truly object oriented we must
give up the von Neumann model for computers. One of the main things that
means is that we must stop addressing things by where they are and start
addressing them by what they are. I think I read once that there was a
prototype built in hardware someplace and they were just getting started
with the testing of the hardware. Probably abandoned by now. First
research machines are always very expensive and no one ever wants to
invest for the long term.
Oh I just noticed crc32c. That's acceptable.
This is the default. It's acceptable for detecting incidental sources
of corruption. Since kernel 5.5 there's also xxhash64, which is about
as fast as crc32c, sometimes faster on some hardware. And for
cryptographic hashing Btrfs offers blake2b (the SHA-3 runner-up) and
sha256. These are mkfs-time options.
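If you want a feel for the outputs, the cryptographic ones are a
one-liner each with Python's hashlib. Note that crc32c and xxhash64
aren't in the standard library (plain zlib.crc32 uses a different
polynomial than the crc32c Btrfs uses, and xxhash64 needs the
third-party xxhash package), so this is only an illustration:

# Same block of data, different check values.
import hashlib
import zlib

block = b"some file contents " * 256

print("crc32:  ", hex(zlib.crc32(block)))             # 32-bit check value
print("sha256: ", hashlib.sha256(block).hexdigest())  # 256-bit digest
print("blake2b:", hashlib.blake2b(block).hexdigest()) # 512-bit by default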
My experience with crc32 was in a hardware implementation. We needed
good data integrity and the memory controller chip we chose had crc32
built in. I forget how many bits we added to each word to save the check
bits, but I think it was four. I can imagine the uproarious laughter
that would result if someone at a gathering of PC folks suggested that
new memory modules should include extra bits to support hardware CRC.
Have a Great Day!
Pat (tablepc)