Re: agenda for today's QA meeting

On Wed, Jul 22, 2020 at 8:54 AM pmkellly@xxxxxxxxxxxx
<pmkellly@xxxxxxxxxxxx> wrote:
>
>
>
> On 7/21/20 18:06, Chris Murphy wrote:

> > That varies depending on the source, but quite a lot of human-produced
> > material has a metric F ton of zeros in it, so it turns out we get a
> > lot of compressibility. This is exploited by the current zram default
> > algorithm, as well as by lzo, which handles the more complex data.
> > That's typically a 3:1, upwards of 4:1, compression ratio in my
> > testing, with a conservative 2:1 stated in the proposal.
> >
>
> Wow... I would never have guessed that the ratios had gotten so high. I
> think I know why. Last holiday season I gave a show-and-tell to a group,
> and part of it was titled File Bloat. I started with a simple text file
> that had a size of 475 bytes. Then I showed them the very same text
> saved as various other file types with very simple formatting, and in
> the worst case (.odt) the file was 22 KB with the exact same text. I
> showed them printed and on-screen examples of each, and they were amazed
> at how little they got at the expense of their storage space. It didn't
> occur to me at the time, but I'm guessing that file bloat is a major
> source of the compression ratios you mentioned above.

I'm not super familiar with the math involved, but I've read in the zstd
and xz source materials that sophisticated algorithms depend on having
more data available to do a good job of compressing. That's why, for
small data sets, they offer the option to build a dictionary using a
training mode on a sample data set; the dictionary gives the algorithm a
head start on the redundancies in that data set.
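
For example, here's a minimal sketch of that training mode using the
third-party python-zstandard bindings (the package and the toy corpus
are my assumptions for illustration, and a corpus this small may only
show a modest win; the zstd command line tool exposes the same idea via
zstd --train):

# Sketch: train a zstd dictionary on many small, similar samples, then
# compress one sample with and without it to compare sizes.
# Assumes the third-party python-zstandard package (pip install zstandard).
import zstandard

# Stand-ins for many small, similar records (log lines, JSON docs, etc.).
samples = [
    f'{{"user": "user{i}", "status": "ok", "message": "hello from record {i}"}}'.encode()
    for i in range(5000)
]

# Train a 4 KiB dictionary from the samples.
dict_data = zstandard.train_dictionary(4096, samples)

plain = zstandard.ZstdCompressor()
trained = zstandard.ZstdCompressor(dict_data=dict_data)

target = samples[0]
print("without dictionary:", len(plain.compress(target)), "bytes")
print("with dictionary:   ", len(trained.compress(target)), "bytes")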


> > JPEG compression, on the other hand, is intentionally lossy.
>
> I'll try to restrain myself from saying much about that particular
> blight on the land. I've talked to lots of folks about it; even walked
> some through it with examples. Most just don't care. It doesn't seem to
> matter how bad the picture gets; as long as they can recognize at all
> what the picture is of, it's okay. The main concern seems to be being
> able to save LOTS of pictures.
>
> I am happy that my still camera has the ability to save pictures as RAW
> or uncompressed TIF(F). I thought it was a sad day when they added
> compression to TIF.

It depends on the compression. TIFF supports arbitrary compression (it
just gets added as a new tag, so what's supported is app-specific), but
the commonly supported algorithms for TIFF are JPEG, ZIP, and LZW. The
first is lossy; the second two are lossless. You can decompress and
recompress a billion times, back and forth between ZIP and LZW, and
always get back bits identical to the original. (Bugs and hardware
anomalies excluded, because those can also hit uncompressed data.)
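
To make "lossless" concrete, here's a quick sketch using Python's
standard zlib module, which implements the same Deflate algorithm that
TIFF's ZIP compression uses, round-tripping data many times and
checking that the bits never change:

# Sketch: round-trip data through lossless (Deflate/"ZIP") compression
# many times and verify the result is bit-identical to the original.
import os
import zlib

original = os.urandom(64 * 1024) + b"\x00" * (64 * 1024)  # mixed payload

data = original
for _ in range(1000):
    data = zlib.decompress(zlib.compress(data))

# Lossless means identical bits out, every time.
assert data == original
print("1000 round trips, bits identical")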


>
> > I'm offhand not aware of any lossy compression algorithms that claim
>
> Back when I was working on those compression analyses, there were some
> very hot debates going on about lossless claims. I don't recall which
> ones, or whether they went to court, but I do recall reading some
> articles about it. As I recall, it was in the late '80s.

[Virtually, Visually, Effectively, Essentially] lossless? Yes, I take
a very dim view of this.


>
> > Anyway, short of
> > hardware defects, you can compress and decompress data or images using
> > lossless compression billions of times until the heat death of the
> > universe and get identical bits out. It's the same as 2+2=4 and 4=2+2.
> > Same exact information on both sides of the encoding. Anything else is
> > a hardware error, sunspots, cosmic rays, someone made a mistake in
> > testing, etc.
>
> I really like your description. I see now that if the compression is
> just removing runs of same-value bytes, it really can be truly
> lossless. As someone who has had to deal with it, I'll say that the
> extra-intense radiation from sunspots really does matter, and cosmic
> rays are ignored only at one's peril.

Oh for sure. On all counts.


> > I don't think asking questions is silly or a problem at all. It's the
> > jumping to conclusions that gave me the frowny face. :-)
> >
>
> I try not to, but I have a lot of "buy-in" to Fedora. When I read that
> there's a change coming and it's on a topic I've had some bad experience
> with... I apologize for jumping.

Remain sceptical! I do not want Fedora users bitten for any reason,
but we know this is going to happen, because they already do get bitten
from time to time. It's just that we're used to that pattern. And the
big changes, including Btrfs, come with a certain amount of "exchanging
problems that we know for problems that we don't know." So we have to
learn them. Fortunately the Btrfs change owners have been using it for
a long time and are familiar with where the bodies are buried. That is
not exactly the best descriptive material for a Fedora Magazine
article. :D

But testers are necessarily going to get hammered a bit harder by
changes, no matter how transparent those changes are intended to be for
regular users, because testers want to understand the problem well
enough to know that it is a problem, whether it may be a blocker, etc.

So yeah, it's reasonable to be sceptical of the change and to be
critical if something is really obviously not transparent.


> >> What's considered the metadata? Path to file, file name, file header,
> >> file footer, data layout?
> >
> > In a file system context, the metadata is the file system itself. The
> > data is the "payload" of the file, the file contents, the stuff you
> > actually care about. I mean, you might also care about some of the
> > metadata: file name, creation/modification date, but that's probably
> > incidental to the data. The metadata includes the size of the data,
> > whether or not it's compressed, its checksum, the inode, owner, group,
> > posix permissions, selinux label, etc.
> >
>
> Thanks, I appreciate the help. Though I've written lots of software,
> this is my first time being involved with the innards of an OS. In my
> prior experience there either wasn't an OS or even an IDE, just my hex
> code, or I just got to take the OS for granted and wrote my code in an
> IDE. Some years back I worked on an AI project for a few years. We had
> lots of discussions about data and metadata, like what the terms should
> mean and what items belong in each category. One sort of profound
> conclusion we reached is that AI won't happen (in the Star Trek sense)
> until object-oriented programming really is object oriented. And to get
> real object orientation we must give up the von Neumann model for
> computers. One of the main things that means is that we must stop
> addressing things by where they are and start addressing them by what
> they are. I think I read once that there was a prototype built in
> hardware someplace and they were just getting started with testing the
> hardware. It's probably abandoned by now. First research machines are
> always very expensive and no one ever wants to invest for the long term.
>
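
On the metadata question above: to make the data vs. metadata split
concrete, here's a minimal sketch (the file name is hypothetical) that
separates a file's payload from some of what the file system records
about it:

# Sketch: a file's *data* is its contents; the *metadata* is what the
# file system keeps about it (size, inode, owner, mode, timestamps...).
import os

path = "example.txt"  # hypothetical file for illustration
with open(path, "wb") as f:
    f.write(b"the payload you actually care about")

with open(path, "rb") as f:
    data = f.read()   # the data
st = os.stat(path)    # a slice of the metadata

print("data:    ", data)
print("size:    ", st.st_size)
print("inode:   ", st.st_ino)
print("owner:   ", st.st_uid, st.st_gid)
print("mode:    ", oct(st.st_mode))
print("modified:", st.st_mtime)
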
> >> Oh, I just noticed crc32c. That's acceptable.
> >
> > This is the default. It's acceptable for detecting incidental sources
> > of corruption. Since kernel 5.5 there's also xxhash64, which is about
> > as fast as crc32c, sometimes faster on some hardware. And for
> > cryptographic hashing Btrfs offers blake2b (the SHA-3 runner-up) and
> > sha256. These are mkfs-time options.
> >
>
> My experience with crc32 was in a hardware implementation. We needed
> good data integrity, and the memory controller chip we chose had crc32
> built in. I forget how many bits we added to each word to save the check
> bits, but I think it was four. I can imagine the uproarious laughter
> that would result if someone at a gathering of PC folks suggested that
> new memory modules should include extra bits to support hardware CRC.


On Btrfs, it's 4 bytes of crc32c per 4 KiB data block and, at least by
default, 4 bytes of crc32c per 16 KiB metadata node/leaf block (the max
metadata block size is 64 KiB). Computationally the latency is
negligible, even without hardware acceleration support, though in some
workloads it can show up in IO latency.
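
For a sense of scale, here's a back-of-the-envelope sketch (arithmetic
only, using the block sizes above) of what those checksums cost in
space:

# Sketch: space overhead of Btrfs's default per-block checksums,
# using the sizes mentioned above: 4 bytes of crc32c per 4 KiB data
# block and 4 bytes per 16 KiB metadata block.
DATA_BLOCK = 4 * 1024   # bytes
META_BLOCK = 16 * 1024  # bytes
CSUM_SIZE = 4           # bytes of crc32c per block

data_overhead = CSUM_SIZE / DATA_BLOCK   # ~0.0977%
meta_overhead = CSUM_SIZE / META_BLOCK   # ~0.0244%

print(f"data checksum overhead:     {data_overhead:.4%}")
print(f"metadata checksum overhead: {meta_overhead:.4%}")

# For 1 TiB of data, the checksums themselves take about 1 GiB.
tib = 1024 ** 4
print(f"checksums for 1 TiB of data: {tib * data_overhead / 1024**2:.0f} MiB")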

-- 
Chris Murphy
_______________________________________________
test mailing list -- test@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to test-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/test@xxxxxxxxxxxxxxxxxxxxxxx



