Re: Proposal: Faster composes by eliminating deltarpms and using zchunked rpms instead

Jonathan Dieter <jdieter@xxxxxxxxx> · Sat, 17 Nov 2018 17:24:30 +0000

Neal, thanks so much for your thoughts on this.  Responses inline:

On Sat, 2018-11-17 at 09:53 -0500, Neal Gompa wrote:
<snip>
> If we're really considering changing the RPM file format, then we need
> a proper discussion on rpm-maint@ and rpm-ecosystem@ mailing lists on
> rpm.org. Can you please start a targeted discussion there?

Sure.

> But addressing the specific concrete suggestion here, there's a few
> concerns I have:
> 
> 1. This is a huge format break, which means that for the first time in
> a _very_ long time, it would not be possible to reuse RHEL for Fedora
> infrastructure _at all_. That's going to be a difficult problem.
> There's a large legacy of systems that won't be able to handle that
> new format, and unfortunately, rpm is not parallel installable in the
> same manner as something like GCC or Python currently. Making it
> parallel installable *is* possible (I've done it, and there have been
> other attempts before), but it's not a supported thing. This is
> probably the thing that would trigger a major version bump for RPM,
> since it's a new archive format.

Agreed that this would be a massive format change and should therefore
be a major version bump for RPM.  New versions of RPM should still be
able to read and install old-format rpms, but, as you point out, old
versions of RPM won't be able to read or install new-format rpms.
Unfortunately, I don't see any way around this.

> 2. This also means the _entire_ ecosystem of RPM archive parsers will
> break. This is not particularly insurmountable, actually, as the RPM
> file format was not particularly well documented, and a new format is
> an opportunity to revisit some of those old issues and try to do a
> better job this go around. But it's still a challenge to deal with.

Yes, this is going to be quite a bit of work.

> 3. When you refer to the rpm cpio, I assume you're referring to only
> the archive payload, right? Typically the payload is what is
> compressed, and the headers are not. It sounds like you're proposing
> both aspects to be compressed, and compressed differently. If we made
> the RPM header an uncompressed zchunk stream and the RPM payload a
> zstd-compressed zchunk stream, would we be able to support fetching
> header deltas for retrieving extra information on the fly? Say, for
> example, attributes like arch color, filecap properties, and so on,
> that aren't in the rpm-md data for things like transaction tests
> without the whole RPM?

Yes, I'm referring to the archive payload as the cpio.  The zchunk
format supports the idea of separate data streams, and I was planning
to use that to put the headers in one stream and the archive payload in
another.  If the header chunks are first in the zchunk file, then they
could be read without needing to read any of the rest of the file. 
And, yes, we could make the header stream uncompressed if that made it
easier to parse.
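
As a rough illustration of why putting the header chunks first helps,
here is a minimal Python sketch.  It uses a toy container layout (a
chunk count, a small index of stream id and length per chunk, then the
chunk data), which is an assumption made for the example and NOT the
real zchunk on-disk format.  The names HEADER_STREAM, PAYLOAD_STREAM,
pack() and read_leading_stream() are hypothetical.

```python
import io
import struct

HEADER_STREAM = 0   # hypothetical stream id for the rpm header
PAYLOAD_STREAM = 1  # hypothetical stream id for the archive payload


def pack(chunks):
    """Build a toy container: chunk count, (stream, length) index, then data."""
    index = struct.pack("<I", len(chunks))
    for stream, data in chunks:
        index += struct.pack("<BI", stream, len(data))
    return index + b"".join(data for _, data in chunks)


def read_leading_stream(fileobj, wanted=HEADER_STREAM):
    """Read the index, then only the leading chunks that belong to `wanted`."""
    (count,) = struct.unpack("<I", fileobj.read(4))
    entries = [struct.unpack("<BI", fileobj.read(5)) for _ in range(count)]
    parts = []
    for stream, length in entries:
        if stream != wanted:
            break  # header chunks come first, so stop at the first other chunk
        parts.append(fileobj.read(length))
    return b"".join(parts)


blob = pack([
    (HEADER_STREAM, b"name=foo"),
    (HEADER_STREAM, b"arch=x86_64"),
    (PAYLOAD_STREAM, b"x" * 1000),
])
f = io.BytesIO(blob)
header = read_leading_stream(f)
bytes_read = f.tell()  # stays well short of len(blob): payload never touched
```

Over HTTP the same idea would presumably become a range request for
the leading bytes of the file instead of a local read.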

> 4. I'd actually rather make it easier for the header streams to be
> fetched instead of trying to make specific attributes easier in the
> header payload. History has shown that any attempt at foresight here
> tends to fail miserably, and common attributes are already specified
> in the rpm-md primary.xml anyway, so if you're fetching the header to
> retrieve an attribute, you *need* to do something weird anyway.

The main purpose of putting separate attributes in the zchunk header is
so programs like 'file' can determine some basic information about an
rpm without needing to parse the full rpm header.  This data would also
be in the rpm header, so programs that read the rpm header wouldn't
care about the attributes in the zchunk header.

> 5. I'm not exactly sure what you mean by zchunk signing...

The zchunk format supports signing, but just for the zchunk header. 
Because the header contains the checksums for each chunk, this
establishes a chain of trust for verifying the whole file.  Which
brings me to...

> 6. I'm wondering why we can't do a perfect reconstruction of the
> original RPM, given two RPM sources that are both zchunked? We can
> pull it off with repodata, so what's different about RPM that makes
> that not doable?

The problem is that, unlike the repodata, once an rpm is installed, the
package file is deleted and the data is only available on the system in
its uncompressed installed form.  If we're trying to use that data to
rebuild an rpm, we have two options.

   1. Compress the data using the same method that was used to create the
      original rpm.  This is what applydeltarpm does, and is why it's so
      heavy on the CPU.
   2. Store the data uncompressed in the rebuilt rpm.  This isn't feasible
      with deltarpm, but, if we store both compressed hashes and
      uncompressed hashes in the zchunk header, we can do this in zchunk. 
      When checking the signature, zchunk verifies the header against
      the signature first, and then checks each chunk to see if it
      matches *either* the compressed or the uncompressed hash.
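
To make the second option concrete, here is a minimal Python sketch of
the verification order described above: check the header against its
signature first, then accept each chunk if it matches either recorded
hash.  Several things here are assumptions made for the example and
not how zchunk itself does it: zlib stands in for zstd, an HMAC stands
in for a real signature, and the header is just a flat run of SHA-256
digests.  All function names are hypothetical.

```python
import hashlib
import hmac
import zlib

DIGEST_LEN = 32  # SHA-256 digest size in bytes


def digest(data):
    return hashlib.sha256(data).digest()


def build_header(raw_chunks):
    """Toy header: per chunk, the compressed digest then the uncompressed one."""
    return b"".join(digest(zlib.compress(c)) + digest(c) for c in raw_chunks)


def sign(header, key):
    # Stand-in for a real signature; HMAC keeps the sketch stdlib-only.
    return hmac.new(key, header, hashlib.sha256).digest()


def verify(header, signature, stored_chunks, key):
    """Verify the header first, then each chunk against either digest."""
    if not hmac.compare_digest(sign(header, key), signature):
        return False
    entry_len = 2 * DIGEST_LEN
    if len(header) != entry_len * len(stored_chunks):
        return False
    for i, stored in enumerate(stored_chunks):
        entry = header[entry_len * i:entry_len * (i + 1)]
        compressed_digest = entry[:DIGEST_LEN]
        uncompressed_digest = entry[DIGEST_LEN:]
        if digest(stored) not in (compressed_digest, uncompressed_digest):
            return False
    return True


key = b"example-key"
raw = [b"a" * 100, b"b" * 50]
header = build_header(raw)
signature = sign(header, key)

original = [zlib.compress(c) for c in raw]  # chunks as originally published
rebuilt = [zlib.compress(raw[0]), raw[1]]   # second chunk stored uncompressed
```

In this sketch both verify(header, signature, original, key) and
verify(header, signature, rebuilt, key) succeed, so a chunk taken from
the installed files never has to be recompressed.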

I hope this makes my thought process on this part clearer.

Jonathan