Re: Proposal: Faster composes by eliminating deltarpms and using zchunked rpms instead

Neal Gompa <ngompa13@xxxxxxxxx> · Sat, 17 Nov 2018 14:36:49 -0500

On Sat, Nov 17, 2018 at 1:15 PM Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
>
> Neal, thanks so much for your thoughts on this.  Responses inline:
>
> On Sat, 2018-11-17 at 09:53 -0500, Neal Gompa wrote:
> <snip>
> > If we're really considering changing the RPM file format, then we need
> > a proper discussion on rpm-maint@ and rpm-ecosystem@ mailing lists on
> > rpm.org. Can you please start a targeted discussion there?
>
> Sure.
>
> > But addressing the specific concrete suggestion here, there are a few
> > concerns I have:
> >
> > 1. This is a huge format break, which means that for the first time in
> > a _very_ long time, it would not be possible to reuse RHEL for Fedora
> > infrastructure _at all_. That's going to be a difficult problem.
> > There's a large legacy of systems that won't be able to handle that
> > new format, and unfortunately, rpm is not parallel installable in the
> > same manner as something like GCC or Python currently. Making it
> > parallel installable *is* possible (I've done it, and there have been
> > other attempts before), but it's not a supported thing. This is
> > probably the thing that would trigger a major version bump for RPM,
> > since it's a new archive format.
>
> Agreed that this would be a massive format change and should therefore
> be a major version bump for RPM.  New versions of RPM should still be
> able to read and install old-format rpms, but, as you point out, old
> versions of RPM won't be able to read or install new-format rpms.
> Unfortunately, I don't see any way around this.
>

I don't think there's a way around it either. I just hope we do better
than the last time someone tried to do this...

> > 2. This also means the _entire_ ecosystem of RPM archive parsers will
> > break. This is not particularly insurmountable, actually, as the RPM
> > file format was not particularly well documented, and a new format is
> > an opportunity to revisit some of those old issues and try to do a
> > better job this go around. But it's still a challenge to deal with.
>
> Yes, this is going to be quite a bit of work.
>
> > 3. When you refer to the rpm cpio, I assume you're referring to only
> > the archive payload, right? Typically the payload is what is
> > compressed, and the headers are not. It sounds like you're proposing
> > both aspects to be compressed, and compressed differently. If we made
> > the RPM header an uncompressed zchunk stream and the RPM payload a
> > zstd-compressed zchunk stream, would we be able to support fetching
> > header deltas for retrieving extra information on the fly? Say, for
> > example, attributes like arch color, filecap properties, and so on,
> > that aren't in the rpm-md data for things like transaction tests
> > without the whole RPM?
>
> Yes, I'm referring to the archive payload as the cpio.  The zchunk
> format supports the idea of separate data streams, and I was planning
> to use that to put the headers in one stream and the archive payload in
> another.  If the header chunks are first in the zchunk file, then they
> could be read without needing to read any of the rest of the file.
> And, yes, we could make the header stream uncompressed if that made it
> easier to parse.
>

Whether it's compressed or not isn't terribly important, but what is
important is being able to validate the correctness before beginning
any processing, including decompression.
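
As a rough sketch of that ordering: the digest is checked against the
bytes as stored, and the decompressor only runs if the check passes.
This is not zchunk's actual layout or API. The function name, the
choice of SHA-256, and zlib standing in for zstd are all assumptions
made purely for illustration.

```python
import hashlib
import zlib


def verify_then_decompress(compressed: bytes, expected_sha256: str) -> bytes:
    """Check the digest of the bytes as stored, and only then decompress.

    Hypothetical helper: zlib stands in for zstd, and SHA-256 is an
    assumed digest, not necessarily what a real format would use.
    """
    actual = hashlib.sha256(compressed).hexdigest()
    if actual != expected_sha256:
        # Reject before the decompressor ever sees untrusted input.
        raise ValueError("chunk digest mismatch; refusing to decompress")
    return zlib.decompress(compressed)


# Made-up data for the example.
payload = b"example header data" * 4
stored = zlib.compress(payload)
digest = hashlib.sha256(stored).hexdigest()
```

With something like this, a corrupted or tampered chunk is rejected
without any of it being fed to the decompressor.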

> > 4. I'd actually rather make it easier for the header streams to be
> > fetched instead of trying to make specific attributes easier in the
> > header payload. History has shown that any attempt at foresight here
> > tends to fail miserably, and common attributes are already specified
> > in the rpm-md primary.xml anyway, so if you're fetching the header to
> > retrieve an attribute, you *need* to do something weird anyway.
>
> The main purpose of putting separate attributes in the zchunk header is
> so programs like 'file' can determine some basic information about an
> rpm without needing to parse the full rpm header.  This data would also
> be in the rpm header, so programs that read the rpm header wouldn't
> care about the attributes in the zchunk header.
>

I see, so some simple hints for stuff like that? But that would still
require awareness of the format to some degree. I guess we'd have a
specific lead magic to let tools know to look for them...

> > 5. I'm not exactly sure what you mean by zchunk signing...
>
> The zchunk format supports signing, but just for the zchunk header.
> Because the header contains the checksums for each chunk, this
> establishes a chain of trust for verifying the whole file.  Which
> brings me to...
>
> > 6. I'm wondering why we can't do a perfect reconstruction of the
> > original RPM, given two RPM sources that are both zchunked? We can
> > pull it off with repodata, so what's different about RPM that makes
> > that not doable?
>
> The problem is that, unlike the repodata, once an rpm is installed, the
> package file is deleted and the data is only available on the system in
> its uncompressed installed form.  If we're trying to use that data to
> rebuild an rpm, we have two options.
>
>    1. Compress the data using the same method that was used to create the
>       original rpm.  This is what applydeltarpm does, and is why it's so
>       heavy on the CPU.
>    2. Store the data uncompressed in the rebuilt rpm.  This isn't feasible
>       with deltarpm, but, if we store both compressed hashes and
>       uncompressed hashes in the zchunk header, we can do this in zchunk.
>       When checking the signature, zchunk verifies the header
>       against the signature first, and then checks each chunk to see if it
>       passes *either* the compressed or uncompressed signature check.
>
> I hope this makes my thought process on this part clearer.
>

Yeah, that makes sense...
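
For anyone following along, the "either hash" check could look roughly
like the sketch below. It assumes the header has already been verified
against its signature, so the digests it lists can be trusted. The
names `ChunkDigests` and `chunk_is_valid`, the use of SHA-256, and zlib
standing in for zstd are all hypothetical; this is not zchunk's real
API or header layout.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkDigests:
    """Digests for one chunk, as listed in an already-verified header."""
    compressed: str
    uncompressed: str


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def chunk_is_valid(chunk: bytes, digests: ChunkDigests) -> bool:
    """Accept the chunk if it matches either trusted digest.

    A chunk downloaded from a mirror should match the compressed digest;
    a chunk rebuilt from installed files should match the uncompressed one.
    """
    return sha256_hex(chunk) in (digests.compressed, digests.uncompressed)


# Made-up data: the same content as shipped (compressed) and as installed.
original = b"file contents as installed on disk\n" * 8
as_shipped = zlib.compress(original)
digests = ChunkDigests(
    compressed=sha256_hex(as_shipped),
    uncompressed=sha256_hex(original),
)
```

The point is that a rebuilt rpm never needs to be recompressed to
verify: the raw installed data passes on the uncompressed digest, which
is what avoids the CPU cost of applydeltarpm's first option.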

-- 
真実はいつも一つ！/ Always, there's only one truth!