Re: Proposal: Faster composes by eliminating deltarpms and using zchunked rpms instead

Jonathan Dieter <jdieter@xxxxxxxxx> · Sat, 17 Nov 2018 21:49:06 +0000

On Sat, 2018-11-17 at 14:36 -0500, Neal Gompa wrote:
> On Sat, Nov 17, 2018 at 1:15 PM Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
> > Neal, thanks so much for your thoughts on this.  Responses inline:
> > 
> > On Sat, 2018-11-17 at 09:53 -0500, Neal Gompa wrote:
> > <snip>
> > > If we're really considering changing the RPM file format, then we need
> > > a proper discussion on rpm-maint@ and rpm-ecosystem@ mailing lists on
> > > rpm.org. Can you please start a targeted discussion there?
> > 
> > Sure.
> > 
> > > But addressing the specific concrete suggestion here, there's a few
> > > concerns I have:
> > > 
> > > 1. This is a huge format break, which means that for the first time in
> > > a _very_ long time, it would not be possible to reuse RHEL for Fedora
> > > infrastructure _at all_. That's going to be a difficult problem.
> > > There's a large legacy of systems that won't be able to handle that
> > > new format, and unfortunately, rpm is not parallel installable in the
> > > same manner as something like GCC or Python currently. Making it
> > > parallel installable *is* possible (I've done it, and there have been
> > > other attempts before), but it's not a supported thing. This is
> > > probably the thing that would trigger a major version bump for RPM,
> > > since it's a new archive format.
> > 
> > Agreed that this would be a massive format change and should therefore
> > be a major version bump for RPM.  New versions of RPM should still be
> > able to read and install old-format rpms, but, as you point out, old
> > versions of RPM won't be able to read or install new-format rpms.
> > Unfortunately, I don't see any way around this.
> > 
> 
> I don't think there's a way around it either. I just hope we do better
> than the last time someone tried to do this...

+1

> > > 2. This also means the _entire_ ecosystem of RPM archive parsers will
> > > break. This is not particularly insurmountable, actually, as the RPM
> > > file format was not particularly well documented, and a new format is
> > > an opportunity to revisit some of those old issues and try to do a
> > > better job this go around. But it's still a challenge to deal with.
> > 
> > Yes, this is going to be quite a bit of work.
> > 
> > > 3. When you refer to the rpm cpio, I assume you're referring to only
> > > the archive payload, right? Typically the payload is what is
> > > compressed, and the headers are not. It sounds like you're proposing
> > > both aspects to be compressed, and compressed differently. If we made
> > > the RPM header an uncompressed zchunk stream and the RPM payload a
> > > zstd-compressed zchunk stream, would we be able to support fetching
> > > header deltas for retrieving extra information on the fly? Say, for
> > > example, attributes like arch color, filecap properties, and so on,
> > > that aren't in the rpm-md data for things like transaction tests
> > > without the whole RPM?
> > 
> > Yes, I'm referring to the archive payload as the cpio.  The zchunk
> > format supports the idea of separate data streams, and I was planning
> > to use that to put the headers in one stream and the archive payload in
> > another.  If the header chunks are first in the zchunk file, then they
> > could be read without needing to read any of the rest of the file.
> > And, yes, we could make the header stream uncompressed if that made it
> > easier to parse.
> > 
> 
> Whether it's compressed or not isn't terribly important, but what is
> important is being able to validate the correctness before beginning
> any processing, including decompression.

Absolutely!  This includes both the rpm header and the rpm archive
data, and that's why we store both the compressed and uncompressed
checksums of the chunks.
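
As a rough sketch of how that check could look (Python, not the
actual zchunk code): zlib stands in for zstd, SHA-256 stands in for
whatever checksum type the file declares, and the names
make_index_entry and verify_chunk are made up for illustration.  The
point is that the stored bytes are hashed *before* any decompression
happens, and a chunk is accepted if it matches either recorded
checksum.

```python
import hashlib
import zlib


def sha256(data: bytes) -> str:
    """Hex digest used as the per-chunk checksum (assumed type)."""
    return hashlib.sha256(data).hexdigest()


def make_index_entry(chunk: bytes) -> dict:
    """Record both checksums for one chunk, as the index would."""
    compressed = zlib.compress(chunk)  # stand-in for zstd
    return {
        "compressed": sha256(compressed),
        "uncompressed": sha256(chunk),
    }


def verify_chunk(stored: bytes, entry: dict) -> bytes:
    """Validate the stored bytes, then return the uncompressed data.

    The bytes on disk may be the original compressed chunk (from a
    download) or the raw data (rebuilt from installed files).
    """
    digest = sha256(stored)  # hash first, decompress only if trusted
    if digest == entry["compressed"]:
        return zlib.decompress(stored)
    if digest == entry["uncompressed"]:
        return stored
    raise ValueError("chunk matches neither checksum")
```

Here, verify_chunk(zlib.compress(data), entry) and
verify_chunk(data, entry) both return data, while anything else is
rejected without ever reaching the decompressor.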

> > > 4. I'd actually rather make it easier for the header streams to be
> > > fetched instead of trying to make specific attributes easier in the
> > > header payload. History has shown that any attempt at foresight here
> > > tends to fail miserably, and common attributes are already specified
> > > in the rpm-md primary.xml anyway, so if you're fetching the header to
> > > retrieve an attribute, you *need* to do something weird anyway.
> > 
> > The main purpose of putting separate attributes in the zchunk header is
> > so programs like 'file' can determine some basic information about an
> > rpm without needing to parse the full rpm header.  This data would also
> > be in the rpm header, so programs that read the rpm header wouldn't
> > care about the attributes in the zchunk header.
> > 
> 
> I see, so some simple hints for stuff like that? But that would still
> require awareness of the format to some degree. I guess we'd have a
> specific lead magic to let tools know to look for them...

Yeah, the code would be maybe a hundred lines, max, that could be
copylib'd into file, etc.

> > > 5. I'm not exactly sure what you mean by zchunk signing...
> > 
> > The zchunk format supports signing, but just for the zchunk header.
> > Because the header contains the checksums for each chunk, this
> > establishes a chain of trust for verifying the whole file.  Which
> > brings me to...
> > 
> > > 6. I'm wondering why we can't do a perfect reconstruction of the
> > > original RPM, given two RPM sources that are both zchunked? We can
> > > pull it off with repodata, so what's different about RPM that makes
> > > that not doable?
> > 
> > The problem is that, unlike the repodata, once an rpm is installed, the
> > package file is deleted and the data is only available on the system in
> > its uncompressed installed form.  If we're trying to use that data to
> > rebuild an rpm, we have two options.
> > 
> >    1. Compress the data using the same method that was used to create the
> >       original rpm.  This is what applydeltarpm does, and is why it's so
> >       heavy on the CPU.
> >    2. Store the data uncompressed in the rebuilt rpm.  This isn't feasible
> >       with deltarpm, but, if we store both compressed hashes and
> >       uncompressed hashes in the zchunk header, we can do this in zchunk.
> >       When checking the signature, zchunk verifies the header
> >       against the signature first, and then checks each chunk to see if it
> >       passes *either* the compressed or uncompressed checksum check.
> > 
> > I hope this makes my thought process on this part clearer.
> > 
> 
> Yeah, that makes sense...

Great!  Thanks again for looking at this.

Jonathan