Re: Proposal: Faster composes by eliminating deltarpms and using zchunked rpms instead

Neal Gompa <ngompa13@xxxxxxxxx> · Sat, 17 Nov 2018 09:53:00 -0500

On Fri, Nov 16, 2018 at 6:03 PM Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
>
>
> *Changes*
> The zchunk format would need to be extended to allow for a zchunked rpm
> to contain both the uncompressed chunks that were already on the local
> system and the newly downloaded compressed chunks while still passing
> signature verification.  This would also require moving signature
> verification to zchunk.
>
> The rpm file format has to be changed because the zchunk header needs
> to be at the beginning of the file in order for the zchunk library to
> figure out which chunks it needs to download.  My suggestions for
> changes to the rpm file format are as follows:
>
>    1. Signing should be moved to the zchunk format as described at the
>       beginning of this section
>    2. The rpm header should be stored in one stream inside the zchunk
>       file.  This allows it to be easily extracted separately from the
>       data
>    3. The rpm cpio should be stored in a second stream inside the zchunk
>       file.
>    4. At minimum, an optional zchunk element should be set to identify
>       zchunk rpms as rpms rather than regular zchunk files.  If desired,
>       optional elements could also be set containing %{name}, %{version},
>       %{release}, %{arch} and %{epoch}.  This would allow this information
>       to be read easily without needing to extract the rpm header stream.
>
> *Final notes*
> I realize this is a massive proposal, zchunk is still very young, and
> we're still working on getting the dnf zchunk pull requests reviewed.
> I do think it's feasible and provides an opportunity to eliminate a
> pain point from our compose process while still reducing the download
> size for our users.
>

If we're really considering changing the RPM file format, then we need
a proper discussion on rpm-maint@ and rpm-ecosystem@ mailing lists on
rpm.org. Can you please start a targeted discussion there?

But addressing the specific concrete suggestion here, there are a few
concerns I have:

1. This is a huge format break, which means that for the first time in
a _very_ long time, it would not be possible to reuse RHEL for Fedora
infrastructure _at all_. That's going to be a difficult problem.
There's a large legacy of systems that won't be able to handle that
new format, and unfortunately, rpm is not parallel installable in the
same manner as something like GCC or Python currently. Making it
parallel installable *is* possible (I've done it, and there have been
other attempts before), but it's not a supported thing. This is
probably the thing that would trigger a major version bump for RPM,
since it's a new archive format.

2. This also means the _entire_ ecosystem of RPM archive parsers will
break. This is not insurmountable, actually, as the RPM file format
was never particularly well documented, and a new format is an
opportunity to revisit some of those old issues and try to do a
better job this time around. But it's still a challenge to deal with.

3. When you refer to the rpm cpio, I assume you're referring to only
the archive payload, right? Typically the payload is what is
compressed, and the headers are not. It sounds like you're proposing
both aspects to be compressed, and compressed differently. If we made
the RPM header an uncompressed zchunk stream and the RPM payload a
zstd-compressed zchunk stream, would we be able to support fetching
header deltas to retrieve extra information on the fly, without
downloading the whole RPM? I'm thinking of attributes like arch color,
filecap properties, and so on, that aren't in the rpm-md data but are
needed for things like transaction tests.

4. I'd actually rather make it easier for the header streams to be
fetched instead of trying to make specific attributes easier in the
header payload. History has shown that any attempt at foresight here
tends to fail miserably, and common attributes are already specified
in the rpm-md primary.xml anyway, so if you're fetching the header to
retrieve an attribute, you *need* to do something weird anyway.

5. I'm not exactly sure what you mean by zchunk signing...

6. I'm wondering why we can't do a perfect reconstruction of the
original RPM, given two RPM sources that are both zchunked? We can
pull it off with repodata, so what's different about RPM that makes
that not doable?
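
To make points 3 and 6 concrete, here is a toy Python sketch of what I
have in mind. It is not the zchunk format or its API: the two-stream
layout, the fixed 8-byte chunks, zlib standing in for zstd, and every
function name below are assumptions made up for illustration. It also
identifies chunks by the checksum of their uncompressed bytes, which
sidesteps the signing question rather than answering it.

```python
import hashlib
import zlib

CHUNK_SIZE = 8  # unrealistically small, so the demo data spans several chunks


def split_chunks(data, size=CHUNK_SIZE):
    """Cut a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(chunk):
    """Identify a chunk by the SHA-256 of its uncompressed bytes."""
    return hashlib.sha256(chunk).hexdigest()


def build_index(streams):
    """Return (index, store) for a dict of stream name -> bytes.

    index lists (stream name, chunk digest) in file order; store maps each
    digest to the compressed chunk a server would hand out.
    """
    index, store = [], {}
    for name, data in streams.items():
        for chunk in split_chunks(data):
            key = digest(chunk)
            index.append((name, key))
            store[key] = zlib.compress(chunk)
    return index, store


def local_chunks_of(streams):
    """Collect the uncompressed chunks already present on the local system."""
    return {digest(c): c for data in streams.values() for c in split_chunks(data)}


def reconstruct(index, local, remote, only_stream=None):
    """Rebuild the bytes described by index, fetching only missing chunks.

    Returns (data, number of chunks fetched). With only_stream set, chunks
    belonging to other streams are skipped entirely.
    """
    parts, fetched = [], 0
    for name, key in index:
        if only_stream is not None and name != only_stream:
            continue
        if key in local:
            chunk = local[key]  # reuse what we already have
        else:
            chunk = zlib.decompress(remote[key])  # simulated download
            fetched += 1
        if digest(chunk) != key:
            raise ValueError("chunk checksum mismatch in stream " + name)
        parts.append(chunk)
    return b"".join(parts), fetched


# Hypothetical old and new builds of the same package.
old_rpm = {"header": b"name=foo version=1.0", "payload": b"AAAAAAAABBBBBBBBCCCCCCCC"}
new_rpm = {"header": b"name=foo version=1.1", "payload": b"AAAAAAAABBBBBBBBDDDDDDDD"}

index, server = build_index(new_rpm)

# Point 6: rebuild the new package from old local chunks plus missing ones.
rebuilt, fetched = reconstruct(index, local_chunks_of(old_rpm), server)
print(rebuilt == new_rpm["header"] + new_rpm["payload"], fetched)

# Point 3: fetch only the header stream, with nothing available locally.
header_only, header_fetched = reconstruct(index, {}, server, only_stream="header")
print(header_only, header_fetched)
```

In this toy model the rebuilt bytes are identical to the new package
because every chunk is checked against the index before it is used, so
whether a chunk came from disk or from the network makes no difference
to the result. Whether the same holds for a real zchunked rpm depends
on what exactly the signature covers, which is what I'd like spelled
out.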

-- 
真実はいつも一つ！/ Always, there's only one truth!