Re: Proposal: Faster composes by eliminating deltarpms and using zchunked rpms instead

John Reiser <jreiser@xxxxxxxxxxxx> · Thu, 22 Nov 2018 07:36:48 -0800

On 2018-11-16, Jonathan Dieter wrote:
For reference, this is in reply to Paul Frields's email about lifecycle
objectives, specifically focusing on problem statement #1[1].

<tl;dr>
Have rpm use zchunk as its compression format, removing the need for
deltarpms, and thus reducing compose time.  This will require changes
to both the rpm format and new features in the zchunk format.
</tl;dr>

[1]:
https://fedoraproject.org/wiki/Objectives/Lifecycle/Problem_statements#Challenge_.231:_Faster.2C_more_scalable_composes

Currently a compose takes a minimum of around 8.5 hours ([1] and others);
the goal is 1 hour.  The goal is particularly relevant during the last
phase of a Fedora release cycle (after code freeze) when each successive
compose contains only a few .rpms that have changed from the previous
compose, and the question-of-the-hour is whether some particular bug
actually was fixed.  In this case deltarpms can be ignored.
The goal also is relevant to a future of CI (Continuous Integration)
that has automated gating of changes depending on successful tests
of the entire compose ("Does it boot and pass the test cases?").
Again, deltarpms can be ignored.

Please show some measurements which support the belief
that using zchunk will reduce compose time dramatically,
whether by eliminating deltarpms or by other effects.

Did you view
    https://www.youtube.com/watch?v=kW7oz_zbSD0
    "Flock 2018 - Improving Fedora Compose process" (Aug. 8, 2018; 55 min)?
They do present (coarse) measurements.  The overwhelming
conclusion is that 8.5 hours is a data flow problem, both
large-grain (moving .rpms and other files across the network)
and small-grain (extracting the desired information from
the header of an .rpm that uses data compression).

The number one request that I heard in the recorded session
was for faster access to fields in the header of an .rpm
that uses data compression.  This is slow today because the
header+tail are compressed together as if they were a single logical
stream, and the code retrieves and de-compresses the whole .rpm in
order to access just the header.  However, both xz (liblzma) and gzip
(zlib) let the caller stop decompressing after generating N bytes of
output (by limiting the output space, avail_out, given to the decoder);
why not use this?  N can be known, or over-estimated, or iteratively
(and incrementally) approximated until it covers the entire header.
To make de-compression of the rpm header even easier,
call the xz compressor twice: once on the header, once on the tail.
The concatenation of the two compressed streams decodes transparently
as one by default, but the boundary is visible if you look for it,
just as with concatenated gzip members.
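
A minimal sketch of the "stop after N bytes" idea, using Python's
standard lzma and zlib bindings rather than the C libraries directly.
The 4-byte length prefix is a made-up layout for illustration only; it
is NOT the real rpm header format, and the function names are
hypothetical.

```python
import lzma
import struct
import zlib


def first_n_bytes_gzip(compressed: bytes, n: int) -> bytes:
    """Return at most n bytes of output from a gzip-wrapped stream."""
    d = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip wrapper
    return d.decompress(compressed, n)  # second argument is max_length


def read_header_xz(compressed: bytes) -> bytes:
    """Read a length-prefixed header without decoding the tail.

    ASSUMPTION (hypothetical layout): the uncompressed data starts with
    a 4-byte big-endian length, followed by that many header bytes.
    """
    d = lzma.LZMADecompressor()
    # First pass: produce only the 4 bytes that say how big N is.
    prefix = d.decompress(compressed, max_length=4)
    (hlen,) = struct.unpack(">I", prefix)
    if hlen == 0:
        return b""
    # Second pass: the decompressor kept the unread input, so passing
    # b"" continues from where it stopped, for at most hlen more bytes.
    return d.decompress(b"", max_length=hlen)
```

Here N is learned incrementally: decode a few bytes, find out how many
more are needed, decode exactly that many, and never touch the rest.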

In effect the "directory" feature of zchunk can be implemented
for the special case of header-vs-tail (using either xz (liblzma)
or gzip (zlib)) without disturbing other clients of .rpms.
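
The header-vs-tail split could be sketched like this, again with
Python's standard lzma module.  The names pack() and
read_first_stream() are hypothetical, and the byte strings merely stand
in for a real rpm header and payload.

```python
import io
import lzma


def pack(header: bytes, tail: bytes) -> bytes:
    """Compress header and tail as two separate, concatenated .xz streams."""
    return lzma.compress(header) + lzma.compress(tail)


def read_first_stream(f, chunk_size: int = 4096) -> bytes:
    """Decode only the first .xz stream of a binary file object.

    Reading stops as soon as the decoder reports the end of the first
    stream, so at most one chunk beyond the header is fetched.
    """
    d = lzma.LZMADecompressor()
    parts = []
    while not d.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            raise EOFError("input ended before the end of the first stream")
        parts.append(d.decompress(chunk))
    return b"".join(parts)


# Usage, with an in-memory file standing in for an .rpm on disk:
blob = pack(b"header fields", b"payload " * 1000)
assert read_first_stream(io.BytesIO(blob)) == b"header fields"
# Clients that do not care about the split still see one logical stream:
assert lzma.decompress(blob) == b"header fields" + b"payload " * 1000
```

lzma.decompress() walks all concatenated streams, while a single
LZMADecompressor stops at the first end-of-stream marker; that is the
"transparent by default but visible if you look for it" property.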
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx