Re: Announcing DNF 3 development

Nico Kadel-Garcia <nkadel@xxxxxxxxx> · Thu, 22 Mar 2018 18:12:43 -0400

On Thu, Mar 22, 2018 at 5:49 PM, John Reiser <jreiser@xxxxxxxxxxxx> wrote:
> On 03/22/2018 01:51 PM, Nico Kadel-Garcia wrote:
>>
>> On Thu, Mar 22, 2018 at 10:52 AM, John Reiser <jreiser@xxxxxxxxxxxx>
>> wrote:
>>>
>>> On 03/22/2018 05:40 AM, Daniel Mach wrote:
>>>>
>>>>
>>>> We are pleased to announce that development of DNF 3 has started. This
>>>> version is focused on performance improvements, new API and
>>>> consolidating
>>>> the whole software management stack.
>>>
>>>
>>>
>>> How does RPM fit into DNF's view of "the whole software management
>>> stack"?
>>> RPM is a slug (moves very slowly): no parallelism (at any point all
>>> packages
>>> with no remaining predecessors could be updated/installed in parallel),
>>> not even manually pipelined (decompress to memory, manipulate filesystem,
>>> update database.)
>>
>>
>> Parallelizing software updates or installations would be *begging* for
>> pain. It would be difficult for me to recommend strongly enough
>> against this.
>
>
> Please be specific about the pain points that you fear.

RPM, itself, is single threaded.

%pre and %post operations would have to be re-evaluated for
parallelization. System account creation, in particular, would have to
be made thread safe.

RPM installation can fail partway through deployment due to SELinux,
disk space, or network-based mount point failure: keeping it single
threaded makes it much safer to unravel failed or partial RPM
installation.

Unweaving partial dependency deployment could be quite destructive
with a parallelized approach.

Daemons that need to be restarted and may have incompatible component
updates, such as httpd with its modules, are particularly vulnerable to
fascinating failures from the daemon restarting with only some updated
components. Avoiding that would seem to require even more dependency
management for RPM installation, rather than each update itself
triggering an update.

> The three-stage "manual" pipeline achieves 2x to 3x faster throughput
> with error states that are isomorphic to present RPM.  (Consider the
> Turing machine model: if you don't write to the filesystem, then
> there is no change of external state.)

Turing machines don't have to deal with all the possible

> The "parallelize everything that has no remaining predecessors" strategy
> requires parallel transactions in the database (they cannot interfere
> because that would be a predecessor constraint) and checking for
> resource exhaustion (file space, inodes, etc.) as a global
> predecessor constraint.  What else?

Parallelizing the installations means losing the milestones at which
one update has succeeded, and the second update has not. Unweaving
that to find out which update triggered the failure sounds like pain,
and makes testing the update process more difficult. It becomes
difficult to manage or guess what the state of the system was at the
time of the RPM update, since another RPM update may be in progress at
the time.
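
To make the milestone concern concrete, here is a minimal Python sketch.
It is not RPM or DNF code: the package names and the dependency map are
made up for illustration, and install_waves() is a hypothetical helper.
It groups packages into "waves", where each wave holds every package
whose predecessors are already installed, which is roughly what the
"no remaining predecessors" strategy would run at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def install_waves(deps):
    """Group packages into waves of mutually independent installs.

    deps maps each package to the set of packages that must be
    installed before it (its predecessors).
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has no remaining predecessors.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as installed
    return waves


# Hypothetical dependency map, for illustration only.
deps = {
    "httpd": {"apr", "openssl-libs"},
    "mod_ssl": {"httpd", "openssl-libs"},
    "apr": set(),
    "openssl-libs": set(),
}

waves = install_waves(deps)

# A serial install flattens the waves: after each package there is a
# well-defined milestone ("everything before this succeeded").
serial_order = [pkg for wave in waves for pkg in wave]
```

In the serial order, a failure points at exactly one package. In the
first wave of this example, two packages would be in flight together,
so a failure there leaves the state of the other one open to question.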

There is an infamous quote by Donald Knuth that "premature
optimization is the root of all evil". There are systems that benefit
from the time savings of parallelization, but for ordinary RPM
installations and system updates, I think that the slow update time is
because of other factors, such as disk IO, RPM database updates, and
download times for the repodata and the packages.