[Yum] [UG] parallelizing downloading

Michael Stenner writes:

> I'm actually open to picking this issue up again if there's interest.
> We imagined a "batch grabber" which would be a grabber wrapper object
> much like the mirrorgroup stuff, but which would take a list of files
> (perhaps a queue for pipelining applications) and go to town on them.
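
For concreteness, here is a minimal sketch of what such a batch grabber
might look like: a small thread pool fed from a queue of (url,
destination) pairs.  The names and the plain-urllib fetch are purely
illustrative, not the real urlgrabber interface:

import queue
import threading
import urllib.request

def batch_grab(files, nworkers=4):
    """Fetch a list of (url, dest) pairs with a small worker pool.
    A hypothetical sketch, not urlgrabber's actual batch API."""
    work = queue.Queue()
    for url, dest in files:
        work.put((url, dest))

    def worker():
        while True:
            try:
                url, dest = work.get_nowait()
            except queue.Empty:
                return
            # any urlgrab-style fetch would do in place of this
            urllib.request.urlretrieve(url, dest)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()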

It might be good to first determine the "advantage" of parallel
downloads and the number of systems that would benefit.  Presumably,
parallel downloads are "good" because they make things run a bit faster
in cases where one is bandwidth-limited on the server side (so the
local NIC sits idle when that time could be invested in other
connections).

However, I'll wager that MOST users of yum are either on a reasonably
fast local network connection directly to a professionally administered
server with plenty of server-side bandwidth (so that, shared or not,
little benefit accrues from parallelization), or they are sitting at
the end of e.g. a DSL or cable link with a peak bandwidth of a few
hundred KB/sec.  In this latter case there is ALSO little benefit in
most cases, as one is most likely bandwidth-throttled at one's own end
of the connection and not at the server's.  Parallelization might well
actually slow one down, as handling four connections in parallel
generally takes LONGER than the same four connections serialized if the
servers aren't resource constrained.
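
Back-of-the-envelope arithmetic makes the point.  If the client link is
the bottleneck, N parallel streams just split the same pipe, so the
last byte arrives no sooner than it would serially (the numbers below
are illustrative only):

# Illustrative: four 10 MB packages over a 300 KB/s client link.
n_files, size_kb, link_kbps = 4, 10 * 1024, 300

serial_s = n_files * size_kb / link_kbps      # one stream at full rate
parallel_s = size_kb / (link_kbps / n_files)  # four streams sharing the
                                              # pipe all finish together
print(serial_s, parallel_s)  # ~136.5 s either way, before the extra
                             # per-connection overhead of the parallel case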

I'd argue that the ones who >>will<< benefit are mostly edge cases --
people running systems that aren't in a professionally administered LAN
with its own local mirror(s), that are resource starved (and can't
afford e.g. the disk to just do a local mirror, sync it up in off-peak
hours, and be done with it), and that use toplevel repositories hammered
enough to underserve even DSL bandwidth rather than a mirror with more
outgoing capacity.  Are there enough of them that this is a serious
issue worth the hassle?  Aren't there enough alternative solutions that
they should be considering already?

This kind of scaling/performance analysis is the downside to
parallelization in general.  Parallelization only "improves" things if
the critical bottlenecks are in just the right places.  Otherwise it
actually slows things down and wastes resources.  Most users will not
be sophisticated enough to know which category they are in (and the
category can change during a single download, as a server swings from
overloaded back to underloaded).  This means that you have to make your
tool smart in direct proportion to the ignorance of your users: stop
transferring in parallel if a single site is in fact already saturating
your incoming bandwidth, smoothly continue with other pending download
transactions if a site momentarily slows, be careful not to
resource-starve any particular connection, and so on.  Bleah.
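
A sketch of the sort of "smart" control loop that implies -- add
streams only while each one measurably raises aggregate throughput,
back off otherwise.  Every name and threshold here is invented for
illustration; nothing like this exists in yum:

def tune_parallelism(measure_throughput, start_stream, stop_stream,
                     max_streams=4, gain_threshold=1.10, interval=5.0):
    """Grow the stream count only while each new stream raises
    aggregate throughput by at least 10%; otherwise back off."""
    streams = 1
    start_stream()
    baseline = measure_throughput(interval)  # KB/s over `interval` sec
    while streams < max_streams:
        start_stream()
        streams += 1
        rate = measure_throughput(interval)
        if rate < baseline * gain_threshold:
            # the pipe (ours or the server's) is already saturated;
            # the extra connection bought too little, so drop it
            stop_stream()
            streams -= 1
            break
        baseline = rate
    return streams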

And in the end, you will USUALLY save at most a few seconds, at the
risk of actually taking longer.  Serialized connections/downloads are
so much simpler, and their performance limitations are easy to
understand and work around.



All of these suggestions (parallelization, BT) seem to be directed at a
relatively few systems, to enable them to make efficient use of one or
another "critical" resource or to avoid some particular bottleneck --
in particular, either client-side bandwidth or local disk consumption.

I honestly think that the issue of client-side and server-side bandwidth
(and the efficient utilization of same) needs to be reexamined in light
of the brave new world of huge client disks.  Disk is now SO cheap that
yum could actually benefit from having an automirror mode that could be
enabled by e.g.

automirror_path=/yum_mirror/fc$releasever/$basearch
automirror=1
automirror_all=1

in a repo file.  If automirror=1, running "yum update mirror" would
cause yum to rsync the baseurl to the local path.  If automirror_all
were 1, it would do the whole thing (the right thing to do for base and
updates) and make a true mirror; otherwise it would just do installed
files and deliberately NOT do the whole repository (the right thing to
do for just a few packages snarfed from an "extras" repository,
especially one that largely is redundant with one of the base or updates
repositories in your primary trusted url paths).
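
In code, a "yum update mirror" built on those options might amount to
little more than variable expansion plus an rsync call.  A hypothetical
sketch (option names from the example above; it assumes an
rsync-capable baseurl, e.g. rsync://..., and takes the installed
package list as an argument):

import subprocess

def update_mirror(repo, installed_pkgs=()):
    """Hypothetical handler for "yum update mirror" on one repo stanza.
    `repo` is a dict of the options above; `installed_pkgs` lists the
    package filenames already installed from this repo."""
    if not repo.get("automirror"):
        return
    path = repo["automirror_path"]  # after $releasever/$basearch expansion
    if repo.get("automirror_all"):
        # automirror_all=1: pull the whole tree, making a true mirror
        subprocess.run(["rsync", "-av", "--delete",
                        repo["baseurl"] + "/", path], check=True)
    else:
        # otherwise fetch only installed packages, NOT the whole repo
        for pkg in installed_pkgs:
            subprocess.run(["rsync", "-av",
                            repo["baseurl"] + "/" + pkg, path],
                           check=True)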

For repositories that were automirrored, all other yum activities would
be addressed out of the local mirror, if possible, without checking the
remote repository at all.  So a yum update wouldn't look at the remote
site at all, and neither would a yum list.  Presumably one would add
"yum update mirror" to the nightly yum cron job, run well before yum
update.  It would be especially nice if "yum update mirror" had some
way of proxying, so that one could update a mirror of an internally
trusted repo from outside the trusted network with the right
credentials.
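
Pointing yum at the mirror first could be as simple as rewriting the
effective baseurl to a file:// path whenever the mirror directory
exists -- a sketch, again hypothetical rather than yum's actual
resolution logic:

import os

def effective_baseurl(repo):
    """Prefer the local automirror over the network when present."""
    path = repo.get("automirror_path")
    if path and os.path.isdir(path):
        return "file://" + path  # serve yum entirely from the mirror
    return repo["baseurl"]       # otherwise fall back to the remote repo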

This is just what lots of us have implemented anyway, as Sean noted,
except that we don't use "yum commands" to do the mirroring/syncing and
we all have to write our own scripts to accomplish pretty much the same
thing.  This is a lot of duplicated work, which is usually a pretty good
signal that it is time to become a "feature" instead of a custom-crafted
add-on.

Note well that functionally, this is only a LITTLE more than what yum
does now, except that instead of caching rpms it would put the rpms
into the local mirror path and "complete" the yummification of the
result into a repository according to the given path (and could
probably replace /var/cache/yum with a suitable symlink in the
process).  In a home DSL-bottlenecked LAN, running this on a single
host and exporting it to the other hosts via NFS or httpd as an actual
repo would be even more conservative of bandwidth and even better for
performance.
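
For the other hosts on such a LAN, consuming the exported mirror is
just an ordinary repo stanza pointing at the mirror host (the hostname
and path below are made up):

[local-mirror]
name=Local FC$releasever mirror
baseurl=http://mirrorhost.example.lan/yum_mirror/fc$releasever/$basearch
enabled=1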

This sort of thing is likely, in my opinion, to have far greater
positive impact on bandwidth bottlenecks than either parallelization or
BT.  Nothing beats local repository mirrors for performance and ease of
use.  Installing them also reduces server load on the toplevel servers
and is a socially conscious thing to do so that there is more bandwidth
left to serve the edge cases that can't or don't do their own mirrors.
Currently, devoting even 10 GB to a local mirror (especially a local
mirror that is then exported to a local LAN) costs what, $5 or $10?
Divided by the number of hosts that use it in a LAN of any sort?  

This is just a no brainer in most environments, although sure there are
going to be laptops or old PCs where disk is still a "scarce resource".
The main reason that EVERYBODY doesn't do this (or nearly so) is that it
isn't "easy" to do a local mirror without knowing something or following
a fairly complex set of directions.  

I personally think that it isn't terribly unreasonable to EXPECT yum to
run more slowly and inconveniently on resource-starved older systems.
Everything else does too, after all -- this is why we call them
"resource-starved older systems".  Rewriting a complex tool to be even
more complex to provide a small marginal benefit for those systems seems
like a poor use of scarce developer time.

     rgb