[Yum] [UG] parallelizing downloading

Brian Long writes:

> Even on 100BaseT or Gigabit LANs inside the same DC, parallel downloads
> reduce the time it takes a sysadmin to patch their linux host.  When we

Impossible, IF the servers aren't overloaded or incorrectly
organized.  You have a certain amount of bandwidth into a host and out
of the server.  If these two are matched, parallel downloads aren't a
net win.  Only if a) the rpms are split up among multiple servers and
b) the server for one part of the rpm set needed by the hosts is
overloaded is there a net reduction in the time required to install
multiple hosts, and in that case equivalent reductions can generally be
obtained by reorganizing the servers so that one server doesn't sit idle
while another works.
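
To make the arithmetic concrete, here is a back-of-the-envelope sketch
in python (all sizes and link speeds below are made up for
illustration):

    # On one saturated link, total time is total bytes over link
    # bandwidth no matter how the transfers are interleaved.
    link_Bps = 100e6 / 8              # 100BaseT link, in bytes/sec
    sizes = [5e6, 20e6, 1e6, 50e6]    # hypothetical package sizes

    # Serial: the per-file times simply add up.
    serial = sum(size / link_Bps for size in sizes)

    # Parallel: the streams share the same pipe, so the last byte of
    # the last file arrives at exactly the same moment.
    parallel = sum(sizes) / link_Bps

    print(serial, parallel)           # identical, modulo protocol overhead

Run it and both numbers come out the same; the pipe, not the number of
connections, sets the floor.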

> have farms of linux hosts, reduction in patching time is a huge
> productivity gain.  Consider the fact that we have RPM'ized Oracle 9i.
> While the 1.2GB Oracle server RPM is getting downloaded, it sure would
> be nice if 4 other packages were getting downloaded at the same
> time.  :-)

Only if there is idle time on the client, where a single server is
serving multiple hosts while another server sits idle, forming a true
bottleneck (wasting server bandwidth).  In NO case will an organization
that keeps all its servers fully loaded during a serialized install work
more slowly.  If your task organization is poor, then splitting the
client load among N servers (each able to provide the same set of
files) and doing it serialized will always complete the TOTAL job
slightly faster than any client-side parallelization.  This matters
in a LAN (where you control the servers).  In a WAN, of course, where
server load and redundancy are beyond your control, there can be an
advantage to parallelization.
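
Here is the same point as a toy model of a whole farm (a sketch only;
the bandwidths, counts and byte totals are invented):

    # Makespan can never beat total bytes / aggregate server bandwidth,
    # and any schedule that keeps every server busy reaches that bound.
    server_Bps = 100e6 / 8     # bytes/sec per server
    n_servers = 4
    total_bytes = 40e9         # a farm's worth of rpms to push out

    ideal = total_bytes / (n_servers * server_Bps)

    # Poor task organization: one server carries half the bytes and
    # keeps working long after the other three have gone idle.
    skewed = max(0.5 * total_bytes / server_Bps,
                 0.5 * total_bytes / ((n_servers - 1) * server_Bps))

    print(ideal, skewed)       # 800 s vs 1600 s

No client-side trick recovers the second case; rebalancing the servers
does.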

Remember, the LAN SERVERS are ALREADY de facto parallelized.  They
provide files to N hosts in parallel whenever the hosts are connected
and requesting files, and they already use as much as 100% of their
bandwidth (less delays due to CPU processing of the requests).  You
simply cannot get all the files installed any faster than the servers
can provide them working at full network capacity all the time, and
getting that to happen is a matter of task organization, not
parallelization of connections, especially on the client side.
Insisting on client-side parallelization masks carelessness in setting
up your servers.

There are lots of ways to set things up in a LAN environment so that the
servers are always 100% loaded.  Most of them will be MORE efficient
than parallelizing the clients, since dedicating particular servers to
particular package sets means that THOSE servers will be idle for at
least part of the install period unless you precisely balance the server
load (difficult when parallelizing CLIENT-SIDE activity).  Yum is
already server-parallelized to some extent: if you set up multiple
servers, limit the number of simultaneous connections per server, and
have fallback sets of server URLs, you should be able to keep a server
farm working at (close to) 100% during any sort of multiclient install
or update.
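
For instance, a repo stanza along these lines (hostnames invented;
check the failovermethod documentation for your yum version) spreads
client load across a mirror set and falls back automatically:

    [updates]
    name=LAN update mirrors
    failovermethod=roundrobin
    baseurl=http://mirror1.example.com/updates/
            http://mirror2.example.com/updates/
            http://mirror3.example.com/updates/

roundrobin scatters clients randomly across the list, so no single
mirror gets hammered while the others idle; priority would walk the
list in order instead.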

You cannot do better than 100% utilization on the SERVER side, as this
is the fundamental bottleneck.  You can easily do worse on the client
side.  It is easy to lose track of this because of course any given
client will appear to be idle waiting for servers at any given time, but
if all the servers are running flat out, who cares?  You cannot do any
better without adding more servers and/or bandwidth.

Do you understand what I'm saying, here, or should I state it in more
detail?

> We are deploying yum repos on load-balanced web servers and we're also
> planning to use existing Cisco Content Engines across the globe to cache
> our content.  Parallel downloads would be very nice in our environment.

If the web servers load balance, each has the entire repository set you
are installing from, and all of them are (due to SERVER-side
parallelization and load balancing) running at peak capacity, you won't
be able to install any faster with client-side parallel downloads.  It's
a matter of simple arithmetic.  The best you can do, timewise, is
(bytes to be installed)/(network capacity in bytes per second).  Any
scheme that really DOES load balance and keeps your network running at
capacity into the clients will yield very much the same amount of time,
and parallelizing both server side and client side is redundant at best
and at worst will actually not work the way you think it will.
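
Plug in the 1.2GB Oracle rpm quoted above (the link speed and the other
package sizes here are invented) and the floor falls right out:

    link_Bps = 100e6 / 8            # 100 Mbit/sec into the client
    oracle = 1.2e9                  # the Oracle server rpm
    others = 4 * 30e6               # four more packages, sizes made up

    print((oracle + others) / link_Bps)   # ~106 seconds, best case

Four extra parallel streams alongside the Oracle download change which
bytes are in flight, not how many bytes have to cross the wire.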

   rgb
> 
> /Brian/
> -- 
>        Brian Long                      |         |           |
>        IT Data Center Systems          |       .|||.       .|||.
>        Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
>        Phone: (919) 392-7363           |   C i s c o   S y s t e m s
> 
> _______________________________________________
> Yum mailing list
> Yum@xxxxxxxxxxxxxxxxxxxx
> https://lists.dulug.duke.edu/mailman/listinfo/yum
