[Yum] [UG] parallelizing downloading

Brian Long writes:

> On Thu, 2005-06-30 at 19:59 -0400, Robert G. Brown wrote:
>> Brian Long writes:
>> 
>> > Even on 100BaseT or Gigabit LANs inside the same DC, parallel downloads
>> > reduce the time it takes a sysadmin to patch their linux host.  When we
>> 
>> Impossible, 
> 
> I understand what you're saying and I apologize for the mistake.  The
> only time we've seen parallelization improve things is over the WAN.
> Because of 65ms latency between RTP, NC and San Jose, CA, we can multi-
> stream downloads faster than we can perform a single-stream download.
> 
> For example, with our current OC-12, a single host might get 3MB/sec
> single stream between the two destinations, but that same host can get
> 4+MB/sec with 2 streams, 5+MB/sec with 3 streams, etc.
> 
> When we had a T3 between RTP and SJ 5 years ago, FTP's throughput was
> something like 300KB/sec single stream, but I could get 1-2MB/sec with
> multiple streams between the same two hosts.  Altering TCP windows
> didn't seem to help (this was on Solaris).

All of this makes perfect sense (although the FTP thing is a bit bizarre
-- I'd guess that the host was deliberately set up with an FTP choke of
some sort).  As I said, if you are using multiple servers and have hosts
waiting on one server while another still has unused capacity (at
whatever network speed), you can see a significant speedup.  This is
quite possible over a WAN but "shouldn't" really happen within a LAN.
It shouldn't have happened for the FTP streams from the same server,
either, but who knows what evil lurks in the heart of such servers and
their often occult configurations?
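To put rough numbers on it: per-stream TCP throughput is capped at
roughly window/RTT, so with 65 ms of round trip each stream tops out
well below the pipe, while on a sub-millisecond LAN the same ceiling is
far above wire speed.  A back-of-the-envelope sketch (the 64 KiB window
and the 0.5 ms LAN round trip are assumed illustrative defaults, not
measurements from this thread):

    # Why extra TCP streams help over a high-latency WAN but not on a
    # LAN: each stream is capped at roughly window / RTT.

    def stream_ceiling(window_bytes, rtt_seconds):
        """Rough per-stream throughput ceiling, in bytes per second."""
        return window_bytes / float(rtt_seconds)

    window = 64 * 1024      # assumed default TCP window (64 KiB)
    rtt_wan = 0.065         # RTP <-> San Jose RTT from this thread
    rtt_lan = 0.0005        # assumed switched-LAN round trip

    print("WAN: %.0f KB/s per stream"
          % (stream_ceiling(window, rtt_wan) / 1024))
    print("LAN: %.0f MB/s per stream"
          % (stream_ceiling(window, rtt_lan) / 1024 ** 2))

With that window each WAN stream's ceiling is around 1 MB/s, so extra
streams keep adding throughput until the link itself fills; on the LAN
the single-stream ceiling is already far past 100BaseT wire speed, so
extra streams just split the same pipe.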

> I was incorrect in mentioning the LAN + parallelization.  I meant to
> state that over a WAN, it helps extraordinarily.  :-)  Considering we
> plan to keep our server farm in SJ and cache the content globally, I
> guess parallelization isn't a must.  It would be nice when we have
> development hosts in RTP point to our SJ yum repos, though.

But simpler still is to just mirror your SJ repos on an internal
developer host in RTP.  Cisco can afford the $100 to buy you a dedicated
200 GB disk to mirror every repo known to man, and in the long run
running rsync periodically (say 1-4x a day, or more often if there is
aggressive development you want to stay current with) is far more
conservative of bandwidth through a slow link than even parallelizing
individual connections back to the possibly hammered primary sites.  For
example, updating five or six hosts from SJ through a slow intermediary
link would take much longer than updating the local mirror from SJ once
and then updating locally from the mirror.  Although if you cache
(depending on just HOW you cache) it will probably have the same effect
-- a mirror is just a "persistent cache" from a certain point of view.

I personally prefer the mirror (persistent preloaded cache) idea because
it works even if the dynamic cache isn't loaded or if the intermediate
network is slow or down -- there are days when some of the links in
Atlanta in the network path from my house (2 miles from Duke) to my
office (yes, it is 16 hops routed through Atlanta, go figure) are very
slow -- often when a new virus has surfaced and is in that
bandwidth-sucking bloom stage when nobody is patched.  The WAN can then
actually underperform my Bronze DSL connection (by a factor of 10 or so)
and even working via a terminal is a PITA and bouncy.  On those days I
cherish having a local mirror of FC 2 and FC 3 at home.  My laptop has
enough room on it that one day pretty soon I'm going to mirror FC
(probably linux@duke 4 plus extensions) on IT, which will really help me
out with the proxy thing -- just authenticating connections to get to a
repo through a proxy link is a PITA.

It sounds like there may be work afoot to make yum capable of
automagically building a local repo/"supercache".  This would be a great
boon and would basically make all of this sort of hassle a moot thing.
If you have the disk (typically <10 GB to get "everything and the
kitchen sink too") you can clone the whole set of repos that you use in
your install, or there are all sorts of ways to clone "relevant subsets"
so you can just get the base and updates plus a local "extras" that
grabs this and that from the various dag-like sites out there or from
your organization's private add-on repo.
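In the meantime the idea can be faked by hand along these lines --
harvest what yum has already downloaded into a local tree and regenerate
the metadata.  The cache glob, local path, and use of createrepo below
are my assumptions about a reasonable setup, not anything yum does for
you today:

    #!/usr/bin/env python
    # Rough "supercache" sketch: copy already-fetched rpms into a local
    # repo and rebuild the metadata that yum clients read.
    import glob
    import os
    import shutil
    import subprocess

    CACHE_GLOB = "/var/cache/yum/*/packages/*.rpm"  # yum's download cache
    REPO_DIR = "/var/www/localrepo/"                 # placeholder local repo

    if not os.path.isdir(REPO_DIR):
        os.makedirs(REPO_DIR)
    for rpm in glob.glob(CACHE_GLOB):
        shutil.copy(rpm, REPO_DIR)
    subprocess.call(["createrepo", REPO_DIR])        # regenerate repodata/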

Along with the "automate the handling of keys" thing so num-nums can
manage them instead of turning them off, I think that this would be an
ab-fab thing to do next.  Both of these are real hassles for real
administrators of real networks and would be popular with many people at
all levels, and the automagic would (in my opinion) SERIOUSLY reduce the
load on toplevel servers, as people would rapidly build a diffuse
bittorrent-like cloud of repo clones (there's a scary idea -- bring on
the repo clones:-) so that relatively few "yum update"s actually hit the
original distro sites for rpms or even header data.
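For the key half of that wish, even a dumb loop over a drop directory
would beat turning gpgcheck off -- something like the following, where
the directory is purely hypothetical and rpm --import does the actual
work:

    #!/usr/bin/env python
    # Import every GPG key dropped into one directory so signature
    # checking can stay enabled.
    import glob
    import subprocess

    KEY_DIR = "/etc/pki/rpm-gpg-extra/"   # hypothetical drop directory

    for key in glob.glob(KEY_DIR + "*"):
        subprocess.call(["rpm", "--import", key])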

    rgb

> 
> /Brian/
> -- 
>        Brian Long                      |         |           |
>        IT Data Center Systems          |       .|||.       .|||.
>        Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
>        Phone: (919) 392-7363           |   C i s c o   S y s t e m s
> 
