Michael Stenner writes:
> I'm actually open to picking this issue up again if there's interest.
> We imagined a "batch grabber" which would be a grabber wrapper object
> much like the mirrorgroup stuff, but which would take a list of files
> (perhaps a queue for pipelining applications) and go to town on them.

It might be good to first determine the "advantage" of parallel downloads and the number of systems that would benefit. Presumably, parallel downloads are "good" because they make things run a bit faster in cases where one is bandwidth-limited at the server side (so your local NIC is spending idle time that could be invested in other connections).

However, I'll wager that MOST users of yum are either on a reasonably local network connection directly to a professionally administered server with plenty of server-side bandwidth (so that, shared or not, little benefit accrues from the parallelization), or they are sitting at the end of e.g. a DSL or cable link with a peak bandwidth of a few hundred KB/sec. In this latter case there is ALSO little benefit in most cases, as one is most likely bandwidth-throttled at one's own end of the connection and not at the server side. Parallelization might well actually slow one down, as handling four connections in parallel generally takes LONGER than the same four connections serialized if the servers aren't resource constrained.

I'd argue that the ones who >>will<< benefit are mostly edge cases -- people running systems that aren't in a professionally administered LAN with its own local mirror(s), that are resource starved (and can't afford e.g. the disk to just do a local mirror, sync it up in off-peak hours and be done with it), and that use toplevel repositories that ARE hammered hard enough to underserve even DSL bandwidth, instead of a mirror with more outgoing capacity. Are there enough of them that this is a serious issue and worth the hassle? Aren't there enough alternative solutions that they should be considering already?

This kind of scaling/performance analysis is the downside of parallelization in general. Parallelization only "improves" things if the critical bottlenecks are in just the right places; otherwise it actually slows things down and wastes resources. Most users will not be sophisticated enough to know which category they are in (which could even change in time during a single download as a server goes from IMMEDIATELY being overloaded back to IMMEDIATELY being underloaded).

This means that you have to make your tool smart in direct proportion to the ignorance of your users: stop doing a network transfer in parallel if a single site is in fact already saturating your incoming bandwidth, smoothly continue with other pending download transactions if the site momentarily slows, be careful not to resource-starve any particular connection, etc. Bleah. And in the end, you will USUALLY save a few seconds at most, at the risk of actually taking longer. Serialized connections/downloads are so much simpler, and their performance limitations are easy to understand and work around.

All of these suggestions (parallelization, BT) seem to be directed at relatively few systems, to enable them to make efficient use of one or another "critical" resource or avoid some particular bottleneck -- in particular, either client-side bandwidth or local disk consumption.
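For reference, the naive version of the wrapper Michael describes isn't much code. A minimal sketch, using urlgrabber's urlgrab() -- the BatchGrabber name, the queue handling and the default pool of four worker threads are purely illustrative, not anything that exists in urlgrabber today:

# Hypothetical "batch grabber" sketch -- NOT part of urlgrabber or yum.
# Worker threads pull (url, filename) pairs off a queue and fetch them
# with urlgrab(); failures are collected rather than aborting the batch.
import threading
import Queue
from urlgrabber import urlgrab
from urlgrabber.grabber import URLGrabError

class BatchGrabber:
    def __init__(self, nworkers=4):
        self.nworkers = nworkers          # illustrative default
        self.jobs = Queue.Queue()
        self.failures = []

    def add(self, url, filename):
        self.jobs.put((url, filename))

    def _worker(self):
        while 1:
            try:
                url, filename = self.jobs.get_nowait()
            except Queue.Empty:
                return
            try:
                urlgrab(url, filename)
            except URLGrabError, e:
                self.failures.append((url, e))

    def run(self):
        workers = [threading.Thread(target=self._worker)
                   for i in range(self.nworkers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return self.failures

The hard part is everything this sketch deliberately leaves out: noticing when a single stream is already saturating the downstream link, backing off, keeping the other pending transfers moving when one site stalls, and not starving any particular connection -- which is exactly the complexity being argued about above.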
I honestly think that the issue of client-side and server-side bandwidth (and the efficient utilization of same) needs to be reexamined in light of the brave new world of huge client disks. Disk is now SO cheap that yum could actually benefit from having an automirror mode that could be enabled by e.g.

  automirror_path=/yum_mirror/fc$releasever/$basearch
  automirror=1
  automirror_all=1

in a repo file. If automirror=1, running "yum update mirror" would cause yum to rsync the baseurl to the local path. If automirror_all were 1, it would do the whole thing (the right thing to do for base and updates) and make a true mirror; otherwise it would just do installed files and deliberately NOT do the whole repository (the right thing to do for just a few packages snarfed from an "extras" repository, especially one that is largely redundant with one of the base or updates repositories in your primary trusted url paths).

For repositories that were automirrored, all other yum activities would be addressed out of the local mirror where possible, without checking the remote repository at all. So a yum update wouldn't look at the remote site at all, and neither would a yum list. Presumably one would add "yum update mirror" to the nightly yum cron, run well before yum update. It would be especially nice if "yum update mirror" had some way of proxying so that one could update a mirror of an internally trusted repo from outside the trusted network with the right credentials.

This is just what lots of us have implemented anyway, as Sean noted, except that we don't use "yum commands" to do the mirroring/syncing and we all have to write our own scripts to accomplish pretty much the same thing (see the P.S. below for roughly what such a script boils down to). This is a lot of duplicated work, which is usually a pretty good signal that it is time to become a "feature" instead of a custom-crafted add-on.

Note well that functionally, this is only a LITTLE more than what yum does now, except that instead of caching rpms it would put the rpms into the local mirror path and "complete" the yummification of the result into a repository at the given path (and could probably replace /var/cache/yum with a suitable symlink in the process). In a home DSL-bottlenecked LAN, running this on a single host and exporting it to the other hosts via NFS or httpd as an actual repo would be even more conservative of bandwidth and enhancing of performance.

This sort of thing is likely, in my opinion, to have far greater positive impact on bandwidth bottlenecks than either parallelization or BT. Nothing beats local repository mirrors for performance and ease of use. Installing them also reduces load on the toplevel servers, which is a socially conscious thing to do, so that there is more bandwidth left to serve the edge cases that can't or don't do their own mirrors.

Currently, devoting even 10 GB to a local mirror (especially a local mirror that is then exported to a local LAN) costs what, $5 or $10? Divided by the number of hosts that use it in a LAN of any sort? This is a no-brainer in most environments, although sure, there are going to be laptops or old PCs where disk is still a "scarce resource".

The main reason that EVERYBODY doesn't do this (or nearly so) is that it isn't "easy" to do a local mirror without knowing something or following a fairly complex set of directions. I personally think that it isn't terribly unreasonable to EXPECT yum to run more slowly and inconveniently on resource-starved older systems.
Everything else does too, after all -- this is why we call them "resource-starved older systems". Rewriting a complex tool to be even more complex to provide a small marginal benefit for those systems seems like a poor use of scarce developer time.

   rgb
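P.S. For concreteness, here is roughly what the home-grown mirror scripts mentioned above tend to boil down to, as a Python sketch. The mirror hostname, the rsync source paths, the local /yum_mirror layout and the choice of running createrepo afterwards are all illustrative assumptions about a typical setup, not anything yum itself provides:

# Hypothetical nightly mirror script -- not a yum feature.  The rsync
# sources and local paths are made-up examples of an FC4 layout.
import os
import sys

MIRRORS = {
    # local mirror path               rsync source to pull from
    "/yum_mirror/fc4/i386/os":      "rsync://mirror.example.com/fedora/core/4/i386/os/",
    "/yum_mirror/fc4/i386/updates": "rsync://mirror.example.com/fedora/core/updates/4/i386/",
}

for localpath, source in MIRRORS.items():
    if not os.path.isdir(localpath):
        os.makedirs(localpath)
    # --delete keeps the local tree an exact copy of the remote one.
    rc = os.system("rsync -av --delete %s %s/" % (source, localpath))
    if rc != 0:
        print >> sys.stderr, "rsync of %s failed, skipping createrepo" % source
        continue
    # Rebuild the repodata so the tree can be used directly as a yum baseurl.
    os.system("createrepo %s" % localpath)

The other hosts on the LAN then just point a repo file at the mirror host, e.g. baseurl=http://mirrorhost/yum_mirror/fc4/i386/os/ over httpd, or a file:// URL on an NFS mount -- the export arrangement described above.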