On 29 Jul 2003, Aleksander Demko wrote:

> On Mon, 2003-07-28 at 16:17, Robert G. Brown wrote:
> ...
> > wget is pretty simple as well, but you have to tell it to descend
> > recursively to an appropriate depth or use the --mirror option.
> > Something like:
> >
> >   wget --mirror http://whatever.repository.youlike.org -o /tmp/mirror_log
>
> Yeah, that doesn't work quite right. Without parameter fiddling, I get
> non-repository files (.html, etc) as well as it may go UP the url and
> continue to suck down. I didn't realize this until I pulled down 3+ gig
> from Duke... had stuff like 7.x updates too (useless to me, as we have no
> <8 machines)... luckily we get like a megabyte a second from you guys.

I know you have, but RFM a few more times (it is pretty long and
complicated :-). I recall that there are options for controlling whether
and how far it ascends and descends when recursing. I also wasn't clear
-- you probably wanted something more like:

  wget --mirror \
    http://whatever.repository.youlike.org/pub/linux/distro_9 \
    -o /tmp/mirror_log

which doesn't mirror the WHOLE repository, but only the distro_9 part.
IIRC it will do something moderately horrible with the path on your
mirror site -- you might get ./pub/linux/distro_9. It doesn't really
behave like a cp.

> Also, I don't think wget has the guts to actually remove files that are
> no longer on the repository.

Ya, wget is an adequate tool but hardly sparkly or exciting, because it
uses httpd itself to deliver the files. That is, it doesn't really mirror
anything -- it is a very specialized scripted browser that connects to a
server and retrieves every file it finds, recursively, in a tree. Of
course, a lot of the "files" it finds could be active/cgi files. The best
it can do is save whatever it was presented with, which is probably not
the cgi source. Not a copy at all -- more like a browser "save" feature,
recursive, with path, and as you note conservative to a fault.

> > should do it.
> > You have to look to see if rsync works for each
> > repository you might want to mirror. Where it works it is "better". It
> > is also reasonable to ask permission before mirroring regularly from a
> > public repository that doesn't already grant it openly. Some sites have
> > spare bandwidth and a public-spiritedness, others don't.
>
> I've never used rsync, but I don't think I can use it here. Our heavily
> DMZ'ed public http server can really only do HTTP requests, and even
> those I back tunnel over ssh to a proxy. I think I'm restricted to
> HTTP-pure mirroring techniques.

rsync is actually by far the preferred tool. It is designed to do
precisely what you need (synchronize two images, perfectly), efficiently
(copying compressed images of just what has changed), and safely (you can
select whether or not to delete files that are no longer in the images
being sync'd).

Whether or not they support it on the repository you're trying to mirror
is a policy issue, of course, and you may or may not have any control or
voice there, but it is certainly worth opening a discussion with the
owners and asking for it. Here are the arguments:

rsync on top of ssh CAN be fully authenticated with really strong
authentication. Anonymous rsync can use rsh, ssh, rsync(d) itself as a
server, or rsyncd as transport for a web proxy. Of these, ssh is
extremely strong host/user-level authentication and fully access
controllable -- the issue there isn't whether or not ssh is a secure
mechanism, it is whether or not they'll give >>you<< ssh access. This
usually depends on who you are and so forth.

I have no idea how strong rsyncd is as a secure transport mechanism, but
for unauthenticated anonymous access to a selected tree it is probably
secure, barring stupidity in setting up the tree and the eternal
possibility of e.g. buffer overflow attacks in any daemon listening on an
open port.
It does have a slew of options for authenticated access (including
host/domain authentication), chroot on the provided tree, and so forth,
and is likely comparable to httpd itself in overall security.

Web proxy is weak/stupid authentication in cleartext and hence probably
not a great idea in any event, either to support wget or to support
rsyncd. I used one for the first time yesterday and learned to my chagrin
that it doesn't run on top of ssl, which means that used over broadband
networks it is just an open invitation to password snoops. Nobody
(intelligent) permits telnet or rsh access anymore because password
snooping used to be the number one security risk of nearly any unixoid
LAN. Somehow web proxies have escaped that, but they should and will
follow unless they are ssl-ified so that no cleartext passwords are ever
used.

The decision to support (anonymous or other) rsync access is a serious
one, of course, but lots of very paranoid repositories permit rsync one
way or another -- sometimes several ways, for different parts of the
tree. Even Seth permits it, sometimes, and we have to regularly medicate
him so that he doesn't jump on people passing in the hall and beat them
with a sucker rod while screaming "Crackers! Crackers! Stay away from my
servers!" (which in the South is likely to be misunderstood :-).

[In truth, a standalone webserver that is properly backed up and
monitored isn't that big a deal if it IS cracked -- shut it down, restore
it, bring it up -- as little as an hour of downtime total, provided of
course that you can determine and shut down the cracker's point of
entry... at worst a momentary embarrassment.]

In fact, if you can talk the owners of the repository into giving you ssh
access (because you are a systems person and because your
rsync-maintained mirror REDUCES load on their server) you don't NEED any
sort of daemon -- I use rsync as a command line tool on top of ssh for
all of my quotidian mirroring needs:

  rsync -avz host:path .
to get all changed files, verbosely and compressed in transit, while
keeping local files that no longer exist on the source, or

  rsync -avz --delete host:path .

to make a "perfect" mirror, including deleting any files that have been
removed. rsync rocks. Note that you also need --rsh=ssh, or to set
RSYNC_RSH to ssh.

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@xxxxxxxxxxxx
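
For anyone who wants the two approaches from this thread side by side,
here is a minimal sketch. It only prints the commands rather than running
them. The flags are real wget and rsync options, but the URL, the
host:path source, the --cut-dirs count, and the helper names
wget_mirror_cmd and rsync_mirror_cmd are illustrative assumptions, not
anything a particular repository is known to support.

```shell
#!/bin/sh
# Sketch only: URL, host, paths, and function names are placeholders.

# Print a wget command that mirrors one subtree over plain HTTP.
#   --no-parent    never ascend above the starting directory
#                  (the trailing slash on the URL marks it as a directory)
#   -nH            do not create a local directory named after the host
#   --cut-dirs=N   drop the first N remote path components locally
#   -R PATTERN     discard generated index pages once links are followed
wget_mirror_cmd() {
    url=$1
    cut=$2
    log=$3
    printf 'wget --mirror --no-parent -nH --cut-dirs=%s -R "index.html*" -o %s %s\n' \
        "$cut" "$log" "$url"
}

# Print an rsync-over-ssh command. Pass "delete" as the third argument
# to also remove local files that have vanished from the source.
rsync_mirror_cmd() {
    src=$1
    dest=$2
    mode=${3:-keep}
    opts='-avz --rsh=ssh'
    if [ "$mode" = delete ]; then
        opts="$opts --delete"
    fi
    printf 'rsync %s %s %s\n' "$opts" "$src" "$dest"
}

# Example: show both commands for the hypothetical distro_9 subtree.
wget_mirror_cmd http://whatever.repository.youlike.org/pub/linux/distro_9/ 3 /tmp/mirror_log
rsync_mirror_cmd host:path . delete
```

The wget form stays inside the distro_9 subtree and flattens the local
path, but it still cannot remove files that have disappeared upstream;
only the rsync form with --delete does that. Adding --dry-run to the
rsync command first shows what would be transferred or deleted without
touching anything.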