On Sat, Oct 11, 2003 at 02:00:50PM -0400, Ryan Tomayko wrote:
> On Mon, 2003-10-06 at 08:16, Michael Stenner wrote:
> > I have a version of urlgrabber that supports urlopen (returns a file
> > object) and urlread (returns a string containing the file contents),
>
> Any word on this? I've been working on include= functionality and was
> hoping to use urlgrabber.urlopen as a drop-in replacement for
> urllib.urlopen().

Yes.  I just pushed up 0.2 now.  This includes:

  urlopen -- returns a file object
  urlread -- returns a string containing file contents

Both of these support throttling and progress meters, which required
that I get a bit tricky.  (See URLGrabberFileObject.)  It also now
attempts to preserve the timestamps of grabbed files.

> I'm in the process of working urlgrabber in as-is but have run into a
> couple slight problems. They're not so much problems as they are
> annoyances. For instance, with urllib.urlopen() I get a file object that
> I simply discard when I'm done reading. When I try to work urlgrabber
> into the mix, I'm having to allocate a temp file, grab the file, open
> the file object, and then get rid of the temp file.

Yes, you can simply use urlopen from urlgrabber now.

> Another nice thing about the file object returned by urllib.urlopen() is
> that it has a geturl() method. With urlgrabber, I'm having to create a
> (url, file object) tuple to keep that same info around.

The geturl() method is supported by urlgrabber's file objects as well.
(A bug in keepalive, fixed in 0.2, kept it from working quite right,
so you'll need to grab the keepalive.py from 0.2 as well.)

> Again, these are not really big issues at all, it just leads to a bit
> more code right now. Given all that, I have a couple of questions about
> urlgrabber.urlopen().

> * Will urlopen return a real file object or a urllib file object?
>   (I'm really just looking for a fileobject.geturl() method).

This is a metaphysical question, really :)  The bottom line is that
.geturl() is supported.  In Python there's not really such a thing as
a "real file object".  Sure, you could define objects returned by
file() as the "real" thing, but all that's really important is that an
object support the correct methods and attributes.  Currently,
urlgrabber returns a urllib file object, a urllib2 file object, or a
URLGrabberFileObject, depending on the circumstances.  The last is
used if you are doing throttling or progress meters with a urlopen or
urlread.

> * Will it support all the nice regrab functionality of urlgrabber?

If you do retrygrab, it certainly will.  urlopen and urlread currently
do not support retrying.  I could probably make urlread support it,
but making urlopen do it would be a touch harder.  What do you do if
the socket drops half-way through the file?

> * Will the file object returned require cleanup? i.e. Is it safe to read
>   it and move on or will code using urlopen be required to clean up
>   a temp file or something?

urlopen and urlread do not create temp files.  This is deliberate and
has many reasons.  One big one is security.  The only real cleanup is
that you should do fo.close() when you're done reading.  I doubt
anything really catastrophic would happen if you don't do this, but I
know the progress bar uses this, for example.

> * Oh yea.. Where do you plan on committing? It looks like HEAD is now
>   2.1.0 development. I've been working against dailies but the include=
>   stuff should move smoothly into the 2.1.0 code.

urlgrabber is getting moved out of yum.  The plan is that it will be
developed in its own cvs, and occasionally get copied into yum (and
other projects).  I'll put it in HEAD shortly.  For now, just drop it
in your own working directory.
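
To make the duck-typing, close() and retry points above a bit more
concrete, here is a rough sketch.  It is NOT urlgrabber's code: the
names URLFileWrapper, read_with_retry, make_flaky_opener and the
on_close hook are made up for illustration, the syntax is modern
Python 3, and the "network" is simulated with an in-memory buffer.  It
only shows why an object with read(), geturl() and close() is file
enough, and why retrying a whole read is easier than retrying an
already-open stream (you can just start over).

```python
import io
import time


class URLFileWrapper:
    """Hypothetical file-like wrapper; not URLGrabberFileObject."""

    def __init__(self, fo, url, on_close=None):
        self._fo = fo              # any object with read() and close()
        self._url = url            # remembered so geturl() can report it
        self._on_close = on_close  # e.g. a hook to finish a progress meter
        self.closed = False

    def read(self, size=-1):
        return self._fo.read(size)

    def geturl(self):
        return self._url

    def close(self):
        # Safe to call more than once; the hook fires only the first time.
        if not self.closed:
            self.closed = True
            self._fo.close()
            if self._on_close is not None:
                self._on_close()


def read_with_retry(opener, url, tries=3, delay=0.0):
    """Read the whole resource, starting from scratch after an IOError."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    last_error = None
    for attempt in range(tries):
        try:
            fo = opener(url)
            try:
                return fo.read()
            finally:
                fo.close()
        except IOError as exc:
            last_error = exc
            if attempt + 1 < tries:
                time.sleep(delay)
    raise last_error


def make_flaky_opener(payload, failures):
    """Build a fake opener that raises IOError `failures` times, then works."""
    state = {"left": failures}

    def opener(url):
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("simulated dropped connection")
        return URLFileWrapper(io.BytesIO(payload), url)

    return opener
```

A caller would use the wrapper just like any other file object (read
it, ask it for geturl(), close it), and would hand read_with_retry an
opener function plus a URL.  Doing the same for an open stream is
harder because the caller may already have consumed part of the data
when the connection drops.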

You can grab the new urlgrabber (no pun intended) from here:

  http://www.linux.duke.edu/projects/mini/urlgrabber/dist/

					-Michael

I will be posting a second email about the future of urlgrabber
shortly, which you may want to read :)

-- 
  Michael Stenner                       Office Phone: 919-660-2513
  Duke University, Dept. of Physics       mstenner@xxxxxxxxxxxx
  Box 90305, Durham N.C. 27708-0305