[Yum] Re: urlgrabber.urlopen()

mstenner at phy.duke.edu (Michael Stenner) · Sat Oct 11 14:54:52 2003

On Sat, Oct 11, 2003 at 02:00:50PM -0400, Ryan Tomayko wrote:
> On Mon, 2003-10-06 at 08:16, Michael Stenner wrote: 
> > I have a version of urlgrabber that supports urlopen (returns a file
> > object) and urlread (returns a string containing the file contents),
> 
> Any word on this? I've been working on include= functionality and was
> hoping to use urlgrabber.urlopen as a drop in replacement for
> urllib.urlopen(). 

Yes.  I just pushed up 0.2, which includes:

  urlopen    --  returns a file object
  urlread    --  returns a string containing file contents

  Both of these support both throttling and progress meters, which
  required that I get a bit tricky.  (See URLGrabberFileObject)

  It also now attempts to preserve the timestamps of grabbed files.

> I'm in the process of working urlgrabber in as-is but have run into a
> couple slight problems. They're not so much problems as they are
> annoyances. For instance, with urllib.urlopen() I get a file object that
> I simply discard when I'm done reading. When I try to work urlgrabber
> into the mix, I'm having to allocate a temp file, grab the file, open
> the file object, and then get rid of the temp file. 

Yes, you can simply use urlopen from urlgrabber now.

> Another nice thing about the file object returned by urllib.urlopen() is
> that it has a geturl() method. With urlgrabber, I'm having to create a
> (url, file object) tuple to keep that same info around. 

The geturl() method is supported by urlgrabber's file objects as well.
(There was a bug in keepalive that kept it from working quite right.
That is fixed now, so you'll need to grab keepalive.py from 0.2 as well.)

> Again, these are not really big issues at all, it just leads to a bit
> more code right now. Given all that, I have a couple of questions about 
> urlgrabber.urlopen().

> * Will urlopen return a real file object or a urllib file object?
>   (I'm really just looking for a fileobject.geturl() method).

This is a metaphysical question, really :)  The bottom line is that
.geturl() is supported.  In python there's not really such a thing as
a "real file object".  Sure, you could define objects returned by
file() as the "real" thing, but all that's really important is that an
object support the correct methods and attributes.

Currently, urlgrabber returns either a urllib, urllib2 or
URLGrabberFileObject file object depending on the circumstances.  The
latter is used if you are doing throttling or progress meters with a
urlopen or urlread.
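
To give a rough idea of how a wrapper like this can work, here is a
minimal sketch.  It is not the actual URLGrabberFileObject code: the
class name ProgressFile, the callback signature, and the use of
io.BytesIO as a stand-in for a network response are all assumptions
made for illustration.  The wrapper delegates to an underlying file
object, reports the running byte count on each read, and still offers
read(), geturl() and close().

```python
import io


class ProgressFile:
    """Hypothetical file-like wrapper that reports bytes read so far."""

    def __init__(self, fo, url, callback):
        self._fo = fo              # underlying object; only read/close are needed
        self._url = url
        self._callback = callback  # called with the running byte count
        self._count = 0

    def read(self, size=-1):
        data = self._fo.read(size)
        if data:
            self._count += len(data)
            self._callback(self._count)
        return data

    def geturl(self):
        return self._url

    def close(self):
        self._fo.close()


# Stand-in for a network response: 10 bytes read in chunks of 4.
seen = []
fo = ProgressFile(io.BytesIO(b"x" * 10), "http://example.com/f", seen.append)
while fo.read(4):
    pass
fo.close()
```

Code that only calls read(), geturl() and close() cannot tell this
object apart from any other file object, which is the point made above.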

> * Will it support all the nice regrab functionality of urlgrabber?

If you do retrygrab, it certainly will.  urlopen and urlread currently
do not support retrying.  I could probably make urlread support it,
but making urlopen do it would be a touch harder.  What do you do if
the socket drops half-way through the file?
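
The asymmetry can be sketched as follows.  Retrying a whole-content
read is simple because nothing reaches the caller until the read
succeeds, whereas a file object has already handed out part of the
data by the time the socket drops.  The helper read_with_retry, its
opener argument, and the flaky_opener test double are hypothetical
names for this sketch, not part of urlgrabber.

```python
import io


def read_with_retry(opener, url, retries=3):
    """Read the whole resource, starting over if a transfer fails.

    `opener` is any callable returning a file-like object.
    `retries` must be at least 1.
    """
    last_error = None
    for _ in range(retries):
        try:
            fo = opener(url)
            try:
                return fo.read()  # all-or-nothing: partial data is discarded
            finally:
                fo.close()
        except IOError as e:
            last_error = e
    raise last_error


# Test double: fails on the first attempt, succeeds on the second.
attempts = []


def flaky_opener(url):
    attempts.append(url)
    if len(attempts) < 2:
        raise IOError("connection dropped")
    return io.BytesIO(b"payload")


data = read_with_retry(flaky_opener, "http://example.com/f")
```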

> * Will the file object returned require cleanup? i.e. Is it safe to read
>   it and move on or will code using urlopen be required to cleanup
>   a temp file or something?

urlopen and urlread do not create temp files.  This is deliberate and
has many reasons.  One big one is security.  The only real cleanup is
that you should call fo.close() when you're done reading.  I doubt
anything really catastrophic would happen if you don't, but
I know the progress bar relies on it, for example.
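
One way to make sure close() always runs, assuming a modern Python with
contextlib, is to wrap the file object in closing().  The TrackedFile
class below is a hypothetical stand-in for whatever urlopen returns.
It exists only so the sketch can show that close() was called.

```python
import io
from contextlib import closing


class TrackedFile(io.BytesIO):
    """Hypothetical stand-in for a urlopen result; records that close() ran."""

    closed_cleanly = False

    def close(self):
        self.closed_cleanly = True
        super().close()


fo = TrackedFile(b"hello")
with closing(fo):       # close() is called even if read() raises
    data = fo.read()
```

closing() only needs the object to have a close() method, so it works
with any of the file objects mentioned above.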

> * Oh yeah... Where do you plan on committing? It looks like HEAD is now
>   2.1.0 development. I've been working against dailies but the include=
>   stuff should move smoothly into the 2.1.0 code.

urlgrabber is getting moved out of yum.  The plan is that it will be
developed in its own cvs, and occasionally get copied into yum (and
other projects).  I'll put it in HEAD shortly.  For now, just drop it
in your own working directory.

You can grab the new urlgrabber (no pun intended) from here:

http://www.linux.duke.edu/projects/mini/urlgrabber/dist/

					-Michael

I will be posting a second email about the future of urlgrabber
shortly, which you may want to read :)
-- 
  Michael Stenner                       Office Phone: 919-660-2513
  Duke University, Dept. of Physics       mstenner@xxxxxxxxxxxx
  Box 90305, Durham N.C. 27708-0305