Again, if you don't know what urlgrabber is, you don't need to read
this.

I am actively requesting input from Jeremy, Seth, and Icon.  I would
love input from others as well (Ryan?), but these are the ones that
will get the beatings.

Here is the basic design that I have in mind.  This (intentionally)
has no mention of internal workings.  It only discusses things that
matter to someone who would USE the module.  Internal design is
certainly open for discussion, but I only want to talk about it now
to the extent that it affects the interface.

                                        -Michael

=======================================================================

MAIN FUNCTIONS:

  urlgrab   -- Fetch a url and make a local copy.  Return the filename.
  urlopen   -- Return a file object for the specified url.
  urlread   -- Read the specified url into a string and return it.

  retrygrab -- Wrapper for urlgrab that retries given certain errors.
  retryopen -- Wrapper for urlopen that retries given certain errors.
  retryread -- Wrapper for urlread that retries given certain errors.

  NOTE: retryopen can't protect you from errors that occur AFTER the
  connection is made.  It can only retry setting up the connection.

FEATURES:

  * identical behavior for http, ftp, and file

    Options that change the behavior for one protocol (like
    copy_local) are OK as long as they don't affect the other
    protocols.  However, something like byte-ranges MUST work for all
    protocols.  These are different because byte-ranges CHANGE the
    return value for a given input; copy_local only modifies the
    internal behavior.

    All options must be syntactically legal for ALL urls.  The whole
    point is to have the library not care what sort of url is passed
    in.

  * smart url interpretation
    - handle "normal local filenames" also
    - handle url-encoded username/password for ftp and http (and
      file?  smb?)
  * byte ranges
  * reget support
    - internally supported via byte ranges
    - several reget modes
      + never: always start from the beginning
      + force: always pick up from the end of the local file
      + smart: check timestamps, length, etc.
  * throttling
  * progress meter
  * i18n support (if the calling application provides translations)
  * settable User-Agent
  * http keepalive (via the keepalive module)
  * timestamp preservation

INTERFACE:

  I'm considering changing the function interface a little.  There
  are just getting to be an insane number of options, and I'm not
  sure how to deal with it.  There is also the issue of passing
  options through retry*.

  Option 1 (the way it is now, everything is a kwarg)

    def urlgrab(url, filename=None, copy_local=0, close_connection=0,
                progress_obj=None, throttle=None, bandwidth=None):

    def retrygrab(url, filename=None, copy_local=0, close_connection=0,
                  progress_obj=None, throttle=None, bandwidth=None,
                  numtries=3, retrycodes=[-1,2,4,5,6,7], checkfunc=None):

    This is REALLY ugly and it makes it very hard to cleanly add
    options.  Specifically, what if someone does:

      retrygrab(url, fn, 1, 0, None, None, None, 5)  # the last is numtries

    and then we later add more options to urlgrab?  Sure, it's not
    likely, and sure, I put a warning in the docs to only use these
    as kwargs, but still.  It's very icky.  However, it is very clear
    and very normal.

  Option 2

    def urlgrab(url, filename=None, **kwargs):
    def retrygrab(url, filename=None, **kwargs):

    retrygrab could then strip out the options it cares about and
    pass on the rest.  This makes the function definition very clean,
    but completely useless to look at.  The legal args would have to
    go in the docs.  One of the up-sides is that things could ONLY be
    called as keyword args, so the ordering is irrelevant.
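To make Option 2 concrete, here is a minimal sketch of the
strip-and-pass-on idea.  This is NOT the real urlgrabber code; urlgrab
is just a stub that records what reached it, and the error handling is
simplified (real code would catch a specific grabber error and consult
retrycodes/checkfunc).

```python
def urlgrab(url, filename=None, **kwargs):
    # Stand-in stub: a real implementation would fetch the url.
    # Here we just return what we received so the pass-through
    # behavior is visible.
    return (url, filename, kwargs)

def retrygrab(url, filename=None, **kwargs):
    # Strip out the retry-specific options; everything else is
    # passed on to urlgrab untouched.
    numtries = kwargs.pop('numtries', 3)
    retrycodes = kwargs.pop('retrycodes', [-1, 2, 4, 5, 6, 7])
    checkfunc = kwargs.pop('checkfunc', None)
    for tries in range(numtries):
        try:
            return urlgrab(url, filename, **kwargs)
        except Exception:
            # Simplified: real code would re-raise immediately if the
            # error code is not in retrycodes.
            if tries == numtries - 1:
                raise
```

So retrygrab(url, copy_local=1, numtries=5) would consume numtries
itself and hand only copy_local to urlgrab, which is the "clean
definition, args documented elsewhere" trade-off described above.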
  Option 3

    def urlgrab(url, filename=None, options=None):
    def retrygrab(url, filename=None, options=None):

    Same as 2, but instead of calling as:

      urlgrab(url, copy_local=1)

    it must be:

      urlgrab(url, options={'copy_local': 1})

    I don't really like this option.  It's just a step on the way to
    the next one :)

  Option 4

    def urlgrab(url, filename=None, options=None):
    def retrygrab(url, filename=None, options=None, retry_options=None):

    Here, the options arg to retrygrab would get passed through
    untouched, and retry_options would be ONLY for options related to
    the retry process.

  I'm open to other ideas...  If I had to pick now, I'd probably go
  with (2), but I'm still quite open.

STRUCTURE:

  Because urlgrabber already consists of at least two files
  (urlgrabber.py and keepalive.py), I'm thinking of making it a
  "package" (a directory with sub-modules inside).  One might argue
  that this is the only sane way to go if it's going to be a tidy
  library.  This will also make life much easier if we need to do
  "parallel installs" farther down the road.  Then again, maybe
  keepalive.py and progress_meter.py should be separate!

-- 
Michael Stenner                       Office Phone: 919-660-2513
Duke University, Dept. of Physics       mstenner@xxxxxxxxxxxx
Box 90305, Durham N.C.  27708-0305