Again, if you don't know what urlgrabber is, you don't need to read
this.

I am actively requesting input from Jeremy, Seth, and Icon.  I would
love input from others as well (Ryan?), but these are the ones that
will get the beatings.

Here is the basic design that I have in mind.  This (intentionally)
has no mention of internal workings.  It only discusses things that
matter to someone who would USE the module.  Internal design is
certainly open for discussion, but I only want to talk about it now
to the extent that it affects the interface.

                                        -Michael

=======================================================================

MAIN FUNCTIONS:

  urlgrab   -- Fetch a url and make a local copy.  Return the filename.
  urlopen   -- Return a file object for the specified url.
  urlread   -- Read the specified url into a string and return it.

  retrygrab -- Wrapper for urlgrab that retries given certain errors.
  retryopen -- Wrapper for urlopen that retries given certain errors.
  retryread -- Wrapper for urlread that retries given certain errors.

  NOTE: retryopen can't protect you from errors that occur AFTER the
  connection is made.  It can only retry setting up the connection.

FEATURES:

  * identical behavior for http, ftp, and file

    Options that change the behavior for one protocol (like
    copy_local) are OK as long as they don't affect the other
    protocols.  However, something like byte-ranges MUST work for all
    protocols.  These are different because byte-ranges CHANGE the
    return value for a given input; copy_local only modifies the
    internal behavior.

    All options must be syntactically legal for ALL urls.  The whole
    point is to have the library not care what sort of url is passed
    in.

  * smart url interpretation
    - handle "normal local filenames" also
    - handle url-encoded username/password for ftp and http (and
      file?  smb?)
  * byte ranges
  * reget support
    - internally supported via byte ranges
    - several reget modes
      + never: always start from the beginning
      + force: always pick up from the end of the local file
      + smart: check timestamps, length, etc.
  * throttling
  * progress meter
  * i18n support (if the calling application provides translations)
  * settable User-Agent
  * http keepalive (via the keepalive module)
  * timestamp preservation

INTERFACE:

  I'm considering changing the function interface a little.  There
  are just getting to be an insane number of options, and I'm not
  sure how to deal with it.  There is also the issue of passing
  options through retry*.

  Option 1 (the way it is now, everything is a kwarg)

    def urlgrab(url, filename=None, copy_local=0, close_connection=0,
                progress_obj=None, throttle=None, bandwidth=None):

    def retrygrab(url, filename=None, copy_local=0, close_connection=0,
                  progress_obj=None, throttle=None, bandwidth=None,
                  numtries=3, retrycodes=[-1,2,4,5,6,7], checkfunc=None):

    This is REALLY ugly and it makes it very hard to cleanly add
    options.  Specifically, what if someone does:

      retrygrab(url, fn, 1, 0, None, None, None, 5)  # the last is numtries

    and then we later add more options to urlgrab?  Sure, it's not
    likely, and sure, I put a warning in the docs to only use these
    as kwargs, but still.  It's very icky.  However, it is very clear
    and very normal.

  Option 2

    def urlgrab(url, filename=None, **kwargs):
    def retrygrab(url, filename=None, **kwargs):

    retrygrab could then strip out the options it cares about and
    pass on the rest.  This makes the function definition very clean,
    but completely useless to look at.  The legal args would have to
    go in the docs.  One of the up-sides is that things could ONLY be
    called as keyword args, so the ordering is irrelevant.
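To make Option 2 concrete, here is a minimal sketch of the
strip-and-pass-on idea.  This is NOT the real urlgrabber code; urlgrab
is just a stub that records what reached it, and the error handling is
simplified (real code would catch a specific grabber error and consult
retrycodes/checkfunc).

```python
def urlgrab(url, filename=None, **kwargs):
    # Stand-in stub: a real implementation would fetch the url.
    # Here we just return what we received so the pass-through
    # behavior is visible.
    return (url, filename, kwargs)

def retrygrab(url, filename=None, **kwargs):
    # Strip out the retry-specific options; everything else is
    # passed on to urlgrab untouched.
    numtries = kwargs.pop('numtries', 3)
    retrycodes = kwargs.pop('retrycodes', [-1, 2, 4, 5, 6, 7])
    checkfunc = kwargs.pop('checkfunc', None)
    for tries in range(numtries):
        try:
            return urlgrab(url, filename, **kwargs)
        except Exception:
            # Simplified: real code would re-raise immediately if the
            # error code is not in retrycodes.
            if tries == numtries - 1:
                raise
```

So retrygrab(url, copy_local=1, numtries=5) would consume numtries
itself and hand only copy_local to urlgrab, which is the "clean
definition, args documented elsewhere" trade-off described above.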
  Option 3

    def urlgrab(url, filename=None, options=None):
    def retrygrab(url, filename=None, options=None):

    Same as 2, but instead of calling as:

      urlgrab(url, copy_local=1)

    it must be:

      urlgrab(url, options={'copy_local': 1})

    I don't really like this option.  It's just a step on the way to
    the next one :)

  Option 4

    def urlgrab(url, filename=None, options=None):
    def retrygrab(url, filename=None, options=None, retry_options=None):

    Here, the options arg to retrygrab would get passed through
    untouched, and retry_options would be ONLY for options related to
    the retry process.

  I'm open to other ideas...  If I had to pick now, I'd probably go
  with (2), but I'm still quite open.

STRUCTURE:

  Because urlgrabber already consists of at least two files
  (urlgrabber.py and keepalive.py), I'm thinking of making it a
  "package" (a directory with sub-modules inside).  One might argue
  that this is the only sane way to go if it's going to be a tidy
  library.  This will also make life much easier if we need to do
  "parallel installs" farther down the road.  Then again, maybe
  keepalive.py and progress_meter.py should be separate!

-- 
Michael Stenner                       Office Phone: 919-660-2513
Duke University, Dept. of Physics       mstenner@xxxxxxxxxxxx
Box 90305, Durham N.C.  27708-0305