Re: Emergency! Performance downloading big files

On Tue, Dec 1, 2009 at 4:56 PM, LinuxManMikeC <linuxmanmikec@xxxxxxxxx> wrote:

> On Tue, Dec 1, 2009 at 3:48 PM, Brian Dunning <brian@xxxxxxxxxxxxxxxx>
> wrote:
> >
> > This is a holiday-crunch emergency.
> >
> > I'm dealing with a client from whom we need to download many large PDF
> docs 24x7, several thousand per hour, all between a few hundred K and about
> 50 MB. Their security process requires the files to be downloaded via https
> using a big long URL with lots of credential parameters.
> >
> > Here's how I'm doing it. This is on Windows, a quad Xeon with 16GB RAM:
> >
> > $ctx = stream_context_create(array('http' => array('timeout' => 1200)));
> > $contents = file_get_contents($full_url, 0, $ctx);
> > $fp = fopen('D:\\DocShare\\'.$filename, "w");
> > $bytes_written = fwrite($fp, $contents);
> > fclose($fp);
> >
> > It's WAY TOO SLOW. I can paste the URL into a browser and download even
> the largest files quite quickly, but the PHP method bottlenecks and cannot
> keep up.
> >
> > Is there a SUBSTANTIALLY faster way to download and save these files?
> Keep in mind the client's requirements cannot be changed. Thanks for any
> suggestions.
> >
> >
>
> Well, one problem with your code is file_get_contents.  It downloads
> the entire file, puts it in a variable, and then returns that
> variable.  Then you write this huge variable (as much as 50MB from
> what you said) to a file.  If you think about what might be going on
> underneath that seemingly simple function, there could be millions
> of memory reallocations occurring to accommodate the growing
> variable.  I would instead use fopen and read a set number of bytes
> into a buffer variable (taking available bandwidth into
> consideration), writing each chunk to the file as I go.  That said,
> I would never write this kind of program in PHP.  Like others have
> suggested, use curl or wget; you can interface with them through PHP
> to initiate and control the process if you need to.


Agreed.  Ideally a memory buffer size would be defined and, as it filled, it
would periodically be flushed to disk... thinks back to C programming in
college.
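
In PHP that could be as simple as streaming the response to disk in chunks.
A rough, untested sketch, reusing the same stream context and variables
($full_url, $filename) from the original code:

$ctx = stream_context_create(array('http' => array('timeout' => 1200)));
$in  = fopen($full_url, 'rb', false, $ctx);      // open the remote file as a stream
$out = fopen('D:\\DocShare\\'.$filename, 'wb');
if ($in && $out) {
    while (!feof($in)) {
        // read 8K at a time and write it straight to disk, so the whole
        // document never has to sit in one giant PHP string
        fwrite($out, fread($in, 8192));
    }
}
if ($in)  fclose($in);
if ($out) fclose($out);

(stream_copy_to_stream($in, $out) would do roughly the same loop for you.)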

In this day and age, though, I'd just give the curl option CURLOPT_FILE a
shot, as it's most likely implementing said logic already.
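
It's only a handful of lines; a rough, untested sketch with error handling
omitted (CURLOPT_FOLLOWLOCATION is included on the assumption that the
credentialed URLs might redirect):

$fp = fopen('D:\\DocShare\\'.$filename, 'wb');
$ch = curl_init($full_url);
curl_setopt($ch, CURLOPT_FILE, $fp);            // curl writes the body straight to $fp
curl_setopt($ch, CURLOPT_TIMEOUT, 1200);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);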

Depending on the upstream bandwidth your client has and your download
bandwidth, you may also see greater throughput by downloading multiple files
in parallel, aka curl_multi_init() ;)
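
Roughly like this, as a sketch only; $urls here is an assumed array of
filename => URL pairs, and in practice you'd want to cap how many transfers
run at once:

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $filename => $url) {
    $fp = fopen('D:\\DocShare\\'.$filename, 'wb');
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);        // each transfer streams to its own file
    curl_setopt($ch, CURLOPT_TIMEOUT, 1200);
    curl_multi_add_handle($mh, $ch);
    $handles[] = array($ch, $fp);
}
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);                     // wait for activity instead of spinning
} while ($running > 0);
foreach ($handles as $h) {
    curl_multi_remove_handle($mh, $h[0]);
    curl_close($h[0]);
    fclose($h[1]);
}
curl_multi_close($mh);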

-nathan
