Re: Getting meta data (of any type) for an XML file being from it's URL.

Tommy Pham <tommyhp2@xxxxxxxxx> · Thu, 29 Sep 2011 15:27:11 -0700

On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:

> Hi.
>
> I'm looking to process very large XML files without the need of first
> downloading them.
>
> To that end,
> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
> is working perfectly.
>
> But a downside is that I have no information of my progress.
>
> Is there any mechanism available to get a position within the XML stream?
>
> I can use libxml_set_streams_context() to set a context (so I can
> provide POST data if needed for the HTTP request), but I can't see how
> to gain access to the stream within the libxml code (SimpleXML uses
> libxml).
>
> At the most basic, what I'm looking for is to be able to record a
> percentage complete. Even with compression, I'll be reading some bytes
> from a stream (either from http or from compress.zlib) and I want to
> know where I am in that stream.
>
> The HTTP header will tell me how big the file is (so that can easily
> be a HEAD request to get that data).
>
>
> Even if I DO save the file locally first, I still can't get a position.
>
> If I use the SimpleXMLIterator::count() method, I am unsure as to what
> will happen if I am using a stream (rather than a local file). If I
> use ...
>
> $xml = new SimpleXMLIterator(...);
> $items = $xml->count();
> foreach($xml as $s_Tag => $o_Item) {
>  ...
> }
>
> will the XML file be cached somewhere? Or will that depend upon the
> originating server supporting some sort of rewind/chunk mechanism?
>
>
>
> Any suggestions/ideas?
>
>
>
> Richard.
>
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>
>
Richard,

The only thing I can think of is to break this up into two parts.  The first
part uses cURL for the download and monitors its progress.  The second part
is the XML parsing.  If you want to do both simultaneously, and only if the
remote supports chunked encoding/transfer, you could use multiple cURL
handles to fill a numerically indexed array with the chunks for XML parsing.
This could be memory intensive if the file is very large, depending on
whether you free the elements in the array as you progress.  So if memory
consumption is a major concern, you may want to save the file locally until
the import/migration is done.  There may also be a situation where you'll
need to review the data, and a local copy saves the time and bandwidth of
having to re-download it.
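
A rough sketch of the two-part approach, in case it helps.  It is untested
against your feed and makes some assumptions: the function names
(percentComplete, downloadWithProgress, parseProducts) are made up for
illustration, the cURL and zlib extensions are loaded, and the server sends
a Content-Length so cURL knows the total size (otherwise the percentage is
reported as null).

```php
<?php
// Hypothetical helper: percentage received, or null when the total is unknown.
function percentComplete(int $received, int $total): ?float
{
    return $total > 0 ? round($received / $total * 100, 1) : null;
}

// Part 1: download $url to $path, passing the percentage to $onProgress.
function downloadWithProgress(string $url, string $path, callable $onProgress): bool
{
    $out = fopen($path, 'wb');
    if ($out === false) {
        return false;
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $out);            // write the body straight to disk
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_NOPROGRESS, false);     // enable the progress callback
    curl_setopt(
        $ch,
        CURLOPT_PROGRESSFUNCTION,
        function ($handle, $dlTotal, $dlNow, $ulTotal, $ulNow) use ($onProgress) {
            $onProgress(percentComplete((int) $dlNow, (int) $dlTotal));
            return 0; // a non-zero return value aborts the transfer
        }
    );

    $ok = curl_exec($ch);
    unset($ch);      // release the handle before closing the file
    fclose($out);

    return $ok !== false;
}

// Part 2: parse the saved gzipped file, calling $onItem for each top-level
// element. Returns the number of elements visited.
function parseProducts(string $path, callable $onItem): int
{
    // Third argument true: treat the first argument as a path/URL, not XML.
    $xml = new SimpleXMLIterator('compress.zlib://' . $path, 0, true);

    $count = 0;
    foreach ($xml as $tag => $item) {
        $onItem($tag, $item);
        $count++;
    }

    return $count;
}
```

You would call downloadWithProgress() with your URL, a local path and a
callback that logs the percentage, then hand the same local path to
parseProducts().  Note that the percentage only covers the download;
SimpleXML still gives you no position during the parse itself.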

Regards,
Tommy