On Thu, Sep 29, 2011 at 3:27 PM, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
>
>> Hi.
>>
>> I'm looking to process very large XML files without the need of first
>> downloading them.
>>
>> To that end,
>> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
>> is working perfectly.
>>
>> But a downside is that I have no information about my progress.
>>
>> Is there any mechanism available to get a position within the XML stream?
>>
>> I can use libxml_set_streams_context() to set a context (so I can
>> provide POST data if needed for the HTTP request), but I can't see how
>> to gain access to the stream within the libxml code (SimpleXML uses
>> libxml).
>>
>> At the most basic, what I'm looking for is to be able to record a
>> percentage complete. Even with compression, I'll be reading some bytes
>> from a stream (either from http or from compress.zlib) and I want to
>> know where I am in that stream.
>>
>> The HTTP header will tell me how big the file is (so that can easily
>> be a HEAD request to get that data).
>>
>> Even if I DO save the file locally first, I still can't get a position.
>>
>> If I use the SimpleXMLIterator::count() method, I am unsure as to what
>> will happen if I am using a stream (rather than a local file). If I
>> use ...
>>
>> $xml = new SimpleXMLIterator(...);
>> $items = $xml->count();
>> foreach ($xml as $s_Tag => $o_Item) {
>>     ...
>> }
>>
>> will the XML file be cached somewhere? Or will that depend upon the
>> originating server supporting some sort of rewind/chunk mechanism?
>>
>> Any suggestions/ideas?
>>
>> Richard.
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend : PHPDoc
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>>
>
> Richard,
>
> The only thing I can think of is to break it up into two parts. The
> first part uses cURL for the download stream and monitors its progress.
> The second part is the XML parsing. If you want to do both
> simultaneously, and only if the remote server supports chunked
> encoding/transfer, you could use multiple cURL handles feeding a
> numerically indexed array containing the chunks for XML parsing. This
> could be memory intensive if the file is very large, depending on
> whether you free the elements of the array as you progress. So if
> memory consumption is a major concern, you may want to save the file
> locally until the import/migration is done, as there may be a situation
> where you'll need to review the data, thus saving the time and
> bandwidth of having to re-download it.
>
> Regards,
> Tommy
>
> An afterthought: handling streams and parsing at the same time may
> require sophisticated XML node validation, as a node may be split
> between the transferred chunks.
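
A minimal sketch of the libxml_set_streams_context() route Richard mentions: a stream context can carry a 'notification' callback, and PHP's http wrapper reports STREAM_NOTIFY_FILE_SIZE_IS and STREAM_NOTIFY_PROGRESS events through it. Whether the compress.zlib wrapper hands the context (and so the notifications) down to the inner http stream is an assumption here, and the URL is the placeholder from the thread (CLI assumed for the STDERR progress line):

<?php
// Untested sketch: progress via stream notifications. The callback fires
// while libxml is pulling data from the http stream.
$total = 0;

$context = stream_context_create(
    array(),
    array(
        'notification' => function ($code, $severity, $message, $messageCode,
                                    $transferred, $max) use (&$total) {
            if ($code === STREAM_NOTIFY_FILE_SIZE_IS) {
                $total = $max; // compressed size from Content-Length
            } elseif ($code === STREAM_NOTIFY_PROGRESS && $total > 0) {
                fprintf(STDERR, "%.1f%% transferred\r", 100 * $transferred / $total);
            }
        },
    )
);

libxml_set_streams_context($context);

$xml = new SimpleXMLIterator(
    'compress.zlib://http://www.site.com/products.xml.gz',
    0,
    true // treat the string as a URL, not as XML data
);
?>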
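
If the notifications don't make it through the wrapper chain, a thin pass-through stream wrapper gives direct access to the byte position. This sketch (again untested) registers an invented progress:// scheme inside compress.zlib://, so it counts the raw compressed bytes and can compare them against the Content-Length from a HEAD request. One caveat: SimpleXML parses the whole document when the object is constructed, so the reporting has to happen inside stream_read(); by the time the foreach starts, the stream has already been consumed.

<?php
// Untested sketch: a counting pass-through wrapper.
class ProgressStream
{
    public $context;              // populated by PHP when a context is in use
    public static $total = 0;     // compressed size, set before opening
    private $handle;
    private $read = 0;

    public function stream_open($path, $mode, $options, &$opened_path)
    {
        // Strip our scheme and open the real stream (http:// here).
        $real = substr($path, strlen('progress://'));
        $this->handle = @fopen($real, $mode);
        return $this->handle !== false;
    }

    public function stream_read($count)
    {
        $data = fread($this->handle, $count);
        $this->read += strlen($data);
        if (self::$total > 0) {
            fprintf(STDERR, "%.1f%% read\r", 100 * $this->read / self::$total);
        }
        return $data;
    }

    public function stream_eof()
    {
        return feof($this->handle);
    }

    public function stream_close()
    {
        fclose($this->handle);
    }
}

stream_wrapper_register('progress', 'ProgressStream');

$url = 'http://www.site.com/products.xml.gz'; // placeholder

// HEAD request for the size, as Richard suggests.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = array_change_key_case(get_headers($url, 1));
stream_context_set_default(array('http' => array('method' => 'GET')));
ProgressStream::$total = isset($headers['content-length'])
    ? (int) $headers['content-length']
    : 0;

// Assumes compress.zlib:// chains onto a userland wrapper the same way
// it chains onto http://.
$xml = new SimpleXMLIterator('compress.zlib://progress://' . $url, 0, true);

foreach ($xml as $s_Tag => $o_Item) {
    // The document is fully read by now; progress was shown in stream_read().
}
?>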
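
And a rough sketch of the two-part approach Tommy describes: download with cURL while reporting progress, then parse the saved file. The URL and filename are placeholders, and note that the CURLOPT_PROGRESSFUNCTION callback took four arguments in the PHP releases current when this thread was written (PHP 5.5 later prepended the curl handle):

<?php
// Untested sketch of part one: stream the download to disk with progress.
$url   = 'http://www.site.com/products.xml.gz'; // placeholder
$local = 'products.xml.gz';

$fp = fopen($local, 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt(
    $ch,
    CURLOPT_PROGRESSFUNCTION,
    function ($downloadSize, $downloaded, $uploadSize, $uploaded) {
        if ($downloadSize > 0) {
            fprintf(STDERR, "Downloaded %.1f%%\r", 100 * $downloaded / $downloadSize);
        }
        return 0; // non-zero aborts the transfer
    }
);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// Part two: parse locally. With a local file, count() is safe to call
// up front and gives per-element progress through the import.
$xml   = new SimpleXMLIterator('compress.zlib://' . $local, 0, true);
$items = $xml->count();
$done  = 0;

foreach ($xml as $s_Tag => $o_Item) {
    $done++;
    fprintf(STDERR, "Imported %d of %d\r", $done, $items);
}
?>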