On 29 September 2011 23:34, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> On Thu, Sep 29, 2011 at 3:27 PM, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
>>
>> On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
>>>
>>> Hi.
>>>
>>> I'm looking to process very large XML files without needing to
>>> download them first.
>>>
>>> To that end,
>>> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
>>> is working perfectly.
>>>
>>> But a downside is that I have no information about my progress.
>>>
>>> Is there any mechanism available to get a position within the XML stream?
>>>
>>> I can use libxml_set_streams_context() to set a context (so I can
>>> provide POST data if needed for the HTTP request), but I can't see how
>>> to gain access to the stream within the libxml code (SimpleXML uses
>>> libxml).
>>>
>>> At the most basic, what I'm looking for is to be able to record a
>>> percentage complete. Even with compression, I'll be reading some bytes
>>> from a stream (either from http or from compress.zlib) and I want to
>>> know where I am in that stream.
>>>
>>> The HTTP header will tell me how big the file is (so that can easily
>>> be a HEAD request to get that data).
>>>
>>> Even if I DO save the file locally first, I still can't get a position.
>>>
>>> If I use the SimpleXMLIterator::count() method, I am unsure as to what
>>> will happen if I am using a stream (rather than a local file). If I
>>> use ...
>>>
>>> $xml = new SimpleXMLIterator(...);
>>> $items = $xml->count();
>>> foreach ($xml as $s_Tag => $o_Item) {
>>>     ...
>>> }
>>>
>>> will the XML file be cached somewhere? Or will that depend upon the
>>> originating server supporting some sort of rewind/chunk mechanism?
>>>
>>> Any suggestions/ideas?
>>>
>>> Richard.
>>>
>>> --
>>> Richard Quadling
>>> Twitter : EE : Zend : PHPDoc
>>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>>>
>>
>> Richard,
>>
>> The only thing I can think of is to break it up into two parts. The
>> first part uses cURL for the download stream and monitors its progress.
>> The second part is the XML parsing. If you want to do both
>> simultaneously, and only if the remote server supports chunked
>> encoding/transfer, you could use multiple cURL handles feeding a
>> numerically indexed array containing the chunks for XML parsing. This
>> could be memory intensive if the file is very large, depending on
>> whether you free the elements of the array as you progress. So if
>> memory consumption is a major concern, you may want to save the file
>> locally until the import/migration is done, as there may be a situation
>> where you'll need to review the data, thus saving the time and
>> bandwidth of having to re-download it.
>>
>> Regards,
>> Tommy
>
> An afterthought: handling the stream and parsing at the same time may
> require sophisticated XML node validation, as a node may be split
> between the transferred chunks.
>

SimpleXMLIterator() does a superb job of providing one node at a time and
doesn't remember each node automatically; if I do remember them, that's my
issue. But accessing the stream/file metadata doesn't seem to be possible.

--
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
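
For what it's worth, here is a rough, untested sketch of the
percentage-complete idea using the http wrapper's notification callback. It
assumes the callback still fires when libxml pulls the data through the
context set with libxml_set_streams_context() and through the
compress.zlib:// wrapper (I have not verified either), and the feed URL is
just the placeholder from the post above:

<?php
$totalBytes = 0;

// The notification reports *compressed* bytes, so the percentage tracks the
// download, not the decompressed XML.
$notifier = function ($code, $severity, $message, $messageCode, $transferred, $max) use (&$totalBytes) {
    switch ($code) {
        case STREAM_NOTIFY_FILE_SIZE_IS:
            $totalBytes = $max; // Content-Length, if the server sent one
            break;
        case STREAM_NOTIFY_PROGRESS:
            if ($totalBytes > 0) {
                printf("%.1f%% (%d of %d bytes)\r",
                    100 * $transferred / $totalBytes, $transferred, $totalBytes);
            }
            break;
    }
};

$context = stream_context_create(
    array('http' => array('method' => 'GET')),
    array('notification' => $notifier)
);

// Hand the context to libxml so SimpleXML uses it for the underlying stream.
libxml_set_streams_context($context);

$xml = new SimpleXMLIterator(
    'compress.zlib://http://www.site.com/products.xml.gz',
    0,
    true // treat the first argument as a URL, not literal XML
);

foreach ($xml as $s_Tag => $o_Item) {
    // process one node at a time
}
?>

If the server omits Content-Length (or uses chunked transfer), $totalBytes
stays at 0 and you'd have to fall back to reporting raw bytes transferred.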
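
And a sketch of Tommy's two-step suggestion, equally untested: download the
feed to a local file with cURL while reporting progress, then parse the local
copy. The URL and temp-file handling are placeholders; note that from PHP 5.5
the cURL handle is prepended as an extra first argument to the progress
callback.

<?php
$url   = 'http://www.site.com/products.xml.gz';
$local = tempnam(sys_get_temp_dir(), 'feed');

$fp = fopen($local, 'wb');
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_NOPROGRESS, false); // required, or the callback never fires
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION,
    function ($downloadSize, $downloaded, $uploadSize, $uploaded) {
        if ($downloadSize > 0) {
            printf("%.1f%% (%d of %d bytes)\r",
                100 * $downloaded / $downloadSize, $downloaded, $downloadSize);
        }
    }
);

curl_exec($ch);
curl_close($ch);
fclose($fp);

// Parse the local copy through the same compress.zlib:// wrapper.
$xml = new SimpleXMLIterator('compress.zlib://' . $local, 0, true);
foreach ($xml as $s_Tag => $o_Item) {
    // process one node at a time
}

unlink($local);
?>

Keeping the downloaded file around (rather than unlinking it straight away)
would also cover Tommy's point about being able to review the data without
re-downloading it.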