On Fri, Sep 30, 2011 at 3:03 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
> On 29 September 2011 23:34, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> > On Thu, Sep 29, 2011 at 3:27 PM, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> >>
> >> On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx>
> >> wrote:
> >>>
> >>> Hi.
> >>>
> >>> I'm looking to process very large XML files without the need of first
> >>> downloading them.
> >>>
> >>> To that end,
> >>> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
> >>> is working perfectly.
> >>>
> >>> But a downside is that I have no information about my progress.
> >>>
> >>> Is there any mechanism available to get a position within the XML stream?
> >>>
> >>> I can use libxml_set_streams_context() to set a context (so I can
> >>> provide POST data if needed for the HTTP request), but I can't see how
> >>> to gain access to the stream within the libxml code (SimpleXML uses
> >>> libxml).
> >>>
> >>> At the most basic, what I'm looking for is to be able to record a
> >>> percentage complete. Even with compression, I'll be reading some bytes
> >>> from a stream (either from http or from compress.zlib) and I want to
> >>> know where I am in that stream.
> >>>
> >>> The HTTP header will tell me how big the file is (so that can easily
> >>> be a HEAD request to get that data).
> >>>
> >>> Even if I DO save the file locally first, I still can't get a position.
> >>>
> >>> If I use the SimpleXMLIterator::count() method, I am unsure as to what
> >>> will happen if I am using a stream (rather than a local file). If I
> >>> use ...
> >>>
> >>> $xml = new SimpleXMLIterator(...);
> >>> $items = $xml->count();
> >>> foreach ($xml as $s_Tag => $o_Item) {
> >>>     ...
> >>> }
> >>>
> >>> will the XML file be cached somewhere? Or will that depend upon the
> >>> originating server supporting some sort of rewind/chunk mechanism?
> >>>
> >>> Any suggestions/ideas?
> >>>
> >>> Richard.
> >>>
> >>> --
> >>> Richard Quadling
> >>> Twitter : EE : Zend : PHPDoc
> >>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
> >>>
> >>
> >> Richard,
> >>
> >> The only thing I can think of is to break it up into two parts: the
> >> first part uses cURL for the download stream and monitors its progress;
> >> the second part is the XML parsing. If you want to do both
> >> simultaneously, and only if the remote supports chunked
> >> encoding/transfer, you could use multiple cURL handles feeding a
> >> numerically indexed array containing the chunks for XML parsing. This
> >> could be memory intensive if the file is very large, depending on
> >> whether you free the elements of the array as you progress. So if
> >> memory consumption is a major concern, you may want to save the file
> >> locally until the import/migration is done, as there may be a situation
> >> where you'll need to review the data, thus saving the time and
> >> bandwidth of having to re-download it.
> >>
> >> Regards,
> >> Tommy
> >
> > An afterthought: handling streams and parsing at the same time may
> > require sophisticated XML node validation, as a node may be split
> > between the transferred chunks.
>
> The SimpleXMLIterator() does do a superb job of providing one node at a
> time and doesn't remember each node automatically. If I do keep them,
> then that's my issue.
>
> Accessing the stream/file meta data doesn't seem possible.
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>
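Have you tried attaching a 'notification' callback to the stream context you
hand to libxml_set_streams_context()? I haven't tested it against a
compressed feed, so I'm not certain the notifications propagate through the
compress.zlib:// wrapper, but as a rough, untested sketch (the URL below is
just a placeholder):

<?php
// Untested sketch: report transfer progress for the stream libxml opens,
// using a stream context "notification" callback. The URL is a placeholder.

$url = 'compress.zlib://http://www.site.com/products.xml.gz';

function progress_notifier($code, $severity, $message, $message_code,
                           $bytes_transferred, $bytes_max)
{
    switch ($code) {
        case STREAM_NOTIFY_FILE_SIZE_IS:
            // Fired when the server sends a Content-Length header.
            echo "Expecting $bytes_max bytes\n";
            break;
        case STREAM_NOTIFY_PROGRESS:
            if ($bytes_max > 0) {
                printf("%.1f%% (%d bytes)\n",
                       100 * $bytes_transferred / $bytes_max,
                       $bytes_transferred);
            } else {
                echo "$bytes_transferred bytes so far\n";
            }
            break;
    }
}

$context = stream_context_create(
    array('http' => array('method' => 'GET')),
    array('notification' => 'progress_notifier')
);

// Hand the context to libxml so SimpleXML uses it when opening the URL.
libxml_set_streams_context($context);

$xml = new SimpleXMLIterator($url, 0, true);
foreach ($xml as $s_Tag => $o_Item) {
    // process one node at a time
}

At best that gives you a percentage of the compressed download (when the
server sends a Content-Length) rather than a position in the decompressed
XML, but it might be enough for a progress indicator.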
A while back, I did a project in Java where parts of it had to access and
process streams. I guess this would be a really good exercise for me to see
whether it's doable in PHP as well, in addition to implementing 'PHP
simulating threads' for this scenario, where one thread would download and
monitor the stream while another would analyze/process the data and import
it into a database.
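For the download-and-monitor half of that, here's a rough, untested sketch of
what I meant earlier with cURL's progress callback (the URL and local file
name are placeholders; note that as of PHP 5.5 the cURL resource is also
passed as the first argument to the callback):

<?php
// Untested sketch: download the compressed feed to a local file while
// reporting progress, then parse the saved copy one node at a time.
// $url and $file are placeholders.

$url  = 'http://www.site.com/products.xml.gz';
$file = '/tmp/products.xml.gz';

function curl_progress($download_size, $downloaded, $upload_size, $uploaded)
{
    if ($download_size > 0) {
        printf("Downloaded %.1f%% (%d of %d bytes)\r",
               100 * $downloaded / $download_size, $downloaded, $download_size);
    }
    return 0; // returning non-zero aborts the transfer
}

$fp = fopen($file, 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'curl_progress');

$ok = curl_exec($ch);
curl_close($ch);
fclose($fp);

if ($ok) {
    // Second part: parse the local copy.
    $xml = new SimpleXMLIterator('compress.zlib://' . $file, 0, true);
    foreach ($xml as $s_Tag => $o_Item) {
        // analyze/process and import into the database here
    }
}

That only covers the downloading; actually overlapping the download with the
parsing (the 'simulating threads' part) would need something like
pcntl_fork() or two cooperating processes sharing the file.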