On Thu, Sep 29, 2011 at 3:27 PM, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
>
>> Hi.
>>
>> I'm looking to process very large XML files without the need of first
>> downloading them.
>>
>> To that end,
>> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
>> is working perfectly.
>>
>> But a downside is that I have no information about my progress.
>>
>> Is there any mechanism available to get a position within the XML stream?
>>
>> I can use libxml_set_streams_context() to set a context (so I can
>> provide POST data if needed for the HTTP request), but I can't see how
>> to gain access to the stream within the libxml code (SimpleXML uses
>> libxml).
>>
>> At the most basic, what I'm looking for is to be able to record a
>> percentage complete. Even with compression, I'll be reading some bytes
>> from a stream (either from http or from compress.zlib) and I want to
>> know where I am in that stream.
>>
>> The HTTP header will tell me how big the file is (so that can easily
>> be a HEAD request to get that data).
>>
>> Even if I DO save the file locally first, I still can't get a position.
>>
>> If I use the SimpleXMLIterator::count() method, I am unsure as to what
>> will happen if I am using a stream (rather than a local file). If I
>> use ...
>>
>> $xml = new SimpleXMLIterator(...);
>> $items = $xml->count();
>> foreach ($xml as $s_Tag => $o_Item) {
>>     ...
>> }
>>
>> will the XML file be cached somewhere? Or will that depend upon the
>> originating server supporting some sort of rewind/chunk mechanism?
>>
>> Any suggestions/ideas?
>>
>> Richard.
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend : PHPDoc
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>>
>
> Richard,
>
> The only thing I can think of is to break it up into two parts. The
> first part uses cURL for the download stream and monitors its progress.
> The second part is the XML parsing. If you want to do both
> simultaneously, and only if the remote server supports chunked
> encoding/transfer, you could use multiple cURL handles feeding a
> numerically indexed array containing the chunks for XML parsing. This
> could be memory intensive if the file is very large, depending on
> whether you free the elements of the array as you progress. So if
> memory consumption is a major concern, you may want to save the file
> locally until the import/migration is done, as there may be a situation
> where you'll need to review the data, thus saving the time and
> bandwidth of having to re-download it.
>
> Regards,
> Tommy
>
> An afterthought: handling streams and parsing at the same time may
> require sophisticated XML node validation, as a node may be split
> between the transferred chunks.
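
A minimal sketch of the libxml_set_streams_context() route Richard mentions: a stream context can carry a 'notification' callback, and PHP's http wrapper reports STREAM_NOTIFY_FILE_SIZE_IS and STREAM_NOTIFY_PROGRESS events through it. Whether the compress.zlib wrapper hands the context (and so the notifications) down to the inner http stream is an assumption here, and the URL is the placeholder from the thread (CLI assumed for the STDERR progress line):

<?php
// Untested sketch: progress via stream notifications. The callback fires
// while libxml is pulling data from the http stream.
$total = 0;

$context = stream_context_create(
    array(),
    array(
        'notification' => function ($code, $severity, $message, $messageCode,
                                    $transferred, $max) use (&$total) {
            if ($code === STREAM_NOTIFY_FILE_SIZE_IS) {
                $total = $max; // compressed size from Content-Length
            } elseif ($code === STREAM_NOTIFY_PROGRESS && $total > 0) {
                fprintf(STDERR, "%.1f%% transferred\r", 100 * $transferred / $total);
            }
        },
    )
);

libxml_set_streams_context($context);

$xml = new SimpleXMLIterator(
    'compress.zlib://http://www.site.com/products.xml.gz',
    0,
    true // treat the string as a URL, not as XML data
);
?>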
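
If the notifications don't make it through the wrapper chain, a thin pass-through stream wrapper gives direct access to the byte position. This sketch (again untested) registers an invented progress:// scheme inside compress.zlib://, so it counts the raw compressed bytes and can compare them against the Content-Length from a HEAD request. One caveat: SimpleXML parses the whole document when the object is constructed, so the reporting has to happen inside stream_read(); by the time the foreach starts, the stream has already been consumed.

<?php
// Untested sketch: a counting pass-through wrapper.
class ProgressStream
{
    public $context;              // populated by PHP when a context is in use
    public static $total = 0;     // compressed size, set before opening
    private $handle;
    private $read = 0;

    public function stream_open($path, $mode, $options, &$opened_path)
    {
        // Strip our scheme and open the real stream (http:// here).
        $real = substr($path, strlen('progress://'));
        $this->handle = @fopen($real, $mode);
        return $this->handle !== false;
    }

    public function stream_read($count)
    {
        $data = fread($this->handle, $count);
        $this->read += strlen($data);
        if (self::$total > 0) {
            fprintf(STDERR, "%.1f%% read\r", 100 * $this->read / self::$total);
        }
        return $data;
    }

    public function stream_eof()
    {
        return feof($this->handle);
    }

    public function stream_close()
    {
        fclose($this->handle);
    }
}

stream_wrapper_register('progress', 'ProgressStream');

$url = 'http://www.site.com/products.xml.gz'; // placeholder

// HEAD request for the size, as Richard suggests.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = array_change_key_case(get_headers($url, 1));
stream_context_set_default(array('http' => array('method' => 'GET')));
ProgressStream::$total = isset($headers['content-length'])
    ? (int) $headers['content-length']
    : 0;

// Assumes compress.zlib:// chains onto a userland wrapper the same way
// it chains onto http://.
$xml = new SimpleXMLIterator('compress.zlib://progress://' . $url, 0, true);

foreach ($xml as $s_Tag => $o_Item) {
    // The document is fully read by now; progress was shown in stream_read().
}
?>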
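
And a rough sketch of the two-part approach Tommy describes: download with cURL while reporting progress, then parse the saved file. The URL and filename are placeholders, and note that the CURLOPT_PROGRESSFUNCTION callback took four arguments in the PHP releases current when this thread was written (PHP 5.5 later prepended the curl handle):

<?php
// Untested sketch of part one: stream the download to disk with progress.
$url   = 'http://www.site.com/products.xml.gz'; // placeholder
$local = 'products.xml.gz';

$fp = fopen($local, 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt(
    $ch,
    CURLOPT_PROGRESSFUNCTION,
    function ($downloadSize, $downloaded, $uploadSize, $uploaded) {
        if ($downloadSize > 0) {
            fprintf(STDERR, "Downloaded %.1f%%\r", 100 * $downloaded / $downloadSize);
        }
        return 0; // non-zero aborts the transfer
    }
);
curl_exec($ch);
curl_close($ch);
fclose($fp);

// Part two: parse locally. With a local file, count() is safe to call
// up front and gives per-element progress through the import.
$xml   = new SimpleXMLIterator('compress.zlib://' . $local, 0, true);
$items = $xml->count();
$done  = 0;

foreach ($xml as $s_Tag => $o_Item) {
    $done++;
    fprintf(STDERR, "Imported %d of %d\r", $done, $items);
}
?>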