On Fri, Sep 30, 2011 at 3:03 AM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
> On 29 September 2011 23:34, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> > On Thu, Sep 29, 2011 at 3:27 PM, Tommy Pham <tommyhp2@xxxxxxxxx> wrote:
> >>
> >> On Thu, Sep 29, 2011 at 9:09 AM, Richard Quadling <rquadling@xxxxxxxxx>
> >> wrote:
> >>>
> >>> Hi.
> >>>
> >>> I'm looking to process very large XML files without the need of first
> >>> downloading them.
> >>>
> >>> To that end,
> >>> SimpleXMLIterator('compress.zlib://http://www.site.com/products.xml.gz')
> >>> is working perfectly.
> >>>
> >>> But a downside is that I have no information about my progress.
> >>>
> >>> Is there any mechanism available to get a position within the XML stream?
> >>>
> >>> I can use libxml_set_streams_context() to set a context (so I can
> >>> provide POST data if needed for the HTTP request), but I can't see how
> >>> to gain access to the stream within the libxml code (SimpleXML uses
> >>> libxml).
> >>>
> >>> At the most basic, what I'm looking for is to be able to record a
> >>> percentage complete. Even with compression, I'll be reading some bytes
> >>> from a stream (either from http or from compress.zlib) and I want to
> >>> know where I am in that stream.
> >>>
> >>> The HTTP header will tell me how big the file is (so that can easily
> >>> be a HEAD request to get that data).
> >>>
> >>> Even if I DO save the file locally first, I still can't get a position.
> >>>
> >>> If I use the SimpleXMLIterator::count() method, I am unsure as to what
> >>> will happen if I am using a stream (rather than a local file). If I
> >>> use ...
> >>>
> >>> $xml = new SimpleXMLIterator(...);
> >>> $items = $xml->count();
> >>> foreach ($xml as $s_Tag => $o_Item) {
> >>>     ...
> >>> }
> >>>
> >>> will the XML file be cached somewhere? Or will that depend upon the
> >>> originating server supporting some sort of rewind/chunk mechanism?
> >>>
> >>> Any suggestions/ideas?
> >>>
> >>> Richard.
> >>>
> >>> --
> >>> Richard Quadling
> >>> Twitter : EE : Zend : PHPDoc
> >>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
> >>>
> >>
> >> Richard,
> >>
> >> The only thing I can think of is to break it up into two parts: the
> >> first part uses cURL for the download stream and monitors its progress;
> >> the second part is the XML parsing. If you want to do both
> >> simultaneously, and only if the remote supports chunked
> >> encoding/transfer, you could use multiple cURL handles feeding a
> >> numerically indexed array containing the chunks for XML parsing. This
> >> could be memory intensive if the file is very large, depending on
> >> whether you free the elements of the array as you progress. So if
> >> memory consumption is a major concern, you may want to save the file
> >> locally until the import/migration is done, as there may be a situation
> >> where you'll need to review the data, thus saving the time and
> >> bandwidth of having to re-download it.
> >>
> >> Regards,
> >> Tommy
> >
> > An afterthought: handling streams and parsing at the same time may
> > require sophisticated XML node validation, as a node may be split
> > between the transferred chunks.
>
> The SimpleXMLIterator() does do a superb job of providing one node at a
> time and doesn't remember each node automatically. If I do keep them,
> then that's my issue.
>
> Accessing the stream/file meta data doesn't seem possible.
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>
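Have you tried attaching a 'notification' callback to the stream context you
hand to libxml_set_streams_context()? I haven't tested it against a
compressed feed, so I'm not certain the notifications propagate through the
compress.zlib:// wrapper, but as a rough, untested sketch (the URL below is
just a placeholder):

<?php
// Untested sketch: report transfer progress for the stream libxml opens,
// using a stream context "notification" callback. The URL is a placeholder.

$url = 'compress.zlib://http://www.site.com/products.xml.gz';

function progress_notifier($code, $severity, $message, $message_code,
                           $bytes_transferred, $bytes_max)
{
    switch ($code) {
        case STREAM_NOTIFY_FILE_SIZE_IS:
            // Fired when the server sends a Content-Length header.
            echo "Expecting $bytes_max bytes\n";
            break;
        case STREAM_NOTIFY_PROGRESS:
            if ($bytes_max > 0) {
                printf("%.1f%% (%d bytes)\n",
                       100 * $bytes_transferred / $bytes_max,
                       $bytes_transferred);
            } else {
                echo "$bytes_transferred bytes so far\n";
            }
            break;
    }
}

$context = stream_context_create(
    array('http' => array('method' => 'GET')),
    array('notification' => 'progress_notifier')
);

// Hand the context to libxml so SimpleXML uses it when opening the URL.
libxml_set_streams_context($context);

$xml = new SimpleXMLIterator($url, 0, true);
foreach ($xml as $s_Tag => $o_Item) {
    // process one node at a time
}

At best that gives you a percentage of the compressed download (when the
server sends a Content-Length) rather than a position in the decompressed
XML, but it might be enough for a progress indicator.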
A while back, I did a project in Java where parts of it had to access and
process streams. I guess this would be a really good exercise for me to see
whether it's doable in PHP as well, in addition to implementing 'PHP
simulating threads' for this scenario, where one thread would download and
monitor the stream while another would analyze/process the data and import
it into a database.
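For the download-and-monitor half of that, here's a rough, untested sketch of
what I meant earlier with cURL's progress callback (the URL and local file
name are placeholders; note that as of PHP 5.5 the cURL resource is also
passed as the first argument to the callback):

<?php
// Untested sketch: download the compressed feed to a local file while
// reporting progress, then parse the saved copy one node at a time.
// $url and $file are placeholders.

$url  = 'http://www.site.com/products.xml.gz';
$file = '/tmp/products.xml.gz';

function curl_progress($download_size, $downloaded, $upload_size, $uploaded)
{
    if ($download_size > 0) {
        printf("Downloaded %.1f%% (%d of %d bytes)\r",
               100 * $downloaded / $download_size, $downloaded, $download_size);
    }
    return 0; // returning non-zero aborts the transfer
}

$fp = fopen($file, 'wb');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'curl_progress');

$ok = curl_exec($ch);
curl_close($ch);
fclose($fp);

if ($ok) {
    // Second part: parse the local copy.
    $xml = new SimpleXMLIterator('compress.zlib://' . $file, 0, true);
    foreach ($xml as $s_Tag => $o_Item) {
        // analyze/process and import into the database here
    }
}

That only covers the downloading; actually overlapping the download with the
parsing (the 'simulating threads' part) would need something like
pcntl_fork() or two cooperating processes sharing the file.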