Re: Sequential access of XML nodes.

Stuart Dallas <stuart@xxxxxxxx> · Mon, 26 Sep 2011 17:46:57 +0100

On 26 Sep 2011, at 17:24, Richard Quadling wrote:
> I've got a project which will be needing to iterate some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
> 
> The XML files have a root node and then a collection of products. In
> total, in all the files, there are going to be several million product
> details. Each XML feed will have a different structure as it relates
> to a different source of data.
> 
> I plan to have an abstract reader class with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and has the ability to return a standardised view of
> the data for importing into mysql and eventually MongoDB.
> 
> I want to use an XML iterator so that I can say something along the lines of ...
> 
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML getting back one node at a time without keeping
> all the nodes in memory.
> 
> e.g.
> 
> <?php
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach($o_XML as $o_Product) {
> // Process product.
> }
> 
> 
> Add to this that some of the xml feeds come .gz, I want to be able to
> stream the XML out of the .gz file without having to extract the
> entire file first.
> 
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
> 
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
> 
> 
> In this instance, the memory limitation is important. The current code
> is string based and whilst it works, you can imagine the complexity of
> it.
> 
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.

As far as I'm aware, XML Parser can handle all of this.

    http://php.net/xml

It's a SAX parser so you can feed it the data chunk by chunk. You can use gzopen to open gzipped files and manually feed the data into xml_parse. Be sure to read the docs carefully because there's a lot to be aware of when parsing an XML document in pieces.

-Stuart

-- 
Stuart Dallas
3ft9 Ltd
http://3ft9.com/
-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php