Re: Sequential access of XML nodes.

Adam Richardson <simpleshot@xxxxxxxxx> · Mon, 26 Sep 2011 14:17:43 -0400

On Mon, Sep 26, 2011 at 12:24 PM, Richard Quadling <rquadling@xxxxxxxxx> wrote:

> Hi.
>
> I've got a project which will be needing to iterate some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
>
> The XML files have a root node and then a collection of products. In
> total, in all the files, there are going to be several million product
> details. Each XML feed will have a different structure as it relates
> to a different source of data.
>
> I plan to have an abstract reader class with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and able to return a standardised view of the data
> for importing into mysql and eventually MongoDB.
>
> I want to use an XML iterator so that I can say something along the lines
> of ...
>
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML getting back one node at a time without keeping
> all the nodes in memory.
>
> e.g.
>
> <?php
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach($o_XML as $o_Product) {
>  // Process product.
> }
>
>
> Add to this that some of the xml feeds come .gz, I want to be able to
> stream the XML out of the .gz file without having to extract the
> entire file first.
>
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
>
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
>
>
> In this instance, the memory limitation is important. The current code
> is string based and whilst it works, you can imagine the complexity of
> it.
>
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.
>
> Thanks.
>
> Richard.
>
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>
I believe XMLReader lets you pull the document node by node, without
loading the whole file into memory, and it's really easy to work with:
http://www.php.net/manual/en/intro.xmlreader.php
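
Something along these lines might be a starting point. It is only a rough
sketch: the <products>/<product> element names, the inline sample feed and
the iterateProducts() helper are all made up for illustration, since the
real feeds will each look different. It wraps XMLReader in a generator so
you can foreach over the products while only one fragment is expanded at a
time.

```php
<?php
// Hypothetical sample feed; the <products>/<product> names are assumptions.
$xml = <<<XML
<products>
  <product><id>1</id><name>Widget</name></product>
  <product><id>2</id><name>Gadget</name></product>
</products>
XML;

/**
 * Yield each element named $nodeName as a SimpleXMLElement, one at a time.
 * (Illustrative helper, not part of XMLReader itself.)
 */
function iterateProducts(XMLReader $reader, $nodeName)
{
    // Walk forward until the first matching element is reached.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $nodeName) {
            break;
        }
    }

    // Hand back the current fragment, then jump to the next element with
    // the same name, skipping the subtree that was just processed.
    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $nodeName) {
        yield new SimpleXMLElement($reader->readOuterXml());
        if (!$reader->next($nodeName)) {
            break;
        }
    }
}

$reader = new XMLReader();
$reader->XML($xml); // for a real feed, $reader->open($url) would be used instead

$names = array();
foreach (iterateProducts($reader, 'product') as $product) {
    // Process product; here only the name is collected.
    $names[] = (string) $product->name;
}
$reader->close();
```

Each concrete reader class could then supply its own element name and map
the fragment to the standardised view.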

As for the compressed feeds, I believe you can use the compression stream
wrappers to handle this:
http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php
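
A minimal sketch of the idea, assuming the zlib extension is loaded and the
.gz file is on local disk: prefix the path with compress.zlib:// and
XMLReader reads the decompressed stream as it goes. The temporary file and
its contents below are invented purely so the example runs on its own.

```php
<?php
// Create a small gzipped feed to read back (stand-in for a real .gz feed).
$path = sys_get_temp_dir() . '/feed_' . uniqid() . '.xml.gz';
$feed = '<products><product><id>1</id></product><product><id>2</id></product></products>';
file_put_contents($path, gzencode($feed));

$reader = new XMLReader();
// The compress.zlib:// wrapper decompresses on the fly, so the archive
// never has to be extracted to disk first.
if (!$reader->open('compress.zlib://' . $path)) {
    throw new RuntimeException('Unable to open ' . $path);
}

$count = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $count++; // a real importer would process the product here
    }
}
$reader->close();
unlink($path);
```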

Adam

-- 
Nephtali:  A simple, flexible, fast, and security-focused PHP framework
http://nephtaliproject.com