On Mon, Sep 26, 2011 at 12:24 PM, Richard Quadling <rquadling@xxxxxxxxx> wrote:
> Hi.
>
> I've got a project which will be needing to iterate some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
>
> The XML files have a root node and then a collection of products. In
> total, in all the files, there are going to be several million product
> details. Each XML feed will have a different structure as it relates
> to a different source of data.
>
> I plan to have an abstract reader class with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and has the ability to return a standardised view of
> the data for importing into mysql and eventually MongoDB.
>
> I want to use an XML iterator so that I can say something along the lines
> of ...
>
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML getting back one node at a time without keeping
> all the nodes in memory.
>
> e.g.
>
> <?php
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach($o_XML as $o_Product) {
>     // Process product.
> }
>
>
> Add to this that some of the xml feeds come .gz, I want to be able to
> stream the XML out of the .gz file without having to extract the
> entire file first.
>
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
>
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
>
>
> In this instance, the memory limitation is important. The current code
> is string based and whilst it works, you can imagine the complexity of
> it.
>
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.
>
> Thanks.
>
> Richard.
>
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>

I believe XMLReader allows you to pull the document node by node, and
it's really easy to work with:

http://www.php.net/manual/en/intro.xmlreader.php

In terms of dealing with the various forms of compression, I believe you
can use the compression stream wrappers to handle this:

http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php

Adam

--
Nephtali: A simple, flexible, fast, and security-focused PHP framework
http://nephtaliproject.com
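
For illustration, here is a minimal sketch of how XMLReader and the
compress.zlib:// wrapper might be combined so that only one product is
held in memory at a time. It has not been run against the real feeds.
The function name iterateElements, the element name 'product', and the
file path in the usage comment are all assumptions, since the actual
feed structures are not known yet.

```php
<?php
// Sketch only: iterateElements() is a hypothetical helper, and the
// element name passed to it is an assumption about the feed layout.

/**
 * Yield each <$elementName> element of an XML document as a
 * SimpleXMLElement, one element at a time.
 */
function iterateElements(string $uri, string $elementName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }

    try {
        // Walk forward until the first matching element (or end of file).
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->name === $elementName) {
                break;
            }
        }

        // Hand back one element per iteration, then skip over its subtree.
        while ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === $elementName) {
            // readOuterXml() returns just this element's markup as a string.
            yield new SimpleXMLElement($reader->readOuterXml());

            // next($name) moves on without descending into the subtree.
            if (!$reader->next($elementName)) {
                break;
            }
        }
    } finally {
        $reader->close();
    }
}

// Possible usage with a gzipped local file (path is hypothetical):
//
// foreach (iterateElements('compress.zlib:///tmp/feed.xml.gz', 'product') as $product) {
//     // Process product.
// }
```

Each concrete reader class could wrap a loop like this and map the
SimpleXMLElement to the standardised view. A plain path or URL can be
passed for uncompressed feeds. Whether the wrapper can be layered over a
remote URL depends on the PHP build, so that is worth testing once the
feeds are available.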