Sequential access of XML nodes.


 



Hi.

I've got a project that needs to iterate over some very large XML
files (around 250 files, ranging in size from about 50MB to several
hundred MB; 2 of them are in excess of 500MB).

The XML files have a root node and then a collection of products. In
total, in all the files, there are going to be several million product
details. Each XML feed will have a different structure as it relates
to a different source of data.

I plan to have an abstract reader class, with each concrete class
extending it to cover the specifics of the format being received and
to return a standardised view of the data for importing into MySQL
and eventually MongoDB.

I want to use an XML iterator so that I can say something along the lines of ...

1 - Instantiate the XML iterator with the XML's URL.
2 - Iterate the XML getting back one node at a time without keeping
all the nodes in memory.

e.g.

<?php
$o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
foreach($o_XML as $o_Product) {
 // Process product.
}
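
A minimal sketch of how such an iterator might look, assuming PHP's
built-in XMLReader and SimpleXML extensions are available. The class
name StreamingElementReader and the element name 'product' are
hypothetical, since the real feed structures are not known yet. It
pulls one matching element at a time and hands it back as a
SimpleXMLElement, so only the current product is held in memory.

```php
<?php
// Hypothetical class: streams matching elements one at a time with XMLReader.
class StreamingElementReader implements IteratorAggregate
{
    private $uri;
    private $elementName;

    public function __construct($uri, $elementName)
    {
        $this->uri = $uri;                 // file path or URL of the feed
        $this->elementName = $elementName; // assumed name of the repeating element
    }

    public function getIterator(): Traversable
    {
        $reader = new XMLReader();
        if (!$reader->open($this->uri)) {
            throw new RuntimeException("Unable to open {$this->uri}");
        }
        try {
            $more = $reader->read();
            while ($more) {
                if ($reader->nodeType === XMLReader::ELEMENT
                    && $reader->name === $this->elementName) {
                    // Parse just this element's markup into a small object.
                    yield new SimpleXMLElement($reader->readOuterXml());
                    // next() skips the subtree that was just returned.
                    $more = $reader->next();
                } else {
                    $more = $reader->read();
                }
            }
        } finally {
            $reader->close();
        }
    }
}
```

Usage would then be close to the loop above, e.g.
foreach (new StreamingElementReader('http://www.site.com/data.xml', 'product') as $o_Product) { ... }.
Feeds that rely on namespace prefixes may need different handling
than readOuterXml(), so treat this as a starting point only.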


On top of this, some of the XML feeds arrive gzip-compressed (.gz),
and I want to be able to stream the XML out of the .gz file without
having to extract the entire file first.
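
One way this might be approached is PHP's compress.zlib:// stream
wrapper, which decompresses as the data is read. The sketch below
assumes the zlib extension is loaded; the function name
countElementsInGzip is hypothetical and simply counts matching
elements to show that the compressed file is read as a stream.

```php
<?php
// Hypothetical helper: counts elements in a gzip-compressed XML file
// without extracting it to disk first.
function countElementsInGzip($gzPath, $elementName)
{
    $reader = new XMLReader();
    // The compress.zlib:// wrapper decompresses on the fly as XMLReader reads.
    if (!$reader->open('compress.zlib://' . $gzPath)) {
        throw new RuntimeException("Unable to open $gzPath");
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === $elementName) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}
```

Whether the same wrapper behaves well in front of a remote URL for
feeds of this size is something I would want to test rather than
assume.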

I've not got access to the XML feeds yet (they are coming from the
various affiliate networks around, and I'm a remote user so need to
get credentials and the like).

If you have any pointers on the capabilities of the various XML reader
classes, based upon this scenario, then I'd be very grateful.


In this instance, the memory limitation is important. The current code
is string based and whilst it works, you can imagine the complexity of
it.

The structure of each product internally will be different, but I will
be happy to get back a nested array or an XML fragment, as long as the
iterator is only holding onto 1 array/fragment at a time and not
caching the massive number of products per file.

Thanks.

Richard.


-- 
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


