On Fri, Dec 31, 2004 at 01:27:49PM +0100, Farkas Levente wrote:
> Daniel Veillard wrote:
> >  Parsing the XML file and building the associated Python objects.
> >
> >And before bashing XML and the cost of parsing, it's only a very small
> >fraction of the time spent, building the Python strings and objects is
> >the really costly part as we found with seth when doing basic tests.
> >My own test led me to believe that Python string interning (take a
> >string from the C layer or XML and get the copy from Python's own string
> >implementation) is extremely costly, and of course we are manipulating
> >a very large number of strings when collecting the repodata.
>
> have you already made some real measurement?

  Of what? Yes, I know exactly how long it takes libxml2 to parse the data:

[root@localhost ~]# xmllint --stream --timing /var/cache/yum/base/primary.xml.gz
Parsing took 1094 ms

using the reader at the C level. This includes decompressing the archive
and walking through all nodes. The main cost is turning the parsed data
into Python's internal representation, as I said.

> then wouldn't it be useful to
> implement that small portion in C? or isn't it such a small part?

  The string interning is in the Python lib, probably in C, as it's a C API
as far as I can tell. And no, I didn't look at the Python internal code.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@xxxxxxxxxx  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
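
For anyone wanting to try the comparison discussed above (raw parsing versus
building Python strings and objects), the following is a rough sketch only,
not the test that was actually run. It assumes Python's bundled expat parser
rather than libxml2, so absolute numbers will not match the xmllint figure.
The synthetic document and its element names are made up for illustration
and are not the real primary.xml layout.

```python
import time
import xml.parsers.expat


def make_sample(n_packages):
    """Build a synthetic XML document (hypothetical layout, not real repodata)."""
    parts = ["<metadata>"]
    for i in range(n_packages):
        parts.append(
            '<package type="rpm"><name>pkg%d</name>'
            "<arch>i386</arch><summary>Sample package %d</summary></package>" % (i, i)
        )
    parts.append("</metadata>")
    return "".join(parts).encode("utf-8")


def parse_only(data):
    """Parse with no handlers set, so no Python callbacks run per node."""
    parser = xml.parsers.expat.ParserCreate()
    start = time.perf_counter()
    parser.Parse(data, True)
    return time.perf_counter() - start


def parse_and_build(data):
    """Parse with handlers that turn every name, attribute and text run
    into Python strings and collect them in a list."""
    collected = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: collected.append((name, attrs))
    parser.CharacterDataHandler = collected.append
    start = time.perf_counter()
    parser.Parse(data, True)
    return time.perf_counter() - start, collected


if __name__ == "__main__":
    sample = make_sample(50000)
    t_parse = parse_only(sample)
    t_build, objects = parse_and_build(sample)
    print("parse only:        %.1f ms" % (t_parse * 1000))
    print("parse and build:   %.1f ms (%d objects)" % (t_build * 1000, len(objects)))
```

The gap between the two timings gives a rough idea of how much of the total
is spent creating Python objects rather than parsing. It says nothing
specific about string interning itself, which would need a separate test.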