Here's my idea. If you've already thought of this and rejected it, I'd
love to know why.

1. Switch yum to use -C by default, and introduce a different flag to
   signal a cache update. Day-to-day operations can then run against a
   pregenerated hash and a prebuilt cache.

2. Build a serializable hash structure, or reuse an existing mechanism.
   I don't care whether it's stored in XML or db4 or just a plain text
   file, as long as it's fast to read in and write out. This alone turns
   N fopen/fread/fclose calls into a single set. It won't do much for
   memory usage, though.

3. Change the yum cron job to use the new flag (the "-!C" above, for
   want of a real name) so that the hash index gets regenerated nightly.

(Rough Python sketches of item 2, and of the sync and metadata questions
below, are in the P.S. at the end of this mail.)

As I don't think this would be too difficult, I'll try generating some
patches against HEAD, though as a non-Python programmer the result may
look a little Perl-ish.

Joseph

seth vidal wrote:
>> First idea: I remember that hashes are fast to search but,
>> comparatively, very slow to grow. To overcome this, most hash
>> libraries let you set an initial size, best guessed large enough to
>> accommodate all the entries, to avoid frequent, time-consuming
>> resizes while filling. Does your package offer this feature?
>
> Have you programmed in Python before? A simple Python dict is what I'm
> talking about. You can build up that sort of dict and traverse it, but
> you still have to:
>
> open the package
> get the data you want
> put the data in the dict
> close the package
>
> Doesn't sound too bad - but the process of opening and looking through
> a package does take some time.
>
>> Second idea: you mentioned package traversal as time-consuming. Is
>> this time spent opening each package as a DB, grabbing the info, and
>> closing it? If so, have you considered building a cache of package
>> contents that can be updated and reused in subsequent runs, to take
>> advantage of the fact that most (if not all) packages do not change
>> between yum runs?
>
> In this case it's opening up each header, getting the data and moving
> along, but yes, it can take some time to search each one.
>
> Where do you store that cache? How do you store it? How do you update
> it to make sure it's not out of sync with the repository w/o
> reindexing all the headers/packages? Feel free to answer any/all of
> those questions.
>
> Some of these have already been addressed - many of them are why I
> spent so much time working on the xml-metadata, to sort out
> easier/faster/better ways of indexing the packages so yum can:
>
> 1. know if there are changes
> 2. more easily traverse the packages and the metadata
> 3. have smaller amounts of data to download and sort through on any run.
>
> Right now I'm making those changes work; then I'm going to focus on
> trimming time out of each session. It will still be some time b/c I'm
> working on this as I can.
>
> If you want to be a big help, don't look at speedups for the 2.0.X
> branch. I don't want to spend more time on 2.0.X if at all possible.
> A lot of things in the structure have changed, and cvs-HEAD is where
> I'm trying to work the most. When I have a snapshot that does some
> useful things I'll be sure to announce it here and on yum-devel.
>
> If you're a Python programmer and you're familiar with libxml2, then
> take a look at http://linux.duke.edu/metadata/generate/ - feel free to
> make that code:
>
> 1. look for an existing repodata dir
> 2. if it finds one, use the xml files there to speed up the creation
>    of the updated metadata for that repository.
> -sv
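
P.S. Since I'm offering patches anyway, here's roughly what I have in
mind for item 2, sketched in Python (so flag anything too Perl-ish).
The directory layout, cache keys, and tag choices are illustrative
assumptions, not yum's actual internals:

import os
import cPickle
import rpm

def read_header(ts, path):
    # open the package, pull the header, close the package -
    # exactly the per-package cost we want to pay only once
    fd = os.open(path, os.O_RDONLY)
    try:
        return ts.hdrFromFdno(fd)
    finally:
        os.close(fd)

def build_cache(pkgdir):
    ts = rpm.TransactionSet()
    ts.setVSFlags(rpm._RPMVSF_NOSIGNATURES)  # don't choke on unsigned pkgs
    cache = {}
    for fn in os.listdir(pkgdir):
        if not fn.endswith('.rpm'):
            continue
        path = os.path.join(pkgdir, fn)
        st = os.stat(path)
        hdr = read_header(ts, path)
        cache[fn] = {
            'stamp':   (st.st_size, st.st_mtime),  # for staleness checks
            'name':    hdr[rpm.RPMTAG_NAME],
            'epoch':   hdr[rpm.RPMTAG_EPOCH],
            'version': hdr[rpm.RPMTAG_VERSION],
            'release': hdr[rpm.RPMTAG_RELEASE],
            'arch':    hdr[rpm.RPMTAG_ARCH],
        }
    return cache

def save_cache(cache, cachefile):
    f = open(cachefile, 'wb')
    cPickle.dump(cache, f, 1)   # one binary write instead of N
    f.close()

def load_cache(cachefile):
    f = open(cachefile, 'rb')   # the single fopen/fread/fclose set
    cache = cPickle.load(f)
    f.close()
    return cache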
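For your out-of-sync question, the cheapest answer I can see: stamp each
entry with (size, mtime) and re-open a header only when the stamp
changes. Checksums would be sturdier; this is the minimal version, and
it reuses read_header() and the entry layout from the sketch above:

def refresh_cache(pkgdir, old):
    ts = rpm.TransactionSet()
    ts.setVSFlags(rpm._RPMVSF_NOSIGNATURES)
    fresh = {}
    for fn in os.listdir(pkgdir):
        if not fn.endswith('.rpm'):
            continue
        path = os.path.join(pkgdir, fn)
        st = os.stat(path)
        stamp = (st.st_size, st.st_mtime)
        entry = old.get(fn)
        if entry is not None and entry['stamp'] == stamp:
            fresh[fn] = entry        # unchanged: no header open at all
            continue
        hdr = read_header(ts, path)  # new or changed: re-read this one
        fresh[fn] = {
            'stamp':   stamp,
            'name':    hdr[rpm.RPMTAG_NAME],
            'epoch':   hdr[rpm.RPMTAG_EPOCH],
            'version': hdr[rpm.RPMTAG_VERSION],
            'release': hdr[rpm.RPMTAG_RELEASE],
            'arch':    hdr[rpm.RPMTAG_ARCH],
        }
    return fresh  # packages removed from the repo simply drop out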
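And a first stab at the two things you asked for in the metadata
generator: index the old primary.xml by location href and file
timestamp, so unchanged packages never get re-opened. I'm assuming an
uncompressed primary.xml sitting in repodata/ and using the libxml2
bindings you mentioned; the function name and dict shape are mine:

import os
import libxml2

MD_NS = 'http://linux.duke.edu/metadata/common'

def index_old_primary(repodir):
    # Map location href -> (recorded file time, serialized <package> node)
    path = os.path.join(repodir, 'repodata', 'primary.xml')
    if not os.path.exists(path):
        return {}            # no old metadata: fall back to a full run
    doc = libxml2.parseFile(path)
    ctxt = doc.xpathNewContext()
    ctxt.xpathRegisterNs('md', MD_NS)
    old = {}
    for pkg in ctxt.xpathEval('//md:package'):
        ctxt.setContextNode(pkg)
        href = ctxt.xpathEval('string(md:location/@href)')
        ftime = ctxt.xpathEval('string(md:time/@file)')
        old[href] = (ftime, pkg.serialize())
    ctxt.xpathFreeContext()
    doc.freeDoc()
    return old

The generator could then stat each package and, whenever the recorded
file time still matches the file's mtime, write the saved fragment back
out instead of opening the rpm at all.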