On Mon, 17 Aug 2009, James Antill wrote:
Dimitrios Apostolou <jimis@xxxxxxx> writes:
Hello list,
I have been using fedora on various machines, many of which are fairly
old, so I'm constantly trying to remove unnecessary fat and make
things speedier. Unfortunately when the basic package manager is slow
things aren't looking too good.
Running only "yum help" on an 800MHz PC with fedora 11 needs about
2.2s. Running "yum check-update" takes more than 20s to return an
empty list.
[...]
Perhaps I shouldn't even mention how slow yum (an old version) is on
an old SPARCstation 5 running Aurora Linux. Operations take hours and
the machine is constantly swapping. It is the most important obstacle
to using that distro on such machinery.
If that's your way of asking if we'll help you with patches to make
yum faster, then yes we will ... up to a point.
It should not surprise you that hardware from 6 to over 10 years ago
isn't going to be what most people are developing or testing with.
I mentioned all that just to point out that /yum is slow/ concerning
responsiveness in general, even if it's not something you can feel on
modern machines that most people (including me) use for development.
Perhaps I should have skipped that intro...
So I've been doing some profiling on yum.
As far as "yum help" is concerned, I haven't reached any important
conclusions. Most time is consumed in ini-parsing, URL parsing and
python module initialisations.
It'd be nice to have some numbers. But I can confirm that on a
modernish machine "yum list yum" seems to take roughly a second, and
python init and ini-parsing are significant parts of that.
I am using the "Run Snake Run" program
( http://www.vrplumber.com/programming/runsnakerun/ ) which provides a
nice graphical representation of the profile so I did not pay proper
attention to raw numbers. I'll try to post numbers in the future.
I can also send screenshots. :-)
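For anyone who wants raw numbers rather than a picture, here is a minimal sketch of collecting them with the standard cProfile and pstats modules. The parse_ini_lines() function is a hypothetical stand-in workload, not yum code; the point is only the profiling harness around it.

```python
import cProfile
import io
import pstats


def parse_ini_lines(lines):
    """Hypothetical stand-in workload: split 'key = value' lines into a dict."""
    result = {}
    for line in lines:
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


lines = ["option%d = value%d" % (i, i) for i in range(1000)]
parsed, report = profile_call(parse_ini_lines, lines)
print(report)
```

The same Stats object can also be saved with dump_stats() to a file, which graphical viewers that read cProfile dumps should be able to open.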
Really way too much diverse stuff to
try and improve something.
*shrug*, that's mostly what performance work is.
What I meant is that for the "yum help" case profiling didn't show a
specific bottleneck inside a function. Just that a great number of
different functions were called with no single one being a hotspot.
FYI functions to look into are
getReposFromConfig@xxxxxxxxxxx and readStartupConfig@xxxxxxxxx and
object initialisations (__init__.py?) in general.
As far as check-update goes, _buildPkgObjList@xxxxxxxxxxxxx takes by
far the most time. The current way it works is by doing one query to
sqlite returning all packages, and then manually parsing the result
for excludes and converting it to python objects, all done with
repetitive python code.
True.
Is there a reason for not using a proper SQL query for returning all
packages needed, excluding excludes?
A few reasons, but are you sure you need to try that? If you just
stop the package creation, does that help? -- ie. have simplePkgList()
return the pkgtups without creating package objects first?
Yes package creation together with excludes was a major slowdown for the
check-update case. My patch reduced runtime from 20-30s (depending on the
updates available) to 12s. I'm sure package objects are needed almost
everywhere in yum but they cost.
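To make the two approaches concrete, here is a minimal sketch against an in-memory database. The table is a cut-down, made-up version of the packages table and the function names are hypothetical; it only shows that a single name glob can be applied either in Python after fetching everything or inside the query itself, and it does not cover the full exclude semantics.

```python
import fnmatch
import sqlite3

COLUMNS = "name, arch, epoch, version, release"


def make_db():
    """Build a tiny in-memory stand-in for a primary.sqlite packages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, name TEXT, "
        "arch TEXT, epoch TEXT, version TEXT, release TEXT)"
    )
    rows = [
        ("yum", "noarch", "0", "3.2.23", "8"),
        ("python", "i586", "0", "2.6", "9"),
        ("rpm-python", "i586", "0", "4.7.0", "1"),
        ("bash", "i586", "0", "4.0", "6"),
    ]
    conn.executemany(
        "INSERT INTO packages (%s) VALUES (?, ?, ?, ?, ?)" % COLUMNS, rows
    )
    return conn


def tuples_filtered_in_python(conn, pattern):
    """Fetch every row, then drop excluded names with Python-side globbing."""
    cur = conn.execute("SELECT %s FROM packages" % COLUMNS)
    return sorted(row for row in cur if not fnmatch.fnmatchcase(row[0], pattern))


def tuples_filtered_in_sql(conn, pattern):
    """Let sqlite apply the glob, so excluded rows never reach Python."""
    cur = conn.execute(
        "SELECT %s FROM packages WHERE name NOT GLOB ?" % COLUMNS, (pattern,)
    )
    return sorted(cur.fetchall())


conn = make_db()
print(tuples_filtered_in_sql(conn, "*python*"))
```

Both functions return plain (n,a,e,v,r) tuples, so neither pays for package object creation.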
I can see the following comment:
# Note: If we are building the pkgobjlist, we don't exclude
# here, so that we can un-exclude later on ... if that matters.
Does that matter?
No, that comment needs to die. See the comment a couple of lines down
from it.
If we really take advantage of sqlite and build a query returning
exactly what we want, then why do we need to build separate python
PackageObject list?
I attach a patch which greatly reduces the time needed for check-update
by not populating the YumSqlitePackageSack objects and by calculating
updates using only the (n,a,e,v,r) list returned. _buildPkgObjList is
not even used. For this simple case it works so it makes me wonder...
What do you think? Is this preliminary patch in the right direction?
What do you propose for improving speed even further but not breaking
existing functionality?
Don't create returnPackageTuples() and change
PackageSack.simplePkgList(), just override simplePkgList() for
YumSqlitePackageSack().
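A rough sketch of what that suggestion could look like. The classes below (Package, BaseSack, SqliteSack) are hypothetical stand-ins, not the real yum classes; only the method name simplePkgList() and the pkgtup layout come from the discussion above.

```python
import sqlite3

SELECT = "SELECT name, arch, epoch, version, release FROM packages ORDER BY pkgKey"


class Package:
    """Hypothetical package object; building one costs more than a tuple."""

    def __init__(self, name, arch, epoch, version, release):
        self.name = name
        self.pkgtup = (name, arch, epoch, version, release)


class BaseSack:
    """Generic sack: derives the tuple list from full package objects."""

    def __init__(self):
        self.objects_built = 0

    def returnPackages(self):
        raise NotImplementedError

    def simplePkgList(self):
        return [po.pkgtup for po in self.returnPackages()]


class SqliteSack(BaseSack):
    """Sqlite-backed sack that overrides simplePkgList() to skip the objects."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn

    def returnPackages(self):
        pkgs = [Package(*row) for row in self.conn.execute(SELECT)]
        self.objects_built += len(pkgs)
        return pkgs

    def simplePkgList(self):
        # Read the tuples straight from the database, no Package() calls.
        return [tuple(row) for row in self.conn.execute(SELECT)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, name TEXT, "
    "arch TEXT, epoch TEXT, version TEXT, release TEXT)"
)
conn.executemany(
    "INSERT INTO packages (name, arch, epoch, version, release) "
    "VALUES (?, ?, ?, ?, ?)",
    [("yum", "noarch", "0", "3.2.23", "8"), ("bash", "i586", "0", "4.0", "6")],
)
sack = SqliteSack(conn)
print(sack.simplePkgList())
```

Callers of simplePkgList() see the same result either way, so the base class API stays untouched and no new method is needed.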
You are of course right, thanks. I'll try to provide a patch soon.
The patch (and later versions) are incomplete, you are only
implementing include.match and exclude.match from the excluder API.
You don't implement the matching properly, as you are running the
GLOBs only on package names.
You don't implement include.match properly, the traditional behaviour
is that a package has to pass _both_ "includepkgs" and "exclude" not
either.
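The "both, not either" rule can be sketched like this. It is an illustration only: the helper names are hypothetical, and the set of strings each glob is tried against is an assumption made for the example, not a copy of what yum actually matches.

```python
import fnmatch


def candidate_strings(pkgtup):
    """Spellings of a package a glob might be tried against (assumed set)."""
    n, a, e, v, r = pkgtup
    return [
        n,
        "%s.%s" % (n, a),
        "%s-%s" % (n, v),
        "%s-%s-%s" % (n, v, r),
        "%s-%s-%s.%s" % (n, v, r, a),
    ]


def matches_any(pkgtup, patterns):
    """True if any pattern matches any spelling of the package."""
    names = candidate_strings(pkgtup)
    return any(fnmatch.fnmatchcase(s, p) for p in patterns for s in names)


def is_visible(pkgtup, includepkgs, excludes):
    """A package must pass includepkgs (when set) AND not hit any exclude."""
    if includepkgs and not matches_any(pkgtup, includepkgs):
        return False
    return not matches_any(pkgtup, excludes)


python_pkg = ("python", "i586", "0", "2.6", "9")
bash_pkg = ("bash", "i586", "0", "4.0", "6")
print(is_visible(python_pkg, ["py*"], ["*.i586"]))
```

Here "python" passes the include list but is still dropped, because the exclude glob matches its name.arch spelling, which a name-only match would miss.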
I have been struggling to understand the internals of yum so anything you
point out is useful, thanks for all the tips.
That's fine as a proof of concept, but you didn't mark the patches as
being that.
Sorry for not being clear with that, I'll try to make it clear: My
patches are a proof of concept, I am sure they break a lot of stuff and I
don't expect them to be incorporated in yum (at least without many
changes). I just expected to raise some discussion regarding performance.
The concept I'm trying to prove is moving more logic to SQL and reducing
python iterative code wherever possible. Since you chose to use a database
backend I think it's sensible to try to avoid python-level caching of
package objects and just use the implicit caching done by sqlite.
I doubt you've tried many exclusions, as I'm pretty sure sqlite will
fail (which is why we have the limits like PATTERNS_INDEXED_MAX).
For simple excludes like '*python*' it works but you are right, I haven't
tried many others.
You can't alter the .sqlite files as you've done in the last version
of your patch ... ie. temporary tables can't be used.
I think that .sqlite files are not being written at all, after all I have
been testing yum as non-root.
You've not given any results:
1. How long did the old SQL query take.
2. How long does the new SQL query take.
3. How long does the python pkgExcluder code take.
4. What is 2 vs. 3 for small/large exclusions.
The following measurements directly from sqlite should answer most. Sorry
for not having numbers from python right now:
$ sqlite3 /var/cache/yum/fedora/35d817e2bac701525fa72cec57387a2e3457bf32642adeee1e345cc180044c86-primary.sqlite
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .timer on
sqlite> create temp table excludedIds (pkgId text);
CPU Time: user 0.002999 sys 0.004999
sqlite> insert into excludedIds select pkgId from packages where name glob '*python*';
CPU Time: user 0.113982 sys 0.161976
sqlite> select count(*) from packages;
13289
CPU Time: user 0.003000 sys 0.018997
sqlite> select count(*) from packages where pkgid not in (select pkgid from excludedIds);
12825
CPU Time: user 0.211967 sys 0.149977
And for large exclusions:
sqlite> insert into excludedIds select pkgId from packages where name glob '*p*';
CPU Time: user 0.188971 sys 0.155977
sqlite> select count(*) from excludedIds;
6425
CPU Time: user 0.001000 sys 0.001000
sqlite> select count(*) from packages where pkgid not in (select pkgid from excludedIds);
7328
CPU Time: user 0.413937 sys 0.167974
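The same measurement can be sketched from Python with the sqlite3 module. This runs against a synthetic in-memory table (13000 made-up rows, one in ten named "python-mod..."), so the timings are not comparable with the ones above; the build_db() and timed() helpers are hypothetical.

```python
import sqlite3
import time


def build_db(n):
    """Synthetic stand-in for the packages table: every 10th name has 'python'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT, name TEXT)"
    )
    rows = (
        ("id%d" % i, ("python-mod%d" if i % 10 == 0 else "lib%d") % i)
        for i in range(n)
    )
    conn.executemany("INSERT INTO packages (pkgId, name) VALUES (?, ?)", rows)
    return conn


def timed(conn, sql, params=()):
    """Execute one statement and return (rows, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


conn = build_db(13000)
conn.execute("CREATE TEMP TABLE excludedIds (pkgId TEXT)")
_, t_insert = timed(
    conn,
    "INSERT INTO excludedIds SELECT pkgId FROM packages WHERE name GLOB ?",
    ("*python*",),
)
rows, _ = timed(conn, "SELECT count(*) FROM excludedIds")
excluded = rows[0][0]
rows, t_select = timed(
    conn,
    "SELECT count(*) FROM packages "
    "WHERE pkgId NOT IN (SELECT pkgId FROM excludedIds)",
)
kept = rows[0][0]
print("excluded=%d kept=%d insert=%.4fs select=%.4fs"
      % (excluded, kept, t_insert, t_select))
```

SQLite keeps TEMP tables in a separate temporary database rather than in the main database file, so creating one should not modify the repository's .sqlite file.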
About pkgExcluder, I am positive that it was an important slowdown inside
_packageByKeyData() called from _buildPkgObjList(). I attach a profile I
found, created with python line_profiler module.
...and as I said above, it'd be nice to know how much time is taken up
with just "package object creation" as against the select + python
exclude.
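One way to isolate the object creation cost is to time it separately from everything else. A minimal sketch with timeit follows; the Package class is a hypothetical, much lighter stand-in for a real yum package object, and the row count simply mirrors the 22080 hits in the attached profile.

```python
import timeit


class Package:
    """Hypothetical, very light stand-in for a package object."""

    def __init__(self, row):
        self.name, self.arch, self.epoch, self.version, self.release = row
        self.pkgtup = row


# Synthetic rows in (n, a, e, v, r) order.
ROWS = [("pkg%d" % i, "noarch", "0", "1.0", "1") for i in range(22080)]


def keep_tuples(rows):
    """Baseline: just hand the tuples on."""
    return list(rows)


def build_objects(rows):
    """Same rows, but wrapped in one object each."""
    return [Package(row) for row in rows]


t_tuples = timeit.timeit(lambda: keep_tuples(ROWS), number=5)
t_objects = timeit.timeit(lambda: build_objects(ROWS), number=5)
print("tuples: %.4fs  objects: %.4fs" % (t_tuples, t_objects))
```

The difference between the two timings is the part attributable to object construction alone, which can then be compared with the query and exclude timings.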
Also check-updates isn't the best thing to measure, as it's not that
simple (requiring all pkg data to be loaded) and apparently doesn't
require much more than the pkgtups for most of the data (maybe that's
true for update/install/etc. in general though).
I had also measured "yum update" performance where dependency resolving
was by far the most expensive part (resolveDeps() and _checkFileRequires()
in depsolve.py). I didn't mention it because I couldn't come up with a
patch; the way resolveDeps() works was way too complex for me. So I
decided to try a simpler case, that of "check-update", which
unfortunately turned out to be not that simple after all.
You might want to come onto IRC #yum on FreeNode to talk to us
tomorrow.
I'll try to be there.
Thanks for your help,
Dimitris
P.S. What do you think about rpmsack performance? Have you seen the
other mail I sent with questions regarding its performance?
--
James Antill -- james@xxxxxxx
_______________________________________________
Yum mailing list
Yum@xxxxxxxxxxxxxxxxx
http://lists.baseurl.org/mailman/listinfo/yum
Timer unit: 1e-06 s

File: /home/jimis/dist/src/yum-git/yum/yum/sqlitesack.py
Function: _packageByKeyData at line 713
Total time: 3.14929 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   713                                           @profile
   714                                           def _packageByKeyData(self, repo, pkgKey, data, exclude=True):
   715                                               """ Like _packageByKey() but we already have the data for .pc() """
   716     22080       730805     33.1     23.2      if exclude and self._pkgExcludedRKD(repo, pkgKey, data):
   717                                                   return None
   718     22080        77707      3.5      2.5      if repo not in self._key2pkg:
   719         4           13      3.2      0.0          self._key2pkg[repo] = {}
   720         4           14      3.5      0.0          self._pkgname2pkgkeys[repo] = {}
   721     22080        89619      4.1      2.8      if data['pkgKey'] not in self._key2pkg.get(repo, {}):
   722     22080      1864888     84.5     59.2          po = self.pc(repo, data)
   723     22080        98254      4.4      3.1          self._key2pkg[repo][pkgKey] = po
   724     22080        69708      3.2      2.2          self._pkgtup2pkgs.setdefault(po.pkgtup, []).append(po)
   725     22080       102270      4.6      3.2          pkgkeys = self._pkgname2pkgkeys[repo].setdefault(data['name'], [])
   726     22080        39156      1.8      1.2          pkgkeys.append(pkgKey)
   727     22080        76858      3.5      2.4      return self._key2pkg[repo][data['pkgKey']]