Re: kde.git is now online

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 5 Apr 2007 18:24:12 -0700 (PDT)

On Thu, 5 Apr 2007, Nicolas Pitre wrote:
> 
> Well.... still it certainly can be helped a bit.  I wouldn't mind it 
> spending half an hour of CPU if it needs to.  But I just interrupted it
> with ^C with the following result so far:
> 
> real    75m44.374s
> user    2m5.318s
> sys     0m54.059s

Well, the thing is, this is "normal", and doesn't really have a lot to do 
with git.

If the actual working set is larger than available memory, ~5% CPU time is 
actually pretty good. 

The only way to improve on it is to try to make the working set smaller. 
Sadly, that's often a really difficult thing to do ;(

> > I suspect you'll find that with 1GB or RAM you'll have other 
> > performance problems that are more pressing ("git clone" comes to mind 
> > ;)
> 
> Well... same issue actually.  git-pack-objects spent about 40 secs 
> firmly at 100% CPU usage counting objects.
> 
> Then it got stuck on:
> 
> 	remote: Done counting 4111366 objects.
> 
> again spending 3% CPU and the rest waiting for IO with the disk 
> definitely trashing.

Well, I seriously doubt it's the "same issue" except in the sense that 
yes, if you work with all objects, you are going to have a big working 
set.

Note that "working set" is different from "memory footprint". If you have 
good locality, the working set can be a *lot* smaller than the memory 
footprint, and that tends to be the best/only way to improve the working 
set: trying to not jump back-and-forth between different things.

One example of that kind of shrinkage of the working set was Junios commit 
57584d9eddc3482c5db0308203b9df50dc62109c to "git blame": by comparing the 
*pointers* rather than what they pointed to, you avoid having to follow 
the pointer all the way down.

However, doing that in general tends to be very difficult. We use hashes 
extensively (not just the obvious SHA1 hashes, but the object lookup 
itself is based on hash tables etc), and while they are nice and fast O(1) 
when you have enough memory, they do tend to spread things out so that you 
are using your memory potentially very sparsely, which is the last thing 
you want to do if you are paging.

Side note: I finally got the thing downloaded, and so I did a

	git checkout -f

and the trace is pretty horrid. It looks something like this:

	...
	lstat("kdeaccessibility/IconThemes/mono/scalable/apps/kimagemapeditor.svgz", 0x7fff6f8d29f0) = -1 ENOENT (No such file or directory)
	mkdir("kdeaccessibility", 0777)         = -1 EEXIST (File exists)
	unlink("kdeaccessibility")              = -1 EISDIR (Is a directory)
	stat("kdeaccessibility", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
	mkdir("kdeaccessibility/IconThemes", 0777) = -1 EEXIST (File exists)
	unlink("kdeaccessibility/IconThemes")   = -1 EISDIR (Is a directory)
	stat("kdeaccessibility/IconThemes", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
	mkdir("kdeaccessibility/IconThemes/mono", 0777) = -1 EEXIST (File exists)
	unlink("kdeaccessibility/IconThemes/mono") = -1 EISDIR (Is a directory)
	stat("kdeaccessibility/IconThemes/mono", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
	mkdir("kdeaccessibility/IconThemes/mono/scalable", 0777) = -1 EEXIST (File exists)
	unlink("kdeaccessibility/IconThemes/mono/scalable") = -1 EISDIR (Is a directory)
	stat("kdeaccessibility/IconThemes/mono/scalable", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
	mkdir("kdeaccessibility/IconThemes/mono/scalable/apps", 0777) = -1 EEXIST (File exists)
	unlink("kdeaccessibility/IconThemes/mono/scalable/apps") = -1 EISDIR (Is a directory)
	stat("kdeaccessibility/IconThemes/mono/scalable/apps", {st_mode=S_IFDIR|0775, st_size=12288, ...}) = 0
	open("kdeaccessibility/IconThemes/mono/scalable/apps/kimagemapeditor.svgz", O_WRONLY|O_CREAT|O_EXCL, 0666) = 5
	write(5, "\37\213\10\10\205\3\263A\0\3kimagemapeditor.svg\0\344Z"..., 10112) = 10112
	close(5)                                = 0
	lstat("kdeaccessibility/IconThemes/mono/scalable/apps/kimagemapeditor.svgz", {st_mode=S_IFREG|0664, st_size=10112, ...}) = 0
	...

and that repeats for every single file. There's 233,902 of them. Oops.

On the other hand, we do certain things pretty well.  A "git diff", with
enough memory, takes 0.65s.  That's just over *half*a*second* for 233
*thousand* files.  I'd want to have tons of memory to work with this
repository, but if I did, I'd still think git is the best thing since
sliced bread. 

And doing ops like "git blame" on some random file I looked at was
actually instantaneous.  I probably happened to pick a new file just by
luck, but still..  Most things definitely work pretty damn well. 

(Update: I did a

	git log --raw -r |
		grep '^:100644.*M' |
		cut -f2 |
		sort |
		uniq -c |
		sort -n

to see the file that was updated the most, to get some kind of
worst-case for "git blame".  The list looks like:

   ...
   1091 koffice/kword/kwview.cc
   1099 kdelibs/khtml/khtml_part.cpp
   1116 koffice/kpresenter/kpresenter_view.cc
   1171 kdevelop/ChangeLog
   1667 kde-common/accounts

and while "git blame" is slow on them, it's not *painfully* so.  It took
13s to get the kdevelop/ChangeLog blame, and 31s (probably because the
diffs are much more interesting) to get the kpresenter_view.cc blame. 
Too slow, but still usable, and "git gui" again made it more interesting
to wait for it.. 

That said, the more I look at this, the more I think that this is *the*
perfect example of why you shouldn't put everything in one big
repository.  Git should be able to handle it, but nobody should really
do things like that. It's just stupid.

I will think hard about submodules.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html