Re: Switching from CVS to GIT

Andreas Ericsson <ae@xxxxxx> · Tue, 16 Oct 2007 07:14:56 +0200

Eli Zaretskii wrote:
Date: Mon, 15 Oct 2007 20:45:02 -0400 (EDT)
From: Daniel Barkalow <barkalow@xxxxxxxxxxxx>
cc: Alex Riesen <raa.lkml@xxxxxxxxx>, Johannes.Schindelin@xxxxxx, ae@xxxxxx, 
    tsuna@xxxxxxxxxxxxx, git@xxxxxxxxxxxxxxx, make-w32@xxxxxxx

I believe the hassle is that readdir doesn't necessarily report a README in 
a directory which is supposed to have a README, when it has a readme 
instead.

Sorry I'm asking potentially stupid questions out of ignorance: why
would you want readdir to return `README' when you have `readme'?

Because it might have been checked in as README, and since git is case
sensitive that is what it'll think should be there when it reads the
directories. If it's not, users get to see

	removed: README
	untracked: readme

and there's really no easy way out of this one, since users on a case-
sensitive filesystem might be involved in this project too, so it
could be an intentional rename, but we don't know for sure. Just
clobbering the in-git file is wrong, but overwriting a file on disk
is wrong too. git tries hard to not ever lose any data for the user.

- no acceptable level of performance in filesystem and VFS (readdir,
  stat, open and read/write are annoyingly slow)
With what libraries?  Native `stat' and `readdir' are quite fast.
Perhaps you mean the ported glibc (libgw32c), where `readdir' is
indeed painfully slow, but then you don't need to use it.
We want getting stat info, using readdir to figure out what files exist, 
for 106083 files in 1603 directories with a hot cache to take under 1s; 
otherwise "git status" takes a noticeable amount of time with a medium-big 
project, and we want people to be able to get info on what's changed 
effectively instantly. My impression is that Windows' native stat and 
readdir are plenty fast for what normal Windows programs want, but we 
actually expect reasonable performance on an unreasonably-big 
metadata-heavy input.

If that's the issue, then it's not a good idea to call `stat' and
`readdir' on Windows at all.  `stat' is a single system call on Posix
systems, while on Windows it usually needs to go out of its way
calling half a dozen system services to gather the `struct stat' info.
You need to call something like FindFirstFile, which can do the job of
`stat' and `readdir' together (and of `fnmatch', if you need to filter
only some files) in one go.  I don't know whether this will scan 100K
files under one second (maybe I will try it one of these days), but it
will definitely be faster than `readdir'+`stat' by maybe as much as an
order of magnitude.

To be honest though, there are so many places which do the readdir+stat
that I don't think it'd be worth factoring it out, especially since it
*works* on windows. It's just slow, and only slow compared to various
unices. I *think* (correct me if I'm wrong) that git is still faster
than a whole bunch of other scm's on windows, but to one who's used to
its performance on Linux that waiting several seconds to scan 10k files
just feels wrong.

We also expect to be able to make a sequence of file system operations 
such that programs starting at any time see the same database as the files 
containing the database get restructured.

Sorry, I don't understand this; please tell more about the operations,
``the same database'' issue (what database?)

The object database, located under .git/objects.

and what do you mean by
``the files containing the database get restructured''.

/* I'm on a limb here. Nicolas Pitre knows the git packfile format, so
* perhaps he'll be kind enough to correct me if I'm wrong */

The mmap() stuff is primarily convenient when reading huge packfiles. As
far as I understand it, they're ordered by some sort of delta similarity
score, so mmap()'ing 100MiB or so of a certain packfile will most likely
mean we have a couple of thousand "connected" revisions in memory. That
database gets sort of restructured as the memory-chunk that's mmap()'ed
get moved to read in the next couple of thousand revisions.

In all honesty, this doesn't matter much for already fully packed projects
unless they're significantly larger than the Linux kernel, since git is so
amazingly good at compressing large repos to a small size. Linux is ~180
MiB fully packed, and most developer's systems could just read() that
entire packfile into memory without much problem. But then again, no-one's
ever had problems supporting the "normal" cases.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html