Re: 'git status' is not read-only fs friendly

On Sat, 10 Feb 2007, Nicolas Pitre wrote:
> > >
> > > Because git-status itself is conceptually a read-only operation, and 
> > > having it barf on a read-only file system is justifiably a bug.
> > 
> > I do not 100% agree that it is conceptually a read-only operation.
> 
> It is.

It really isn't. 

It's not even a "technical issue". It's a fundamental optimization. Sure, 
you can call optimizations just "technical issues", but the fact is, it's 
one of the things that makes git so _usable_ on large archives. At some 
point, an "optimization" is no longer just about making things slightly 
faster, it's about something much bigger, and has real semantic meaning.

So the fact is, "git status" _needs_ to refresh the index. Because if it 
doesn't, you'll see every file that doesn't match the index as "dirty", 
and that is not just a "technical issue".
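
If you want to see what "not refreshing" means in practice, something like 
this should show it with the plumbing (assuming a clean tree; the 
"git read-tree HEAD" is only there to throw away the cached stat data):

	git read-tree HEAD          # index now has no stat data cached
	git diff-files | wc -l      # every tracked file shows up as modified
	git update-index --refresh  # re-stat (and re-hash where needed)
	git diff-files | wc -l      # back to the real answer: 0 on a clean tree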

And yes, doing an "internal" refresh, like Junio's patch does, hides the 
issue, but it hides it BY MAKING THE OPTIMIZATION POINTLESS!

I suspect Marco is testing some reasonably small git archive. With 
something like git itself, with less than a thousand files (and most of 
them fairly small, so rehashing them all is quick), the optimization may 
_feel_ like just a small technical detail.

Now, try the same thing on the Linux kernel or something similar, 
especially with cold caches or without a huge amount of memory.

That "technical issue" is what makes "git status" take less than a second 
for me, and only a bit longer if things aren't cached - because we don't 
actually have to read all the file data. 
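
(As an aside, if your git-ls-files knows the --debug flag, you can dump the 
cached stat data that makes this possible - it's just ctime/mtime, device, 
inode, size and a few flags per file:)

	git ls-files --debug -- Makefile    # "Makefile" is just an example path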

Now, it so happens that _if_ things are cached, at least under Linux, 
cached IO is so _incredibly_ fast that you won't even realize how 
expensive an operation you missed. I can SHA1 every file in the kernel 
archive (21432 files right now - 8 million LOC, and 230MB of data) in less 
than a couple of seconds. But that's only because it's all cached for me 
anyway, because I tend to run with lots of RAM, and I do things like "git 
grep so-and-so" which brings it all into cache.

But try the same thing without caches.

Here's something you can do under linux:

	sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
	git read-tree HEAD
	time git update-index --refresh

and it takes me *40* seconds. That's with quite a fast disk too - it would 
take a whole lot longer on a laptop.

Then, try it _without_ having to actually read all files, because the 
index is already up-to-date:

	sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
	time git update-index --refresh

and now it takes *4* seconds. That's because it didn't actually need to 
read any file data, it could just do the stats.

Then, cached:

	# bring it all in again
	git grep something-or-other

	# invalidate the index cache
	git read-tree HEAD
	time git update-index --refresh

and I can do it in under *2* seconds - because Linux is just damn good at 
cached IO, so I can read all those 21-thousand files and 235MB of data 
from the kernel cache in less than a second.

But finally, do it with caches _and_ the index in place:

	time git update-index --refresh

and it now takes 0.06 seconds. It's what allows me to do "git diff" on the 
kernel tree in a tenth of a second.

THIS IS NOT "JUST A TECHNICAL ISSUE". 

When the difference is 40 seconds vs 4 (uncached), or 2 seconds vs 0.06, 
it's not about "just an optimization" any more. At that point, it's about 
"unusable vs usable".

And yeah, waiting 40 seconds for a global "diff" for a big project may be 
something that a person coming from CVS considers to be just par for the 
course. Maybe I'm just unreasonable. But I think it's a _bug_ if I can't 
get a small diff in about a tenth of a second. It needs to be so fast that 
I never even _think_ about it.

And the index is what makes it so. 

And that's why it's important to keep the index up-to-date. If we have 
operations that allow the index to just *stay* non-coherent, like the 
suggested "git runstatus --refresh" that doesn't actually write it back, 
then that's a *bad* thing. 

I think it would be much better if "git status" always wrote the refreshed 
index file. It could then choose to ignore any errors if they happen, 
because if you have a broken setup like the NTFS read-only thing, then 
tough, it's broken, but git can't do anything about it. But people should 
be aware that yes, "git status" absolutely _needs_ to write the index 
file. 
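
(If you want to see the failure mode being argued about without digging out 
an actual read-only filesystem, taking write permission away from the .git 
directory should be enough, since the index is written through a lock file 
in there. A rough way to try it - put the permission back afterwards:)

	chmod a-w .git    # the index lock file can no longer be created
	git status        # the refresh can't be written back; the question is
	                  # whether this barfs or just runs a bit slower next time
	chmod u+w .git    # put things back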

It is *not* a read-only operation. The index is too important to be 
considered "just a technical issue". 

		Linus