Re: [PATCH 01/02/RFC] implement a stat cache

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 20 Apr 2008 09:03:13 -0700 (PDT)

On Sun, 20 Apr 2008, Luciano Rocha wrote:
> 
> That's a lot. Why not use a stat cache?

Well, the thing is, the OS _does_ keep a stat cache for us, and the one 
that the OS maintains is a lot better, in that it works across processes 
and is coherent with other processes changing things.

And the thing is, your stat cache makes the *common* cases slower. I 
didn't do a whole lot of testing, but on my machine, doing just a "git 
status" with and without your stat cache shows

	Current git 'master':
		real    0m0.302s
		real    0m0.308s
		real    0m0.314s

	With your patch:
		real    0m0.352s
		real    0m0.354s
		real    0m0.355s

IOW, it slowed down the case that I think matters more (the one you're 
*supposed* to use, and that people most commonly do use) by 15%.
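
As a rough check of that 15% figure, here is a minimal Python sketch that 
averages the two sets of "real" times quoted above and reports the relative 
slowdown. The helper name slowdown_percent is made up for illustration; only 
the six timings come from the runs above.

```python
from statistics import mean

# Wall-clock "real" times in seconds, copied from the runs quoted above.
master = [0.302, 0.308, 0.314]
patched = [0.352, 0.354, 0.355]


def slowdown_percent(before, after):
    """Relative increase of the mean run time, in percent (hypothetical helper)."""
    return (mean(after) / mean(before) - 1.0) * 100.0


print(f"{slowdown_percent(master, patched):.1f}% slower")
```

With only three runs per side this is a ballpark number, not a benchmark.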

Now, admittedly, I also do think that we should generally optimize the 
slow cases more than we should care about things that are already very 
fast, so I do not think that it's wrong to say "ok, let's make the really 
fast case a bit slower, in order to not be so slow in the bad case", so in 
that sense I do not think the slowdown is disastrous.

BUT. 

I really dislike adding a cache that is there just because we do something 
stupid. We can fix the over-abundance of lstat() calls by just being 
smarter. And the smarter we are, the less the cache will help, and the 
more it will hurt. Which is the real reason why I think the cache is a 
really really bad idea: it optimizes for the wrong kind of behavior.
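
To make "being smarter" concrete, here is a minimal sketch (in Python, not 
git's actual C code) of the difference between checks that each stat the 
path again and checks that share one lstat() result. Every name in it 
(counted_lstat, is_regular_naive, size_naive, is_regular, size) is 
hypothetical and exists only to count calls.

```python
import os
import stat
import tempfile

lstat_calls = 0


def counted_lstat(path):
    """os.lstat wrapper that counts calls (illustration only)."""
    global lstat_calls
    lstat_calls += 1
    return os.lstat(path)


# Redundant style: every check stats the same path again.
def is_regular_naive(path):
    return stat.S_ISREG(counted_lstat(path).st_mode)


def size_naive(path):
    return counted_lstat(path).st_size


# Smarter style: stat once, hand the result to every check.
def is_regular(st):
    return stat.S_ISREG(st.st_mode)


def size(st):
    return st.st_size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "wb") as f:
        f.write(b"hello")

    lstat_calls = 0
    is_regular_naive(path)
    size_naive(path)
    naive_calls = lstat_calls   # one lstat per check

    lstat_calls = 0
    st = counted_lstat(path)
    is_regular(st)
    size(st)
    shared_calls = lstat_calls  # one lstat in total

print(naive_calls, shared_calls)
```

Once the callers share the result like this, a cache in front of lstat() has 
nothing left to save and only adds its own bookkeeping cost.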

So we have other caches and hashes we use, like the index itself, or the 
name lookup hash into the index, or the delta cache. Maintaining those 
caches takes some effort too, but those caches aren't there because we're 
doing something stupid, they are there because they allow us to do 
something smart.

For example, the index itself actually has really important semantic 
characteristics. And while the name hashing actually improves on index 
lookup performance, I'd never have implemented it if it wasn't for the 
fact that it was also designed to allow us to do case-insensitive lookups. 
And the delta cache is not hiding stupidity, it's literally avoiding very 
expensive work that we can't avoid by being smarter.
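
A toy illustration of why a name hash can buy case-insensitive lookup and not 
just speed. This is only a sketch under simplifying assumptions, not git's 
actual name hash: the NameIndex class and its lookup method are hypothetical, 
and Python's str.casefold() stands in for whatever folding a real 
implementation would use.

```python
from collections import defaultdict


class NameIndex:
    """Hypothetical path index keyed by exact and case-folded names."""

    def __init__(self, paths):
        self.exact = set(paths)
        # Case-folded name -> every stored path that folds to it.
        self.folded = defaultdict(list)
        for p in paths:
            self.folded[p.casefold()].append(p)

    def lookup(self, name, ignore_case=False):
        """Return matching stored paths; the exact match wins if present."""
        if name in self.exact:
            return [name]
        if ignore_case:
            return list(self.folded.get(name.casefold(), []))
        return []


idx = NameIndex(["Makefile", "README", "src/main.c"])
print(idx.lookup("makefile"))                    # exact lookup only
print(idx.lookup("makefile", ignore_case=True))  # folded lookup
```

The same structure serves both purposes: the hash makes lookups cheap, and 
the folded key is what makes the case-insensitive question answerable at all.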

So the stat cache is not horribly bad, but I think it's the wrong path to 
go down. 

> With these changes, my git status . in WebKit changes from 28.215s to
> 15.414s.

Of course, one reason I don't think it's such a great idea is that on 
Linux, your stat cache doesn't end up helping _nearly_ as much as it does 
on OS X. You see an almost 50% improvement, so the 15% *deprovement* may 
not sound like much to you. But under Linux, the numbers are quite 
different:

"git status ." with your patch:

	real    0m1.043s
	real    0m1.009s
	real    0m0.972s

With my trivial patch that just removed 2 of the 9 lstat calls:

	real    0m1.116s
	real    0m1.115s
	real    0m1.119s

IOW, it does help the "." case on Linux, but only by a fairly small 
amount. In fact, the improvement seems slightly smaller than the 
performance degradation (~12% vs ~15%), but that is probably within the 
margin of noise, so...
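
To get a feel for how wide that margin of noise is, here is a minimal sketch 
that computes the improvement from the mean times and also from the most and 
least favourable pairing of individual runs. The helper improvement_percent 
and the variable names are hypothetical; the timings are the six quoted 
above.

```python
from statistics import mean

# "git status ." wall-clock times in seconds, copied from the runs above.
stat_cache = [1.043, 1.009, 0.972]    # with the stat cache patch
fewer_lstats = [1.116, 1.115, 1.119]  # with 2 of the 9 lstat calls removed


def improvement_percent(baseline, improved):
    """Relative reduction in run time, in percent (hypothetical helper)."""
    return (1.0 - improved / baseline) * 100.0


typical = improvement_percent(mean(fewer_lstats), mean(stat_cache))
best = improvement_percent(max(fewer_lstats), min(stat_cache))
worst = improvement_percent(min(fewer_lstats), max(stat_cache))

print(f"typical {typical:.1f}%, range {worst:.1f}% to {best:.1f}%")
```

On three runs per side the mean-based figure comes out at roughly 10%, and 
the pairwise extremes span several percentage points either way, so a gap 
like ~12% vs ~15% cannot be told apart from noise with this little data.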

So another reason to avoid the stat cache is that it's really just working 
around an OS X deficiency.

I'd rather work at avoiding more lstat calls. I know we can do it.

			Linus