On Fri, Feb 17, 2012 at 06:19:06PM +0100, Piotr Krukowiecki wrote:

> "git update-index --refresh" with dropped cache took
> real    0m3.726s
> user    0m0.024s
> sys     0m0.404s
> [...]
> The diff-index after dropping cache takes
> real    0m14.095s
> user    0m0.268s
> sys     0m0.564s

OK, that suggests to me that the real culprit is the I/O we spend in
accessing the object db, since that is the main I/O that happens in the
second command but not the first.

> > Mostly reading (we keep a sorted index and access the packfiles via
> > mmap, so we only touch the pages we need). But you're also paying to
> > lstat() the directory tree, too. And you're paying to load (probably)
> > the whole index into memory, although it's relatively compact compared
> > to the actual file data.
>
> If the index is the objects/pack/*.idx files then it's 21MB

Yes, that's it. Though we don't necessarily read the whole thing. The
sorted list of sha1s is only a part of it, and we mmap and binary-search
it, so we only have to fault in the pages that are actually touched by
the binary search. However, we're faulting in random pages of the index
in series, so it may actually have a lot of latency.

You can see how expensive the I/O on the index is with something like
this:

  [whole operation, for reference]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time git diff-index HEAD
  real    0m2.636s
  user    0m0.248s
  sys     0m0.392s

  [prime the cache with just the index]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time cat .git/objects/pack/*.idx >/dev/null
  real    0m0.288s
  user    0m0.000s
  sys     0m0.028s
  $ time git diff-index HEAD
  real    0m2.175s
  user    0m0.272s
  sys     0m0.320s

So roughly 20% of the I/O time in my case went to faulting in the index.

You could pre-fault the index, which would give the OS a chance to do
read-ahead caching; you can see that the combined "cat" and second
"diff-index" times above are still lower than the raw "diff-index" time.
You could also do them in parallel. That will create some additional
seeking as the threads fight for the disk, but it may be a win in the
long run because we can read bigger chunks at a time. You can roughly
simulate it by running the "cat" and the "diff-index" above at the same
time (e.g., "cat .git/objects/pack/*.idx >/dev/null & git diff-index
HEAD"). I get:

  real    0m2.464s
  user    0m0.284s
  sys     0m0.372s

which is almost exactly the same as doing them separately (though note
that this is on an SSD, so seeking is very cheap).

But the bulk of the time still goes to actually retrieving the object
data, so that's probably a more interesting area to focus on, anyway
(and if we can reduce object accesses, we reduce their lookups, too :) ).

> If I understand correctly, you only need to compute sha1 on the
> workdir files and compare it with sha1 files recorded in index/gitdir.
> It seems that to get the sha1 from index/gitdir I need to read the
> packfiles? Maybe it'd be possible to cache/index it somehow, for
> example in a separate and smaller file?

There are two diffs going on in "git status". One is a comparison
between the index and the worktree. In that one, you need to lstat()
each file to make sure the cached sha1 we have in the index is still up
to date. Assuming it is, you don't need to touch the file data at all.
Then you compare that sha1 to the stage 0 sha1 (i.e., what we typically
think of as "staged for commit"). If they match, you don't need to do
any more work.

But the expensive diff-index we've been doing above compares the index
to the HEAD tree, and doing that is a little trickier. The index is a
flat list of files with their sha1s, but the HEAD tree is stored
hierarchically. So to get the sha1 of foo/bar/baz, we have to access the
root tree object, find the "foo" entry, access its tree object, find the
"bar" entry, access its tree object, and then find the "baz" entry. Then
we compare the sha1 of the "baz" entry to what's in the index.
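You can watch that walk happen by hand with ls-tree (a rough
illustration; "foo/bar/baz" is just the made-up path from above, so
substitute a real path from your repository):

  [each step reads one more tree object from the object db]
  $ git ls-tree HEAD                 # root tree; find the "foo" entry
  $ git ls-tree HEAD:foo             # foo's tree; find the "bar" entry
  $ git ls-tree HEAD:foo/bar         # bar's tree; find the "baz" entry

  [and then the two sha1s being compared]
  $ git rev-parse HEAD:foo/bar/baz   # sha1 of baz in HEAD
  $ git ls-files -s foo/bar/baz      # stage 0 sha1 in the index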
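(The real diff does that walk once over the whole tree rather than once
per path, but each distinct tree object still has to be read from the
object db.)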
So that's where your I/O comes from: accessing each of the tree objects.
And the reason it isn't simply "compare the HEAD and index sha1s" is
that the index is stored as a flat list of files.

That being said, we do have an index extension to store the tree sha1 of
whole directories (i.e., we populate it when we write a whole tree or
subtree into the index from the object db, and it becomes invalidated
when a file is modified). This optimization is used by things like "git
commit" to avoid having to recreate the same sub-trees over and over
when creating tree objects from the index. But we could also use it here
to avoid having to read the sub-tree objects from the object db at all.

> No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it
> reads on the cover :)
>
> But IMO faster disk won't help with this - times will be smaller, but
> you'll still have to read the same data, so the subdir times will be
> just 2x faster than whole repo, won't it? So maybe in my case it will
> go down to e.g. 2s on subdir, but for someone with a larger repository
> it will still be 10s...

Sure. But a certain amount of I/O is going to be unavoidable to answer
your question, so you will never be able to match the warm-cache case.
I'm not saying we can't improve (e.g., I think the index extension thing
I mentioned above is a promising approach). But we have to be realistic
about what will make things faster; if I/O is your problem, a faster
disk is one possible solution (and because some of this cost is seek
latency, an SSD is a nice improvement for cold-cache times in
particular).

-Peff