Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 17, 2012 at 06:19:06PM +0100, Piotr Krukowiecki wrote:

> "git update-index --refresh" with dropped cache took
> real	0m3.726s
> user	0m0.024s
> sys	0m0.404s
> [...]
> The diff-index after dropping cache takes
> real	0m14.095s
> user	0m0.268s
> sys	0m0.564s

OK, that suggests to me that the real culprit is the I/O we spend in
accessing the object db, since that is the main I/O that happens in the
second command but not the first.
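
If you want to confirm that, you can watch which files each command
actually opens. Something like this should work (an untested sketch; on
some systems you may need "trace=open,openat" to catch everything):

  $ strace -f -e trace=open -o trace.refresh git update-index --refresh
  $ strace -f -e trace=open -o trace.diff git diff-index HEAD
  $ grep -c objects/ trace.refresh trace.diff

The refresh trace should show essentially no object db access, while
the diff-index trace should show it opening the pack .idx files (and
any loose objects it needs).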

> > Mostly reading (we keep a sorted index and access the packfiles via
> > mmap, so we only touch the pages we need). But you're also paying to
> > lstat() the directory tree, too. And you're paying to load (probably)
> > the whole index into memory, although it's relatively compact compared
> > to the actual file data.
> 
> If the index is the objects/pack/*.idx files then it's 21MB

Yes, that's it. Though we don't necessarily read the whole thing. The
sorted list of sha1s is only a part of that. And we mmap and
binary-search that, so we only have to fault in pages that are actually
used in our binary search.
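
If you're curious what that sorted list looks like, you can dump it
with show-index (this just concatenates the output for each pack you
have):

  $ for i in .git/objects/pack/*.idx; do git show-index <"$i"; done | head

Each line is a pack offset plus the object sha1 (and a crc for v2
indexes); the entries come out in sha1 order, which is what makes the
binary search possible.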

However, we're faulting in random pages of the index in series, so it
may actually have a lot of latency. You can see how expensive the I/O on
the index is with something like this:

  [whole operation, for reference]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time git diff-index HEAD
  real    0m2.636s
  user    0m0.248s
  sys     0m0.392s

  [prime the cache with just the index]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time cat .git/objects/pack/*.idx >/dev/null
  real    0m0.288s
  user    0m0.000s
  sys     0m0.028s
  $ time git diff-index HEAD
  real    0m2.175s
  user    0m0.272s
  sys     0m0.320s

So roughly 20% of the I/O time in my case went to faulting in the
index. You could pre-fault the index, which would give the OS a chance
to do read-ahead caching; note that the combined cat and diff-index
times above still come in below the raw diff-index time. You could also
do them in parallel. That will create some additional seeks as the
threads fight for the disk, but it may be a win in the long run because
we can read bigger chunks. You can roughly simulate it by running the
"cat" and the "diff-index" above in parallel. I get:

  real    0m2.464s
  user    0m0.284s
  sys     0m0.372s

which is almost exactly the same as doing them separately (though note
that this is on an SSD, so seeking is very cheap).
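
Something like this reproduces that simulation (the exact invocation
isn't important; the point is just to kick off the cat and the
diff-index together):

  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ cat .git/objects/pack/*.idx >/dev/null &
  $ time git diff-index HEAD
  $ wait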

But the bulk of the time still goes to actually retrieving the object
data, so that's probably a more interesting area to focus on, anyway
(and if we can reduce object accesses, we reduce their lookups, too :) ).

> If I understand correctly, you only need to compute sha1 on the
> workdir files and compare it with sha1 files recorded in index/gitdir.
> It seems that to get the sha1 from index/gitdir I need to read the
> packfiles? Maybe it'd be possible to cache/index it somehow, for
> example in separate and smaller file?

There are two diffs going on in "git status". One is a comparison
between index and worktree. In that one, you need to lstat each file to
make sure the cached sha1 we have in the index is up to date. Assuming
it is, you don't need to touch the file data at all. Then you compare
that sha1 to the stage 0 sha1 (i.e., what we typically think of as
"staged for commit"). If they match, you don't need to do more work.

But the expensive diff-index we've been doing above is comparing the
index to the HEAD tree. And doing that is a little trickier. The index
is a flat list of files with their sha1s. But the HEAD tree is stored
hierarchically. So to get the sha1 of foo/bar/baz, we have to access the
root tree object, find the "foo" entry, access its tree object, find the
"bar" entry, access its tree object, and then find the "baz" entry. Then
we compare the sha1 of the "baz" entry to what's in the index.
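
You can walk that by hand with ls-tree to see the lookups involved
(foo/bar/baz is of course a made-up path; substitute something from
your own tree):

  $ git ls-tree HEAD foo           # root tree: the entry for "foo" (a tree object)
  $ git ls-tree HEAD foo/bar       # foo's tree: the entry for "bar"
  $ git ls-tree HEAD foo/bar/baz   # bar's tree: the blob entry for "baz"
  $ git ls-files -s foo/bar/baz    # the index entry we compare it against

Each extra level means reading another tree object from the object db,
which is where the I/O goes.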

So that's where your I/O comes from: accessing each of those tree
objects. And the reason it isn't just "compare the HEAD and index
sha1s" is that the index is stored as a flat list of files, not as a
hierarchy of trees.

That being said, we do have an index extension to store the tree sha1
of whole directories (i.e., we populate it when we write a whole tree
or subtree into the index from the object db, and it is invalidated
when a file inside that directory is modified). This optimization is
used by things like "git commit" to avoid recreating the same sub-trees
over and over when creating tree objects from the index. But we could
also use it here to avoid even having to read the sub-tree objects from
the object db.
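
A rough way to see that extension in action (this assumes an index you
do not mind resetting to HEAD):

  $ git read-tree HEAD    # writes the whole HEAD tree into the index, populating the extension
  $ time git write-tree   # can emit the cached tree sha1s without recreating any trees

If you then update some path in the index, only the trees leading down
to that path are invalidated, so a later write-tree has to recreate
just those.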

> No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it
> reads on the cover :)
> 
> But IMO faster disk won't help with this - times will be smaller, but
> you'll still have to read the same data, so the subdir times will be
> just 2x faster than whole repo, won't it? So maybe in my case it will
> go down to e.g. 2s on subdir, but for someone with larger repository
> it will still be 10s...

Sure. But a certain amount of I/O is going to be unavoidable to answer
your question, so you will never be able to match the warm-cache case.
I'm not saying we can't improve (e.g., I think the index extension I
mentioned above is a promising approach). But we have to be realistic
about what will make things faster; if I/O is your problem, a faster
disk is one possible solution (and because some of this cost is seeking
and latency, an SSD is a particularly nice improvement for cold-cache
times).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

