On Thu, Feb 16, 2012 at 8:20 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote:
>
>> >> $ time git status -- .
>> >> real    0m2.503s
>> >> user    0m0.160s
>> >> sys     0m0.096s
>> >>
>> >> $ time git status
>> >> real    0m9.663s
>> >> user    0m0.232s
>> >> sys     0m0.556s
>> >
>> > Did you drop caches here, too?
>>
>> Yes I did - with warm caches, status takes something like 0.1-0.3s on the whole repo.
>
> OK, then that makes sense. It's pretty much just I/O on the filesystem
> and on the object db.
>
> You can break status down a little more to see which is which. Try "git
> update-index --refresh" to see just how expensive the lstat and index
> handling is.

"git update-index --refresh" with dropped caches took

real    0m3.726s
user    0m0.024s
sys     0m0.404s

while "git status" with dropped caches takes

real    0m13.578s
user    0m0.240s
sys     0m0.600s

I'm not sure why it takes more than the 9s reported before - IIRC I ran
the previous test in single-user mode under a bare shell, and this time
I'm testing under GNOME. Either that, or it's an effect of running
update-index first :/ Status on a subdirectory now takes 9.5s, so the
rule that status on a subdirectory is not much faster still holds.

> And then try "git diff-index HEAD" for an idea of how expensive it is to
> just read the objects and compare to the index.

diff-index after dropping caches takes

real    0m14.095s
user    0m0.268s
sys     0m0.564s

>> > Not really. You're showing an I/O problem, and repacking is git's way of
>> > reducing I/O.
>>
>> So if I understand correctly, the reason is that git must compare
>> workspace files with packed objects - and the problem is
>> reading/seeking/searching in the packs?
>
> Mostly reading (we keep a sorted index and access the packfiles via
> mmap, so we only touch the pages we need). But you're also paying to
> lstat() the directory tree, too. And you're paying to load (probably)
> the whole index into memory, although it's relatively compact compared
> to the actual file data.
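The breakdown Jeff suggests can be reproduced in a throwaway repository; this is only a sketch of the measurement procedure (the repo and file names here are made up, and the cache-drop step from the thread is shown commented out because it requires root):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo one > a.txt
git add a.txt
git commit -q -m initial

# To get cold-cache numbers as in the thread, drop the page cache first
# (requires root); left commented out here:
#   sync && echo 3 > /proc/sys/vm/drop_caches

# Stat-only cost: refresh the cached lstat data recorded in the index.
time git update-index --refresh

# Object-reading cost: compare HEAD's trees/blobs against the index.
time git diff-index HEAD

# A file modified in the workdir shows up in diff-index output:
echo two > a.txt
git diff-index HEAD -- a.txt
```

On a large repo the interesting part is the difference between the two `time` results, which separates lstat/index overhead from object-database reads.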
If the index here means the objects/pack/*.idx files, then it's 21MB.

>> Is there a way to make packs better? I think most operations are on
>> workdir files - so maybe it'd be possible to tell gc/repack/whatever
>> to optimize access to files which I currently have in the workdir?
>
> It already does optimize for that case. If you can make it even better,
> I'm sure people would be happy to see the numbers.

If I understand correctly, you only need to compute the sha1 of the
workdir files and compare it with the sha1s recorded in the
index/gitdir. Does getting the sha1s from the index/gitdir really
require reading the packfiles? Maybe it'd be possible to cache/index
them somehow, for example in a separate, smaller file?

> Mostly I think it is just the case that disk I/O is slow, and the
> operation you're asking for has to do a certain amount of it. What kind
> of disk/filesystem are you pulling off of?
>
> It's not a fuse filesystem by any chance, is it? I have a repo on an
> encfs-mounted filesystem, and the lstat times are absolutely horrific.

No, it's ext4, and the disk is a Seagate Barracuda 7200.12 500GB, as it
reads on the cover :)

But IMO a faster disk won't help with this - the times will be smaller,
but you still have to read the same data, so status on a subdirectory
will still be only about 2x faster than on the whole repo, won't it? So
maybe in my case it will go down to e.g. 2s on a subdirectory, but for
someone with a larger repository it will still be 10s...

-- 
Piotr Krukowiecki
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
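As an aside on the sha1 comparison described in the message above: the blob sha1s are recorded directly in .git/index, and `git ls-files -s` prints them without touching the packfiles. A minimal sketch of that comparison, in a throwaway repo with an illustrative file name:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo hello > file.txt
git add file.txt

# sha1 recorded in the index (second field of "mode sha1 stage path"):
indexed=$(git ls-files -s file.txt | awk '{print $2}')

# sha1 of the current workdir content, computed without writing anything:
current=$(git hash-object file.txt)
[ "$indexed" = "$current" ] && echo unchanged

# After editing the file, the two differ - which is how a modification
# is detected once the lstat shortcut cannot rule a file out:
echo changed > file.txt
current=$(git hash-object file.txt)
[ "$indexed" != "$current" ] && echo modified
```

In practice status rarely hashes anything: the stat data cached in the index lets it skip files whose size/mtime are unchanged, which is why the lstat cost dominates.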