Re: Bad git status performance

Michael J Gruber <git@xxxxxxxxxxxxxxxxxxxx> · Fri, 21 Nov 2008 16:19:50 +0100

Jean-Luc Herren venit, vidit, dixit 21.11.2008 13:46:
> Glenn Griffin wrote:
>> On Thu, Nov 20, 2008 at 4:28 PM, Jean-Luc Herren <jlh@xxxxxx> wrote:
>>> The first 'git status' shows the same difference as the second,
>>> just the second time it's staged instead of unstaged.  Why does it
>>> take 16 seconds the second time when it's instant the first time?
>> I believe the two runs of git status need to do very different things.
>>  When run the first time, git knows the files in your working
>> directory are not in the index so it can easily say those files are
>> 'Changed but not updated' just from their existence.
> 
> I might be mistaken about how the index works, but those paths
> *are* in the index at that time.  They just have the old content,
> i.e. the same content as in HEAD.  When HEAD == index, then
> nothing is staged.
> 
> But the presence of those files alone doesn't tell you that they
> have changed.  You have to look at the content and compare it to
> the index (== HEAD in this situation) to see whether they have
> changed or not and for some reason git can do this very quickly.
> 
>> The second run
>> those files do exist in both the index and the working directory, so
>> git status first shows the files that are 'Changes to be committed'
>> and that should be fast, but additionally git status will check to see
>> if those files in your working directory have changed since you added
>> them to the index.
> 
> Which is basically the same comparison as above, just it turns
> out that they have not changed.  But even then, we're talking
> about comparing a 1 byte file in the index to a 1 byte file in the
> work tree.  That doesn't take 16 seconds, even for 100 files.
> 
> So this makes me believe it's the first step (comparing HEAD to
> the index to show staged changes) that is slow.  And when you
> compare a 1MB file to a 1 byte file, you don't need to read all of
> the big file, you can tell they're not the same right after the
> first byte.  (Even doing a stat() is enough, since the size is
> not the same.)
> 
> Another thing that came to my mind is maybe rename detection kicks
> in, even though no path vanished and none is new.  I believe
> rename detection doesn't happen for unstaged changes, which might
> explain the difference in speed.
> 
> btw, I forgot to mention that I get this with branches maint,
> master, next and pu.

Interestingly, all of

git diff --stat
git diff --stat --cached
git diff --stat HEAD

are "fast" (0.2s or so), i.e. diffing index-wtree, HEAD-index,
HEAD-wtree. Linus' threaded stat doesn't help either for status, btw (20s).

Experimenting further: Using 10 files of 10MB each (rather than 100
files of 1MB each) cuts the time by roughly a factor of 10, and so does
using 100 files of 100k each. Huh? The latter may be expected (10MB
total), but the former (100MB total)?
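
One back-of-the-envelope model that would fit these numbers, offered
purely as an assumption and not something read out of the code: if some
step compared every changed file against every other, and each
comparison cost time proportional to the file size, the work would
scale as count * count * size. The function name below is made up for
illustration.

```shell
# Hypothetical cost model (an assumption, not git's actual accounting):
# every file is compared against every other, and each comparison
# costs time proportional to the file size.
pairwise_cost() {
    count=$1
    size_kb=$2
    echo $((count * count * size_kb))
}

pairwise_cost 100 1024    # 100 files of 1MB each
pairwise_cost 10 10240    # 10 files of 10MB each
pairwise_cost 100 100     # 100 files of 100k each
```

Under that model the second and third cases both come out at roughly a
tenth of the first, which is what the timings show.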

Now it's getting funny: Changing your "echo >" to "echo >>" (in your
100 files of 1MB case) makes things "almost fast" again (1.3s).
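
For anyone who wants to repeat these variations, here is a sketch of a
helper that builds such a test repository. The original script is not
quoted in this mail, so the function name, the file names and the use
of random content are all assumptions; only the file count, the file
size and the "echo >" versus "echo >>" step come from the thread.

```shell
# make_test_repo DIR COUNT SIZE_KB MODE
# Hypothetical helper (all names here are assumptions). MODE is
# "truncate" for the "echo >" case or "append" for the "echo >>" case.
# The body runs in a subshell so the caller's directory is unchanged.
make_test_repo() (
    dir=$1 count=$2 size_kb=$3 mode=$4
    mkdir -p "$dir" && cd "$dir" || exit 1
    git init -q . || exit 1

    # Commit COUNT files of SIZE_KB kilobytes each. Random content is
    # an assumption; the thread does not say what the files contained.
    i=1
    while [ "$i" -le "$count" ]; do
        dd if=/dev/urandom of="file$i" bs=1024 count="$size_kb" \
            2>/dev/null || exit 1
        i=$((i + 1))
    done
    git add . || exit 1
    git -c user.name=test -c user.email=test@example.com \
        commit -q -m "initial" || exit 1

    # Change every file, then stage the change.
    i=1
    while [ "$i" -le "$count" ]; do
        case $mode in
            truncate) echo > "file$i" ;;
            append)   echo >> "file$i" ;;
            *)        exit 1 ;;
        esac
        i=$((i + 1))
    done
    git add .
)
```

Something like "make_test_repo /tmp/slow 100 1024 truncate" followed by
timing "git status" in that directory would then correspond to the slow
case, and "append" to the "almost fast" one.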

OK, it's "use the source, Luke" time... Actually the part you don't see
takes the most time:
wt_status_print_updated()

And in fact I can confirm your suspicion: wt_status_print_updated()
enforces rename detection (ignoring any config). Forcing it off
(rev.diffopt.detect_rename = 0;) cuts down the 20s to 0.75s.
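
The same comparison can probably be made from the command line without
patching anything, on the assumption that the staged half of status
behaves like "git diff --cached" with rename detection switched on. The
two wrapper names below are made up; -M and --no-renames are ordinary
diff options.

```shell
# Hypothetical wrappers; only the git options themselves are real.
# List staged changes with rename detection forced on (-M) ...
staged_with_renames() {
    git diff --cached --name-status -M
}

# ... and with rename detection forced off.
staged_without_renames() {
    git diff --cached --name-status --no-renames
}
```

Timing both in the 100 files of 1MB repository would show whether
rename detection alone accounts for the difference; nothing in this
thread establishes that the -M variant reproduces the full 20s.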

How about a config option status.renames (or something like -M) for status?

Michael