Re: git-status performance with submodules

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 01 Dec 2019 22:50:29 -0800

"D. Ben Knoble" <ben.knoble@xxxxxxxxx> writes:

> ### What I am curious about
>
> From the traces (attached), it appears that git-status suffers from a lack of
> (possibly embarrassing) parallelism: I would expect each submodule to be
> independently check-able, ...
> ...
> What can we do to fix this? Is there a reason for this (really terribly slow)
> serial execution? Is this something developers haven't bothered to optimize
> ("unexpected use case")? If so, I would like to discuss taking a crack at it,
> because I do have at least one repository with this many submodules, and I
> care about its performance.

Nice to hear from somebody who cares about improving submodule
support.  Offhand, I cannot think of a reason why we inherently have
to process them serially.

But the way the "git status" code is structured, parallelizing it
would probably take a bit of preparatory refactoring.  If I recall
correctly, it walks each path in the superproject's index and notes
how the file in the working tree differs from the index and from
HEAD, under the assumption that inspecting each path is relatively
cheap and costs about the same.  You'd first need to restructure that
part so that groups of index entries can be sharded out to separate
subprocesses while the parent process waits, have them report back to
the parent, and let the parent continue with the aggregated result,
or something like that.
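
To make that shape concrete, here is a rough sketch in Python (not
git's C, and not how "git status" is actually implemented).  The names
shard(), inspect_submodule() and parallel_status() are hypothetical,
and it assumes each submodule can be checked independently by running
"git -C <path> status --porcelain" in a child process:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def shard(entries, n):
    """Split entries into at most n round-robin groups (hypothetical helper)."""
    groups = [entries[i::n] for i in range(n)]
    return [g for g in groups if g]


def inspect_submodule(path):
    """Run a child git process and report whether the submodule is dirty."""
    out = subprocess.run(
        ["git", "-C", path, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return path, bool(out.strip())


def inspect_group(group, inspect):
    """Inspect every path in one shard, in order."""
    return [inspect(path) for path in group]


def parallel_status(paths, inspect=inspect_submodule, workers=4):
    """Shard paths, inspect the shards concurrently, aggregate in the parent."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread blocks on its own child processes; the
        # parent only collects the per-shard reports as they come back.
        shards = shard(paths, workers)
        for report in pool.map(lambda g: inspect_group(g, inspect), shards):
            results.update(report)
    return results
```

Passing a different "inspect" callable lets the sharding and
aggregation be exercised without touching a real repository.  The
round-robin split is only one choice; if some submodules are much
more expensive than others, a work queue would balance better.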

Thanks.