Re: [PATCH] git.txt: document limitations on non-typical repos (and hints)

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Thu, 7 Oct 2010 09:25:02 +0700

On Wed, Oct 6, 2010 at 11:32 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> +Performance concerns
>> +--------------------
>> +
>> +Git is written with performance in mind and it works extremely well
>> +with its typical repositories (i.e. source code repositories, with
>> +a moderate number of small text files, possibly with long history).
>> +Non-typical repositories (a lot of files, or very large files...)
>> +may experience mild performance degradation. This section describes
>> +how Git behaves in such repositories and how to reduce impact.
>> +
>
> I have seen this "mild" suggested in the discussion, but do we want any
> adjective here?  The runtime for, say, "git log" from the tip to the root
> obviously would grow proportionally to the length of the history, i.e. the
> number of records you would want to see, and it may not be "mild" if your
> history is very deep.  Same for the runtime for "git diff" in a wide
> project with many changed paths.

I don't want to give the impression that the sky will fall when someone
puts a 200MB file in their repo.

> More importantly, what is "degradation"?  It is not a degradation if "git
> log" took 100x as long for a project with 100k commits compared to a
> similar project with 1k commits.

From my perspective, git commands that are instant in typical repos
should still be instant in non-typical ones. Yes, "git add hugefile"
will take longer than "git add git.c", but that command should not
take, say, 1 hour. It's hard to draw a clear line here.

> If you do not have enough core to hold the part of the ancestry graph that
> is involved to compute "git log A..B" to show a gazillion commits, it will
> eat into the swap, take a lot more time than it takes "git log B" to show
> the same number of commits.  That _is_ degradation, and I suspect it won't
> be mild at all.
>
>> +For repositories with a large number of files (~50k files or more),
>
> How did you come up with this 50k number?

Quite unscientifically. I started with gentoo-x86 (~130k files), on
which I know git performs less than satisfactorily. I also looked at
how big other repos are (wine.git, linux-2.6.git...), then chose a
number in the middle.

>> +but you only need a few of them present in working tree, you can use
>> +sparse checkout (see linkgit:git-read-tree[1], section 'Sparse
>> +checkout').
>
> Is "sparse checkout" a real feature that has been made usable by mere
> mortals, battle tested, and shown to be reliable?

Hopefully. In the 2010 survey, there were 331 answers saying they use
"partial (sparse) checkout". I hope that they meant this feature, not
something else.

> It feels funny that we have to refer to the documentation of plumbing
> read-tree when the key verb in this paragraph is "checkout".  With the
> current documentation set, you can follow read-tree page that mentions
> some magic called skip-worktree-bit, get tempted to jump to update-index
> page and get lost in the implementation details of the feature, which is
> irrelevant to the end user.  If you resisted the temptation and keep
> reading read-tree page, you see the description of info/sparse-checkout to
> learn how to control the feature, but it does not come with an
> easy-to-follow example.  A few concrete suggestions to "Sparse checkout"
> section in read-tree:
>
> ...
>

Hmm.. yeah. Will do something.
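
Something along these lines might serve as the easy-to-follow example
asked for above. It is only a minimal sketch: the throwaway repository
and the docs/ and src/ paths are made up for illustration, and it relies
on nothing beyond core.sparseCheckout, $GIT_DIR/info/sparse-checkout and
"git read-tree -m -u":

```shell
#!/bin/sh
# Sketch: sparse checkout in a throwaway repository (all paths hypothetical).
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Two top-level directories; only docs/ should stay in the working tree.
mkdir docs src
echo "manual" > docs/readme.txt
echo "int main(void) { return 0; }" > src/main.c
git add docs src
git -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"

# Enable the feature and list the paths to keep, one pattern per line.
git config core.sparseCheckout true
mkdir -p .git/info
echo "docs/" > .git/info/sparse-checkout

# Re-read HEAD into the index and update the working tree to match
# the patterns; entries outside them get the skip-worktree bit.
git read-tree -m -u HEAD

# src/main.c is still tracked in the index, but is no longer on disk.
git ls-files
```

To get the full working tree back, the pattern file can be changed to a
single "/*" line and "git read-tree -m -u HEAD" run again.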

> I think the suggestion to use Sparse checkout in git(1)---i.e. your patch
> we are discussing here, is a bit premature before the above happens.
>
>> +... If you need all of them present in working tree, but you
>> +know in advance only a few of them may be modified, please consider
>> +using assume-unchanged bit (see linkgit:git-update-index[1]).
>> +... The following commands are
>> +however known to do full index refresh in some cases:
>
> It is "need to", not "are known to", isn't it?

In the case of "git commit", as you said in another mail, an index
refresh is needed because of the post-commit hook. If there are no
hooks, I think the index refresh can be skipped. But yes, probably
"need to".
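
For reference, the assume-unchanged hint from the patch could be
illustrated roughly like this. The file names are made up, and the
sketch only shows setting, inspecting and clearing the bit; it does not
measure any speedup:

```shell
#!/bin/sh
# Sketch: mark one path assume-unchanged in a throwaway repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "v1" > big-asset.bin   # hypothetical file that is rarely modified
echo "v1" > notes.txt
git add big-asset.bin notes.txt
git -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"

# Promise git that this path will not change, so it may skip checking it.
git update-index --assume-unchanged big-asset.bin

# "git ls-files -v" tags assume-unchanged entries with a lowercase letter.
flags=$(git ls-files -v)
echo "$flags"

# Clear the bit again before actually editing the file.
git update-index --no-assume-unchanged big-asset.bin
```

The bit is a promise made by the user, so changes to a marked file can
go unnoticed by git until the bit is cleared.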

>> +Some commands need entire file content in memory to process.
>> +Files that have size a significant portion of physical RAM may
>> +affect performance. You may want to avoid using the following
>> +commands if possible on such large files:
>
> "If possible" is not a good excuse.  How would one _avoid_ checkout of a
> file if one wants to use it?  You can't.  Similarly to "diff".  This
> advice is pretty much useless, isn't it?  It's not much better than saying
> "if your machine has too little RAM, things will get slow---deal with it".

That's more of a bug acknowledgement, or a list of to-be-improved
TODOs. I didn't want to say that out loud. Should I?
-- 
Duy