Re: [RFC PATCH] git.txt: document limitations on non-typical repos (and hints)

2010/10/5 Chris Packham <judge.packham@xxxxxxxxx>:
> On 05/10/10 06:00, Nguyễn Thái Ngọc Duy wrote:
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
>> ---
>>  I wanted to make a more detailed description, per command. It would
>>  serve as guidance for people on special repos, also as TODOs for Git
>>  developers. But that seems a lot of work on analyzing each commands.
>>
>>  Instead I made this text to warn users where performance may decrease,
>>  and to hint them features that might help. Do I miss anything?
>>
>>  There were discussions in the past on maintaining large files out-of-repo,
>>  and symlinks to them in-repo. That sounds like a good advice, doesn't it?
>>
>>  Documentation/git.txt |   46 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/git.txt b/Documentation/git.txt
>> index dd57bdc..8408923 100644
>> --- a/Documentation/git.txt
>> +++ b/Documentation/git.txt
>> @@ -729,6 +729,52 @@ The index is also capable of storing multiple entries (called "stages")
>>  for a given pathname.  These stages are used to hold the various
>>  unmerged version of a file when a merge is in progress.
>>
>> +Performance concerns
>> +--------------------
>> +
>> +Git is written with performance in mind and it works extremely well
>> +with its typical repositories (i.e. source code repositories, with
>> +a moderate number of small text files, possibly with long history).
>> +Non-typical repositories (huge number of files, or very large
>> +files...) may experience performance degradation. This section describes

Probably should have written "experience mild performance degradation"

>> +how Git behaves in such repositories and how to reduce impact.
>
> How huge is "huge" and how large is "large". From previous threads on
> this list I'm guessing "large" is files bigger than physical RAM. I've

A file does not have to exceed physical RAM: one that takes up a
significant portion of it is enough to start swapping. There's also
a hard limit imposed by mmap(): a file cannot be larger than the
available address space (2-3G on 32-bit x86, much larger on x86_64).
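
To check whether a working tree has files in that range, something
like the following may do. It is only a sketch: the 1 MiB threshold
and the file names are made up for the demo, and on a real machine
you would pick a threshold closer to a significant portion of RAM.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q .
printf 'small\n' > small.txt
# Create a 2 MiB file to stand in for a "large" one (demo assumption).
dd if=/dev/zero of=big.bin bs=1024 count=2048 2>/dev/null
# Skip .git itself; -size +1024k matches files larger than 1024 KiB.
large=$(find . -path ./.git -prune -o -type f -size +1024k -print)
echo "$large"
# Word size of this system; 32 means the tighter mmap() limit applies.
getconf LONG_BIT
```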

> not really run into a situation where a huge number of files causes
> performance problems.

gentoo-x86 has ~100k files. Cold cache time is definitely long. Even
with hot cache, a full cache refresh may take about half a second, if
I remember correctly. It depends on many factors. I don't think I can
draw a clear limit.

>
> Maybe there should be a distinction of where a user might see
> performance problems e.g. initial clone, subsequent fetches, commit,
> checkout or diff.
>
>> +
>> +For repositories with really long history, you may want to work on
>> +a shallow clone of it (see linkgit:git-clone[1], option '--depth').
>> +A shallow repository does not contain full history, so it may consume
>> +less disk space and network bandwidth. On the other hand, you cannot
>> +fetch from it. And obviously you cannot look further back than what
>> +it has in history (you can deepen history though).
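
For illustration, a rough sketch of cloning shallow and deepening
later. The repository, file and identity names are made up for the
demo. Note that --depth is ignored for plain local paths, hence the
file:// URL.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q origin
cd origin
for i in 1 2 3; do
    echo "$i" > file.txt
    git add file.txt
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $i"
done
cd ..
# Fetch only the most recent commit.
git clone -q --depth 1 "file://$work/origin" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "commits after shallow clone: $shallow_count"
# Deepen history by asking for more commits.
git -C shallow fetch -q --depth 3
deep_count=$(git -C shallow rev-list --count HEAD)
echo "commits after deepening: $deep_count"
```
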
>
> You might want to mention git clone --reference and the
> .git/objects/info/alternates for those concerned with disk usage.

Thanks
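
Something along these lines, I suppose. This is a sketch only: the
"upstream", "cache" and "borrower" names are invented for the demo.
The borrowing clone depends on the reference repository staying
around, since it does not copy the objects it borrows.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q upstream
cd upstream
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m initial
cd ..
# A first full clone that later clones will borrow objects from.
git clone -q "file://$work/upstream" cache
# Borrow objects from "cache" instead of storing them again.
git clone -q --reference "$work/cache" "file://$work/upstream" borrower
# The borrowed object store is recorded here.
alternates=$(cat borrower/.git/objects/info/alternates)
echo "$alternates"
```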

>
>> +
>> +For repositories with a large number of files, but you only need
>> +a few of them present in working tree, you can use sparse checkout
>> +(see linkgit:git-read-tree[1], section 'Sparse checkout'). Sparse
>> +checkout can be used with either a normal repository, or a shallow
>> +one.
>> +
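
A minimal sketch of the sparse checkout setup described in
git-read-tree, assuming a toy repository with a docs/ and a src/
directory (both names are made up). Only docs/ is kept in the
working tree; src/ stays in the index.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
mkdir docs src
echo doc > docs/readme.txt
echo code > src/main.c
git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m initial
# Enable sparse checkout and keep only docs/ in the working tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo 'docs/' > .git/info/sparse-checkout
# Re-read HEAD and update the working tree according to the patterns.
git read-tree -mu HEAD
ls
# src/main.c is still tracked, just not checked out.
tracked=$(git ls-files src)
echo "$tracked"
```
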
>> +Git uses lstat(3) to detect changes in working tree. A huge number
>> +of lstat(3) calls may impact performance, especially on systems with
>> +slow lstat(3). In some cases you can reduce the number of lstat(3)
>> +calls by specifying what directories you are interested in, so no
>> +lstat(3) outside is needed.
>> +
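
As an illustration of limiting a command to a directory, here is a
sketch with made-up paths. It only shows that paths outside src/ are
not reported; whether lstat(3) calls are actually saved depends on
the command and the Git version, and is not measured here.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
mkdir docs src
echo doc > docs/readme.txt
echo code > src/main.c
git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m initial
# Modify one file in each directory.
echo more >> docs/readme.txt
echo more >> src/main.c
# Only look at src/; the change in docs/ is not reported.
limited=$(git status --porcelain -- src)
echo "$limited"
```
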
>> +For repositories with a large number of files, you need all of them
>> +present in working tree, but you know in advance only a few of them
>> +may be modified, please consider using assume-unchanged bit (see
>> +linkgit:git-update-index[1]). This helps reduce the number of lstat(3)
>> +calls.
>> +
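
A sketch of how the bit could be used, with invented file names. It
also shows the catch: once the bit is set, Git takes your word for
it and stops reporting changes to that file until the bit is cleared.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
mkdir docs
echo doc > docs/readme.txt
git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m initial
# Promise Git that this file will not change.
git update-index --assume-unchanged docs/readme.txt
echo edited >> docs/readme.txt
# The edit goes unnoticed while the bit is set.
status=$(git status --porcelain)
# Entries with the bit set are shown with a lowercase tag by ls-files -v.
flagged=$(git ls-files -v | grep '^h')
echo "$flagged"
# Clear the bit; the modification is reported again.
git update-index --no-assume-unchanged docs/readme.txt
after=$(git status --porcelain)
echo "$after"
```
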
>> +Some Git commands need entire file content in memory to process.
>> +You may want to avoid using them if possible on large files. Those
>> +commands include:
>> +
>> +* All checkout commands (linkgit:git-checkout[1],
>> +  linkgit:git-checkout-index[1], linkgit:git-read-tree[1],
>> +  linkgit:git-clone[1]...)
>> +* All diff-related commands (linkgit:git-diff[1],
>> +  linkgit:git-log[1] with diff, linkgit:git-show[1] on commits...)
>> +* All commands that need file conversion processing
>> +
>
> This addresses one of my comments above. It might be worth talking about
> using git bundles as an alternative to cloning over unreliable connections.

Thanks.
-- 
Duy