rev-list default order sometimes very slow (size and metadata dependent)


 



Hello,

 

As part of my company's migration from SVN to git, we discovered a

performance issue in rev-list with large repositories.  The issue appears to

be metadata-dependent; we were able to work around it (completely avoiding

any performance penalty) by changing the date of certain commits.

 

The general structure of our repository is, I think, fairly normal (if large

-- we have >5.5 million commits total).  We have a handful of trunk

branches, and ~10k total refs.  To reduce the ref count (we hit other

performance issues when we had significantly more refs), we remove refs as

we're done with them.  Any code that doesn't make it into a trunk is

preserved in an archive branch.  The archive branch has no content, and

consists entirely of octopus merges with 50-500 parents.

 

If the archive branch is created with author/commit dates older than the

rest of the repository, we're able to run:

  $ git rev-list --count --all

in ~9-10 seconds on a mirror clone with a commit-graph.  However, if the

archive branch is instead created with author/commit dates newer than the

rest of the repository, it takes 4-5 minutes.

 

Using any order other than the default or --reverse removes the disparity.

All orders except --author-date-order bring things much closer to the ~9-10

seconds we see with the workaround, and --author-date-order is still under a

minute (though not by much).
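A minimal sketch for reproducing this comparison in one pass, assuming a bash shell and a repository path passed as the first argument. The helper names rev_list_orderings and time_orderings are hypothetical and not part of the original report; the flags themselves are standard rev-list ordering options.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original report.

# Print the ordering flags to compare, one per line.
# The empty first line stands for the default order.
rev_list_orderings() {
    printf '%s\n' "" --reverse --date-order --author-date-order --topo-order
}

# Time `git rev-list --count --all` in the given repository under each ordering.
time_orderings() {
    local repo=${1:?usage: time_orderings <repo path>}
    local order
    rev_list_orderings | while IFS= read -r order
    do
        echo "== ${order:-default order} =="
        # $order is deliberately unquoted so the empty default adds no argument.
        # stdin is redirected so git cannot consume the list being read by the loop.
        time git -C "$repo" rev-list --count --all $order < /dev/null
    done
}
```

Running `time_orderings /path/to/repo` against both a fast and a slow repository should make any disparity between orderings visible side by side.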

 

System info from git bugreport:

  [System Info]

  git version:

  git version 2.42.0.windows.2

  cpu: x86_64

  built from commit: 2f819d1670fff9a1818f63b6722e9959405378e3

  sizeof-long: 4

  sizeof-size_t: 8

  shell-path: /bin/sh

  feature: fsmonitor--daemon

  uname: Windows 10.0 19044

  compiler info: gnuc: 13.2

  libc info: no libc information available

  $SHELL (typically, interactive shell): C:\Program Files\Git\usr\bin\bash.exe

 

  (no enabled hooks)

 

Note that we first realized this was an issue on our GitLab instance, which

runs on Linux, so this is not a Windows-specific bug.

 

I created a bash script to create very similar repositories that are/are not

affected by the issue; it follows.  The issue starts to become visible at 1

million commits (the default), where the difference is ~2x.  5 million

commits is roughly equivalent performance-wise to what we saw in our

repository, with a difference of ~33x.  Note that with 5 million commits,

each repository is ~1.2 GB and takes 7-8 minutes to create on an i9-9900

with NVMe storage.
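For orientation, the shape of the generated repository follows directly from the arithmetic in the script below. The following sketch only mirrors that arithmetic; the function name describe_repo_shape is hypothetical, and it does not create or inspect any repository.

```shell
#!/bin/bash
# Illustrative only: mirrors the arithmetic used by the generator script.
# describe_repo_shape is a hypothetical name, not part of the original script.
describe_repo_shape() {
    local numberOfCommits=${1:-1000000}
    local oldTimestamp=315554400    # same constant as the generator script
    local newTimestamp=1672552800   # same constant as the generator script

    # One archive commit is generated per 1000 main-branch commits.
    local archiveCommits=$(( numberOfCommits / 1000 ))
    # Main-branch commit dates are spread evenly between the two timestamps.
    local increment=$(( (newTimestamp - oldTimestamp) / (numberOfCommits + 2) ))

    echo "main commits: $numberOfCommits"
    echo "archive commits: $archiveCommits"
    echo "merge lines per archive commit: 50"
    echo "seconds between main commits: $increment"
}
```

With the default of 1000000 commits this reports 1000 archive commits and 1356 seconds between main commits; with 5000000 commits it reports 5000 archive commits and 271 seconds.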

 

Once you create a fast and a slow repo with the script, try the following

commands in each one:

  # Shows the performance difference

  $ time git rev-list --count --all

  # Shows very similar performance across both repos

  $ time git rev-list --count --all --topo-order

 

Thank you,

Kevin Lyles

klyles@xxxxxxxx

----------------------------------------

 

#!/bin/bash

 

usage="Usage: $0 <destination folder> <--fast|--slow> [Number of commits (default: 1000000)]"

destinationFolder=${1:?$usage}

 

oldTimestamp=315554400 # 1980-01-01 midnight

newTimestamp=1672552800 # 2023-01-01 midnight

if [ "$2" == "--fast" ]

then

    archiveTimestamp=$oldTimestamp

elif [ "$2" == "--slow" ]

then

    archiveTimestamp=$newTimestamp

else

    echo "$usage" >&2

    exit 1

fi

 

numberOfCommits=${3:-1000000}

if ! [[ "$numberOfCommits" =~ ^[0-9]+$ ]]

then

    echo "$usage" >&2

    exit 1

fi

 

increment=$(( (newTimestamp - oldTimestamp) / (numberOfCommits + 2) ))

 

timestamp=$oldTimestamp

 

rm -rf "$destinationFolder"

git init "$destinationFolder"

 

echo "Fast-importing repo, please wait..."

{

    echo "feature done"

    echo "reset refs/heads/main"

    echo ""

 

    for count in $(seq "$numberOfCommits")

    do

        timestamp=$(( timestamp + increment ))

        echo "commit refs/heads/main"

        echo "mark :$count"

        echo "committer Test Test <test@xxxxxxxx> $timestamp -0500"

        echo "data <<|"

        echo "Main branch commit #$count"

        echo "|"

        echo ""

    done

 

    parentMark=0

    echo "reset refs/archive"

    for count in $(seq $(( numberOfCommits / 1000 )))

    do

        echo "commit refs/archive"

        echo "committer Test Test <test@xxxxxxxx> $archiveTimestamp -0500"

        echo "data <<|"

        echo "Archive branch commit #$count"

        echo "|"

        for parentCount in {1..50}

        do

            parentMark=$(( (parentMark + 99991) % numberOfCommits + 1 ))

            echo "merge :$parentMark"

        done

        echo ""

    done

 

    echo "done"

} | git -C "$destinationFolder" fast-import

 

git -C "$destinationFolder" commit-graph write
