Strange O(N^3) behavior in "git filter-branch"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have noticed that "git filter-branch" gets pathologically slow when it
operates on a repository that has many references in a complicated
directory hierarchy.  The time seems to go like O(N^3), where N is the
number of references being rewritten.

I have attached a Python program that I have been using to explore this
behavior.  It creates a simple linear history with hardly any content
but one reference to each commit, then runs "git-filter-branch" on all
branches using a trivial tree-filter.

The times were taken using the "maint" version of git on a notebook
computer with 4Gb of RAM running Linux 2.6.32.  The RAM usage never
seemed to expand unusually.

The part that slows down is the part near the end where it writes

    Ref 'foo' was rewritten
    Ref 'bar' was rewritten
    etc.

These lines come at ever-greater intervals and dominate the time for
running git-filter-branch for N more than 1000 or so.  The time taken
for each iteration of this loop increases quadratically, so the time for
the whole run goes like N^3.  The time that the Nth line appears goes
approximately like

    254.954 + 0.0195578*N + 3.38276e-05*N^2 + 3.45833e-08*N^3

; and the N^3 term starts to dominate around N=1000.  The 1000th step
takes about 270 ms.

To figure out where the time is going, I instrumented git-filter-branch
as shown below, and found the blocks between "1" and "2", "4b" and "5",
and "5" and 6" each take about 90 ms (and therefore almost all of the time).

The slowdown seems to depend crucially on having a complicated hierarchy
of references.  The bad case is lots of subdirectories, like

    refs/heads/a0/b0/c0000
    refs/heads/a1/b0/c0001
    refs/heads/a2/b0/c0002
    ...
    refs/heads/a9/b9/c0999

, where I have basically sharded the references based on their last two
digits.  (It is just as slow if the sharding is based on the most
significant digits.)

On the other hand, if the refs are all in one directory (even a
subdirectory) like

    refs/heads/a/b/c0000
    refs/heads/a/b/c0001
    ...
    refs/heads/a/b/c0999

, then the times go down dramatically, and appear to go from O(N^3) to
O(N^2) (though O(N^2) still seems too slow for this loop!).

If the "git pack-refs" is omitted before "git filter-branch", then the
times become considerably worse, though I have not done systematic tests
on this variation.

I tried to reproduce the slowdown in a simpler scenario using various
loops over "git update-ref", but have so far been unsuccessful.
However, I have also observed a serious slowdown in the context of
making lots of automated fixes to a git repository converted from
Subversion (involving lots of calls to git-update-ref).  The slowdown in
this case could be largely mitigated by running "git pack-refs"
periodically, so it seems that the slowdown is related to the number of
unpacked refs in complicated directory structures.  Unfortunately it
would not be trivial to insert "git pack-refs" in the "git
filter-branch" loop because the loop relies on the presence of unpacked
references to "avoid rewriting a ref twice".

Does anybody have an idea?

Michael

Instrumented git-filter-branch:
> while read ref
> do
> echo 0 @@@
> 	# avoid rewriting a ref twice
> 	test -f "$orig_namespace$ref" && continue
> 
> echo 1 @@@
> 	sha1=$(git rev-parse "$ref"^0)
> echo 2 @@@
> 	rewritten=$(map $sha1)
> 
> echo 3 @@@
> 	test $sha1 = "$rewritten" &&
> 		warn "WARNING: Ref '$ref' is unchanged" &&
> 		continue
> 
> echo 4 @@@
> 	case "$rewritten" in
> 	'')
> echo 4a @@@
> 		echo "Ref '$ref' was deleted"
> 		git update-ref -m "filter-branch: delete" -d "$ref" $sha1 ||
> 			die "Could not delete $ref"
> 	;;
> 	$_x40)
> echo 4b @@@
> 		echo "Ref '$ref' was rewritten"
> 		if ! git update-ref -m "filter-branch: rewrite" \
> 					"$ref" $rewritten $sha1 2>/dev/null; then
> 			if test $(git cat-file -t "$ref") = tag; then
> 				if test -z "$filter_tag_name"; then
> 					warn "WARNING: You said to rewrite tagged commits, but not the corresponding tag."
> 					warn "WARNING: Perhaps use '--tag-name-filter cat' to rewrite the tag."
> 				fi
> 			else
> 				die "Could not rewrite $ref"
> 			fi
> 		fi
> 	;;
> 	*)
> echo 4c @@@
> 		# NEEDSWORK: possibly add -Werror, making this an error
> 		warn "WARNING: '$ref' was rewritten into multiple commits:"
> 		warn "$rewritten"
> 		warn "WARNING: Ref '$ref' points to the first one now."
> 		rewritten=$(echo "$rewritten" | head -n 1)
> 		git update-ref -m "filter-branch: rewrite to first" \
> 				"$ref" $rewritten $sha1 ||
> 			die "Could not rewrite $ref"
> 	;;
> 	esac
> echo 5 @@@
> 	git update-ref -m "filter-branch: backup" "$orig_namespace$ref" $sha1 ||
> 		 exit
> echo 6 @@@
> done < "$tempdir"/heads

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
#! /usr/bin/python

import sys
import os
import subprocess
import shutil
import time


GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')


N = 3000


def run(*args, **kw):
    if True:
        sys.stderr.write('Running %s\n' % (repr(args),))
    subprocess.check_call(*args, cwd=GIT_REPO, **kw)


def get_refnames(*patterns):
    cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
    if True:
        sys.stderr.write('Running %s\n' % (repr(cmd),))
    p = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        cwd=GIT_REPO,
        )
    (out, err) = p.communicate()
    return [l.strip() for l in out.splitlines()]


def get_name(i):
    name = '%04d' % (i,)
    #return 'refs/heads/b%s' % (name,)
    return 'refs/heads/a/b/c%s' % (name,)
    #return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)


def make_repo():
    if os.path.exists(GIT_REPO):
        shutil.rmtree(GIT_REPO)

    subprocess.check_call(['git', 'init', GIT_REPO])

    run(['git', 'config', 'user.name', 'Lou User'])
    run(['git', 'config', 'user.email', 'luser@xxxxxxxxxxx'])

    open(FILENAME1, 'w')
    run(['git', 'add', '-N', FILENAME1])

    for i in range(N):
        open(FILENAME1, 'w').write('%d\n' % (i,))
        run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
        #run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
        run(['git', 'update-ref', get_name(i), 'HEAD'])


def pack_refs():
    run(['git', 'pack-refs', '--all', '--prune'])


def filter():
    run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])


def add_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(i)
        run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])


def move_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(N - 1 - i)
        run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])


def main(args):
    make_repo()
    pack_refs()
    t = time.time()
    filter()
    #add_refs()
    #move_refs()
    print 'time to process: %.3f s' % (time.time() - t,)


main(sys.argv[1:])


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]