Re: incremental push/pull for large repositories

On Fri, Jul 9, 2010 at 11:12 PM, Enrico Weigelt <weigelt@xxxxxxxx> wrote:
> I often have situations where I've rebased branches with large files
> (tens of megabytes per file) and then pushed them to the remote.
> Normally these files themselves stay untouched, but the history
> changes (e.g. commits reordered, several changes in smaller files,
> etc.).
>
> It seems that on each push, the whole branch is transferred,
> including all the large files, which already exist on the remote
> side.  Is there any way to prevent this?

I was hoping someone else would have replied to you with a brilliant
solution to this by now, but I guess not, so I'll try with my limited
knowledge.  I've seen this behaviour as well.

From what I understand, git uses an algorithm something like this to
determine which objects need to be transmitted by a push:

- find the latest commit T on the remote side that is also in the
branch you want to push.  (This part isn't an exhaustive search, and
might be off by a few commits if both ends have new changes, but this
problem usually happens only with fetch/pull, not push.)

- on the client doing the push, get a list of all objects in all new
commits that weren't in commit T, then generate and send the pack (see
the plumbing sketch below).
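
In plumbing terms, I believe the object list is roughly what rev-list
would compute.  This is only a sketch of my understanding, not the
actual send-pack code path, and it assumes your branch is master with
origin/master as the remote tip:

    # T = the common commit; then enumerate everything new on top of it
    T=$(git merge-base origin/master master)
    git rev-list --objects $T..master | wc -l  # objects a push would send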

As you can imagine, this is terribly non-optimal.  For example, if you
use 'git revert', it re-uploads all the objects the revert restores,
even though they obviously already exist in the remote repo.  Example:

    #!/bin/sh
    set -e
    cd /tmp
    mkdir repo
    cd repo
    git init --bare
    cd ..
    git clone repo worktree
    cd worktree
    for i in $(seq 1000); do echo $i >$i; done
    git add .
    git commit -m orig
    git push -u origin HEAD  # sends about 1000 objects
    echo
    echo
    for i in $(seq 1000); do echo $i >>$i; done
    git commit -a -m doubled
    git push  # sends about 1000 objects
    echo
    echo
    git revert --no-edit HEAD
    git push  # sends about 1000 objects (again!)
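
By the way, the "about 1000 objects" figures come from the "Writing
objects: 100% (N/N)" progress line that push prints.  If you redirect
the output, something like this should still show it, though you may
need --progress on some git versions:

    git push --progress 2>&1 | grep 'Writing objects'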

The promising-looking "--thin" option to git-push doesn't help with
this at all.  As far as I can tell, thin packs only let the sender use
objects the receiver is known to already have as delta *bases*; they
don't change which objects get sent, so the reverted blobs still go
over the wire.  (In fact, --thin already seems to be the default for
pushes over the network.)
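
If you want to see that for yourself, try replacing the last push in
the script above with the line below; the object count should stay the
same, and only the pack size changes:

    git push --no-thin  # still ~1000 objects, just less delta compression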

You can imagine lots of ways to improve this, of course.  There's a
tradeoff between searching the history for old objects (which can be
slow in a huge repo) vs. just sending them and discarding duplicates
on the remote server.  For many projects, the tradeoff is an easy one:
just send the files, since they're tiny anyway, and sending them is
much faster than exhaustively searching the history.  But as soon as
huge files or huge numbers of files start to get involved, the
situation changes fast.
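
To get a feel for why the exhaustive-search side of that tradeoff
hurts, time a full object walk on any big repository; this visits
every reachable object, which is the worst case the negotiation is
trying to avoid:

    time git rev-list --objects --all | wc -l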

bup (http://github.com/apenwarr/bup) uses a totally different method:
the server sends you its pack .idx files, and you never push any
object that matches anything in those .idx files.  That works fine in
bup, because bup repositories are virtually never repacked.  (You pay
a time/bandwidth cost when you first talk to a repo, but it's worth it
to potentially avoid re-sending gigabytes worth of data.)  But in git,
where repacking happens frequently, this wouldn't fly, because the
indexes change every time someone runs "git gc".
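
For the curious, here's roughly what bup's check looks like translated
into git plumbing.  This is just a sketch of the idea, assuming you had
somehow copied the remote's pack .idx files into ./remote-idx; it's not
something bup or git actually run:

    # object ids we're about to send
    git rev-list --objects origin/master..master |
        cut -d' ' -f1 | sort >new-ids

    # object ids the remote's pack indexes say it already has
    for idx in remote-idx/*.idx; do git show-index <"$idx"; done |
        awk '{print $2}' | sort -u >remote-ids

    # only these objects actually need to go over the wire
    comm -23 new-ids remote-ids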

If you came up with a patch to do improved packing/negotiation, I bet
it would be accepted.  Of course, it would have to either be optional
or have a decent heuristic for when to enable itself, because *most*
of the time, the default git behaviour is probably as fast as
possible.

Have fun,

Avery

