Git Large Object Support Proposal

I have been thinking about this for a while and wanted to get some
feedback.  I've been seeing a number of people interested in using Git
for game development and the like, or otherwise committing huge files.
This occasionally wreaks havoc on our servers (at GitHub) because of
the memory mapping involved, so we would really like to see a nicer
way for Git to handle big files.

There are two proposals on the GSoC page to deal with this - the
'remote alternates/lazy clone' idea and the 'sparse/narrow clone'
idea.  I'm wondering if it might instead be worthwhile to concentrate
on the 'stub objects' for large blobs that Jakub was talking about a
few months ago:

http://markmail.org/message/my4kvrhsza2yjmlt

The difference here is that Git would store only a stub object, and the
large binary content would be pulled in via a separate mechanism.  I
was thinking the client could set a maximum file size, and when a
binary file larger than that is staged, Git would write a stub blob like:

==
blob [size]\0
[sha of large blob]
==
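
To make the idea concrete, here is a rough Python sketch of how the
stub could be generated - just an illustration, the helper names are
made up and the exact stub payload format is an open question:

==
import hashlib
import os

def git_blob_sha1(path):
    """SHA-1 Git would assign to the full file: hash of 'blob <size>',
    a NUL byte, then the content, streamed so a multi-gigabyte file
    never has to sit in memory."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(b'blob %d\0' % size)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def stub_payload(path):
    """Content of the small stub blob stored in place of the real data:
    just the hex SHA of the real blob (this payload format is an
    assumption, not settled)."""
    return (git_blob_sha1(path) + '\n').encode()
==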

Then in the tree, we give the stubbed large file a special mode or type:

==
100644 blob 3bb0e8592a41ae3185ee32266c860714980dbed7 README
040000 tree 557b70d2374ae77869711cb583e6d59b8aad5e8b lib
150000 blob 502feb557e2097d38a643e336f722525bc7ea077 big-ass-file.mpeg
==
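
Finding these entries later should be easy with existing plumbing -
something along these lines (the 150000 mode is just the hypothetical
value from the listing above, and the function names are made up):

==
import subprocess

MEDIA_MODE = '150000'  # hypothetical stub mode from the tree listing above

def list_media_stubs(treeish='HEAD'):
    """Return (path, stub_sha) for every tree entry that uses the stub
    mode.  'git ls-tree -r' prints '<mode> <type> <sha>', a tab, then
    the path."""
    out = subprocess.check_output(['git', 'ls-tree', '-r', treeish], text=True)
    stubs = []
    for line in out.splitlines():
        meta, path = line.split('\t', 1)
        mode, objtype, sha = meta.split()
        if mode == MEDIA_MODE:
            stubs.append((path, sha))
    return stubs

def real_sha(stub_sha):
    """Read the stub blob's content to get the SHA of the real blob
    (assuming the stub payload is just the hex SHA, as sketched above)."""
    return subprocess.check_output(
        ['git', 'cat-file', 'blob', stub_sha], text=True).strip()
==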

It would work sort of like a symlink, except that instead of the blob
containing the link path, it contains the SHA of the real blob.  Then
we could have a command like 'git media' or something that helps
manage those entries - pulling them down from a server specified in a
.gitmedia file, transferring new ones up before a push is allowed,
etc.  That makes it sort of a cross between a symlink and a submodule.

== .git/config
[media]
    push-url = [aws/scp/sftp/etc server]
    password = [write password]
    token = [write token]

== .gitmedia
[server]
    pull-url = [aws/scp/sftp/etc read only url]
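
Reading those would just go through the existing config machinery -
something like this, where the key names mirror the examples above
(anything not shown there is made up):

==
import subprocess

def config_get(key, file=None):
    """Read one value with 'git config'; return None if the key is unset."""
    cmd = ['git', 'config']
    if file:
        cmd += ['--file', file]
    cmd.append(key)
    try:
        return subprocess.check_output(cmd, text=True).strip()
    except subprocess.CalledProcessError:
        return None

# The committed .gitmedia carries only the read-only URL; the per-user
# write credentials stay in .git/config and never get pushed.
pull_url = config_get('server.pull-url', file='.gitmedia')
push_url = config_get('media.push-url')
==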

This might be nice because all the objects would be local, so most of
the changes to the tools should be rather small - we can't usefully
merge/diff/blame large binary files anyhow, right?  Also, the really
large files could be written and served over protocols that are better
suited to large file transfer (scp, sftp, etc.) - the media server
could be different from the git server.  Then our servers could stop
choking when someone tries to add and push a 2GB file.

If two users have different settings, one would simply have the stub
and the other the real blob; 'git media update' could check the local
object database before fetching anything over the network.  If you
change the max-file-size at some point, new trees would simply start
using stubs for files that are now over the limit (if you lowered it),
or stop using stubs for anything that now fits under it (if you raised it).
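
The "check the local db first" part could be as simple as an
object-existence test per stub (again just a sketch; the pairs would
come from the stub listing above):

==
import subprocess

def have_object(sha):
    """True if the real blob is already in the local object database,
    in which case 'git media update' can skip the network entirely."""
    return subprocess.call(['git', 'cat-file', '-e', sha],
                           stderr=subprocess.DEVNULL) == 0

def media_to_fetch(stubs):
    """Given (path, real_sha) pairs, keep only the files whose real
    blobs are not already present locally."""
    return [(path, sha) for path, sha in stubs if not have_object(sha)]
==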

The workflow may go something like this:

$ cd git-repo
$ cp ~/huge-file.mpg .
$ git media add s3://chacon-media
# wrote new media server url to .gitmedia
$ git add .
# huge-file.mpg is larger than max-file-size (10M) and will be added as media (see 'git media')
$ git status
# On branch master
#
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	new file:   .gitmedia
#	new media:   huge-file.mpg
#
$ git push
Uploading new media to s3://chacon-media
Uploading media files 100% (5/5), done.
New media uploaded, pushing to Git server
Counting objects: 14, done.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (10/10), 1.04 KiB, done.
Total 10 (delta 4), reused 0 (delta 0)
To git@xxxxxxxxxx:schacon/mediaproject.git
 + dbb5d00...9647674 master -> master
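
The "upload before push" ordering above could be wrapped up roughly
like this (all of these names are made up, and the transfer itself
would be backend-specific):

==
import subprocess

def upload_media(path, sha, push_url):
    """Stand-in for the actual transfer; a real tool would pick an
    s3/scp/sftp backend based on the scheme of media.push-url."""
    raise NotImplementedError

def media_push(new_media, push_url):
    """Upload every new media blob to the media server first, then let
    the normal 'git push' run."""
    for path, sha in new_media:
        upload_media(path, sha, push_url)
    subprocess.check_call(['git', 'push'])
==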


On the client side we would have something like this:

$ git clone git://github.com/schacon/mediaproject.git
Initialized empty Git repository in /private/tmp/simplegit/.git/
remote: Counting objects: 270, done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 270 (delta 103), reused 198 (delta 77)
Receiving objects: 100% (270/270), 24.31 KiB, done.
Resolving deltas: 100% (103/103), done.
# You have unfetched media, run 'git media update' to get large media files
$ git status
# On branch master
#
# Media files to be fetched:
#   (use "git media update <file>..." to fetch)
#
#	unfetched:   huge-file.mpg
#
$ git media update
Fetching media from s3://chacon-media
Fetching media files 100% (1/1), done.
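
And 'git media update' itself would basically be fetch, store, and
check out - roughly like this (the fetch_one callback and the rest of
the names are made up; 'git hash-object -w' is existing plumbing):

==
import shutil
import subprocess

def store_in_odb(tmp_path):
    """Write downloaded content into the local object database so its
    SHA resolves from now on."""
    return subprocess.check_output(
        ['git', 'hash-object', '-w', tmp_path], text=True).strip()

def media_update(missing, fetch_one):
    """For each (path, real_sha) still missing: download the blob from
    the pull-url, store it, and replace the stub file in the working
    tree with the real content.  fetch_one(sha) downloads to a temp
    file and returns its path."""
    for path, real in missing:
        tmp = fetch_one(real)
        assert store_in_odb(tmp) == real, 'downloaded content does not match stub'
        shutil.move(tmp, path)
==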


Anyhow, you get the picture.  I would be happy to try to get a proof
of concept of this done, but I wanted to know if there are any serious
objections to this approach to large media.