I have been thinking about this for a while, so I wanted to get some
feedback. I've been seeing a number of people interested in using Git
for game development and the like, or otherwise committing huge files.
This occasionally wreaks havoc on our servers (GitHub) because of the
memory mapping involved, so we would really like to see a nicer way for
Git to handle big files.

There are two proposals on the GSoC page to deal with this - the
'remote alternates/lazy clone' idea and the 'sparse/narrow clone' idea.
I'm wondering if it might instead be interesting to concentrate on the
'stub objects' for large blobs that Jakub was talking about a few
months ago:

http://markmail.org/message/my4kvrhsza2yjmlt

but where Git stores a stub object and the large binary object is
pulled in via a separate mechanism.

I was thinking that the client could set a maximum file size, and when
a binary object larger than that is staged, Git instead writes a stub
blob like:

==
blob [size]\0
[sha of large blob]
==

Then in the tree, we give the stubbed large file a special mode or
type:

==
100644 blob 3bb0e8592a41ae3185ee32266c860714980dbed7    README
040000 tree 557b70d2374ae77869711cb583e6d59b8aad5e8b    lib
150000 blob 502feb557e2097d38a643e336f722525bc7ea077    big-ass-file.mpeg
==

Sort of like a symlink, but instead of the blob it points to containing
the link path, it just contains the SHA of the real blob. Then we could
have a command like 'git media' that helps manage those: pulling them
down from a specified server (listed in a .gitmedia file), transferring
new ones up before a push is allowed, and so on. This makes it sort of
a cross between a symlink and a submodule.

== .git/config
[media]
        push-url = [aws/scp/sftp/etc server]
        password = [write password]
        token = [write token]

== .gitmedia
[server]
        pull-url = [aws/scp/sftp/etc read-only url]

This might be nice because all the objects would be local, so most of
the changes to the tools should be rather small - we can't really
merge, diff or blame large binary content anyhow, right? Also, the
really large files could be written and served over protocols that are
better suited to large file transfer (scp, sftp, etc.) - the media
server could be different from the Git server. Then our servers could
stop choking when someone tries to add and push a 2 gig file.

If two users have different settings, one would simply have the stub
and the other not; 'git media update' could check the local object
database first before fetching anything. If you change the maximum file
size at some point, new trees would simply stop using stubs for
anything that now fits under the limit (if you raised it), or start
using stubs for files that are now over it (if you lowered it).
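To make the staging side a bit more concrete, here is a minimal sketch
of what such a helper could do with existing plumbing. The
'git-media-add' name, the media.push-url key and the one-file-per-SHA
scp layout are all assumptions of this sketch, and the real proposal
would stage the stub with a dedicated mode like 150000 rather than as a
normal 100644 entry:

== git-media-add (rough sketch)
#!/bin/sh
# Hypothetical helper, not an existing command: stage a large file as a
# stub using only existing plumbing. In the real proposal 'git add'
# would do this automatically once the file exceeds the configured
# maximum size, and the stub would get its own tree mode (e.g. 150000).

file=$1

# SHA of the real (large) content, computed without storing the content
# in the local object database.
real_sha=$(git hash-object "$file")

# Write a small stub blob whose content is just that SHA.
stub_sha=$(printf '%s' "$real_sha" | git hash-object -w --stdin)

# Ship the real content to the media server configured in .git/config,
# stored under its SHA (the scp layout is an assumption of this sketch).
push_url=$(git config media.push-url)
scp "$file" "$push_url/$real_sha"

# Stage the stub blob in place of the large file. (The working copy
# still has the real file, so 'git status' will show it as modified -
# fine for a sketch.)
git update-index --add --cacheinfo 100644 "$stub_sha" "$file"
==

A real implementation would hook this into 'git add' and refuse to push
until the upload succeeds, but nothing above needs new object machinery
beyond the stub blob and the proposed tree mode.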
The workflow may go something like this:

$ cd git-repo
$ cp ~/huge-file.mpg .
$ git media add s3://chacon-media
# wrote new media server url to .gitmedia
$ git add .
# huge-file.mpg is larger than max-file-size (10M) and will be added
# as media (see 'git media')
$ git status
# On branch master
#
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#       new file:   .gitmedia
#       new media:  huge-file.mpg
#
$ git push
Uploading new media to s3://chacon-media
Uploading media files 100% (5/5), done.
New media uploaded, pushing to Git server
Counting objects: 14, done.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (10/10), 1.04 KiB, done.
Total 10 (delta 4), reused 0 (delta 0)
To git@xxxxxxxxxx:schacon/mediaproject.git
 + dbb5d00...9647674 master -> master

On the client side we would have something like this:

$ git clone git://github.com/schacon/mediaproject.git
Initialized empty Git repository in /private/tmp/simplegit/.git/
remote: Counting objects: 270, done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 270 (delta 103), reused 198 (delta 77)
Receiving objects: 100% (270/270), 24.31 KiB, done.
Resolving deltas: 100% (103/103), done.
# You have unfetched media, run 'git media update' to get large media files
$ git status
# On branch master
#
# Media files to be fetched:
#   (use "git media update <file>..." to fetch)
#
#       unfetched:  huge-file.mpg
#
$ git media update
Fetching media from s3://chacon-media
Fetching media files 100% (1/1), done.

Anyhow, you get the picture. I would be happy to try to get a proof of
concept of this done, but I wanted to know if there are any serious
objections to this approach to large media.
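In case it helps the discussion, here is roughly what the fetch side of
that proof of concept could look like with existing plumbing. The
'git-media-update' name, the one-file-per-SHA layout on the media
server and the scp transport are assumptions of this sketch, not part
of the proposal itself:

== git-media-update (rough sketch)
#!/bin/sh
# Hypothetical helper: fetch the real content for any stubbed paths,
# checking locally before going to the media server.

pull_url=$(git config -f .gitmedia server.pull-url)

git ls-files | while read -r path; do
    stub_sha=$(git rev-parse ":$path")      # blob staged for this path

    # Stub blobs are tiny (just a 40-char SHA); skip everything else.
    [ "$(git cat-file -s "$stub_sha")" -le 41 ] || continue
    real_sha=$(git cat-file blob "$stub_sha")
    echo "$real_sha" | grep -q '^[0-9a-f]\{40\}$' || continue

    # Nothing to do if the working copy already has the real content.
    if [ -e "$path" ] && [ "$(git hash-object "$path")" = "$real_sha" ]; then
        continue
    fi

    # Check the local object database before fetching from the server.
    if git cat-file -e "$real_sha" 2>/dev/null; then
        git cat-file blob "$real_sha" > "$path"
    else
        scp "$pull_url/$real_sha" "$path"
        # Verify what we received matches the SHA recorded in the stub.
        [ "$(git hash-object "$path")" = "$real_sha" ] ||
            echo "media checksum mismatch: $path" >&2
    fi
done
==

Checking the object database before the media server is what keeps
mixed setups (one user stubbing, another not) cheap: if the real blob
already arrived through a normal push or fetch, no media transfer
happens at all.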