Adam Heath wrote:
We maintain a website in git. This website has a bunch of backend server code and a bunch of data files. A lot of these files are full videos.
First of all, I'm going to hint that you would be far better off keeping the media files in a separate repository, linked into git as a submodule and with configuration settings tweaked with the specific aim of handling huge files. The basis of such a repository is probably the following config settings, since media files very rarely compress enough to be worth the effort, and their own compressed formats make them very unsuitable delta candidates:

    [pack]
        # disable delta-based packing
        depth = 1
        # disable compression
        compression = 0
    [gc]
        # don't auto-pack, ever
        auto = 0
        # never automatically consolidate un-.keep'd packs
        autopacklimit = 0

You will have to manually repack this repository from time to time, and it's almost certainly a good idea to mark the resulting packs with .keep to avoid copying tons of data. When packs are being created, objects can be copied from existing packs, and send-pack will make use of that, so that what goes over the wire will simply be copied from the existing packs. YMMV. If you do come up with settings that work fine for huge repos made up mostly of media files, please share your findings. A rough sketch of how such a setup might look follows below.
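For concreteness, something along these lines; all the paths and the "media" name here are made up, so adjust to taste:

    # create the dedicated media repository
    $ mkdir -p /srv/media && cd /srv/media
    $ git init
    $ git config pack.depth 1
    $ git config pack.compression 0
    $ git config gc.auto 0
    $ git config gc.autopacklimit 0

    # link it into the website repository as a submodule
    $ cd /srv/website
    $ git submodule add /srv/media media

    # when you do repack manually, mark the result so later
    # repacks and clones leave it alone
    $ cd /srv/media
    $ git repack -a -d
    $ touch .git/objects/pack/pack-<sha1>.keep

The pack-<sha1> name is obviously a placeholder for whatever pack the repack actually produces.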
We use git so that the distributed nature of website development can be supported. Quite often you'll have a production server, with online changes occurring (we support in-browser editing of content), a preview server, where large-scale code changes can be previewed, and then a development server, one per programmer (or more).

Last Friday, I was doing a checkin on the production server and found 1.6G of new files. git was quite able at committing that. However, pushing was problematic. I was pushing over ssh, so a new ssh connection was opened to the preview server. After that, git tried to create a new pack file. This took *ages*, and the ssh connection died. So did git, when it finally got done with the new pack and discovered the ssh connection was gone.

So, to work around that, I ran git gc. When it was done, I discovered that git had repacked the *entire* repository. While not something I care for, I can understand that, and can live with it. It just took *hours* to do so.
I'm not sure what, if any, magic "git gc" applies before spawning "git repack", but running "git repack" directly would almost certainly have produced an incremental pack. Perhaps we need to make gc less magic.
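That is, something along these lines; the second form is, as far as I know, roughly what gc ends up doing:

    # incremental: roll only the loose objects into a new pack,
    # leaving the existing packs alone
    $ git repack -d

    # full: everything into a single pack (the run that took hours)
    $ git repack -a -d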
Then, what really annoys me is that when I finally did the push, it tried sending the single 27G pack file when the remote already had 25G of the repository in several different packs (the site was an hg->git conversion). This part is just unacceptable.
Agreed. I've never run across that problem, so I can only assume it has something to do with many huge files being in the pack.
So, here are my questions/observations:

1: Handle the case of the ssh connection dying during git push (seems simple).
Not necessarily all that simple (we do not want to touch the ssh password if we can possibly avoid it, but the user shouldn't have to type it more than once), but certainly doable. Easier would probably be to recommend adding the proper SSH config variables, as has been stated elsewhere.
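In the meantime, the standard OpenSSH keep-alive knobs should stop the connection from being dropped while the pack is being built; something like this in ~/.ssh/config on the pushing machine (host name obviously made up):

    Host preview.example.com
        # probe the server after 60 seconds of silence ...
        ServerAliveInterval 60
        # ... and tolerate up to 30 unanswered probes (~30 minutes)
        ServerAliveCountMax 30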
2: Is there an option to tell git to *not* be so thorough when trying to find similar files? Videos/docs/PDFs/etc. aren't always very deltifiable, so I'd be happy to just do full content compares.
See above. I *think* you can also do this with git-attributes, but I'm not sure. However, keeping the large media files in a sub-module would nicely solve that problem anyway, and is probably a good idea even with git-attributes support for pack delta- and compression settings.
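Untested, but if the attributes route does work, I'd expect it to look something like this in a .gitattributes file at the top of the repository (extend the pattern list to whatever media types you carry):

    # never attempt delta compression on these; their own
    # compression makes them poor delta candidates anyway
    *.mp4 -delta
    *.avi -delta
    *.pdf -delta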
3: Delta packs seem to be poorly done. It seems that if one repo gets repacked completely, the entire new pack gets sent even when the target already has most of the objects.
This is certainly not the case for most repositories. I believe something is being triggered by repositories with many huge files, though.
4: Are there any config options I can set to help with this? There are tons of options, and some documentation as to what each one does, but no recommended-practices doc that describes what should be done for different kinds of workflows.
http://www.thousandparsec.net/~tim/media+git.pdf probably holds all the relevant information when it comes to storing large media files with git. I have not checked and have no inclination to do so.
ps: Thank you for your time. I hope that someone has answers for me.
Answers aplenty, I hope. I have neither time nor interest in developing this though, so the task of creating patches and/or documentation will have to fall to someone else.
pps: I'm not subscribed, so please cc me. If I need to be subscribed, I'll do so if told.
Subscribing won't be necessary. The custom on git@vger is to always Cc all who participate in the discussion, and to only cull those who state they're no longer interested in the topic.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.