Hi Git people,

I'm applying to GSoC to work on Git. I would like to help build a better big-file support mechanism. I have read the latest threads on this topic:

http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852

Here is a compilation of what I read and what I think.

The issue that comes up most often is OOM errors. I think the root of the problem is that git tries to work exactly the same way on binaries as on text. If we managed, one way or another, to detect the operations that make no sense on binaries (what "intelligent" operations are possible on binaries? Almost none...), we should be able to skip them most of the time. This means that a first step would be to introduce an autodetection mechanism (a rough sketch is at the bottom of this mail).

Jeff King argues that on binaries we get uninteresting diffs and compression is often useless. I agree. We would be better off not compressing them at all (okay, tons of zeros would compress well, but who is going to track zeroes?).

Eric Montellese says: "Don't track binaries in git. Track their hashes." I agree here too: we should not treat opaque data the same way as source code (or any other text). He says he needs to handle repos containing source code plus zipped tarballs plus large and/or many binaries. Users seem to really need binary tracking, so git should provide it; I have personally needed it a couple of times. He also suggests download-as-needed and remove-unnecessary operations, and I think it might be clean to add a git command like 'git blob' to handle such binary-specific operations. Perhaps as a second step. (A second sketch, of the track-the-hash idea, is at the bottom of this mail.)

Another idea was to create "sparse" repos, considered leaves since they cannot be cloned from, as they lack the full data. But that may or may not be in the spirit of Git...

What I personally would like as a feature is the ability to keep the main repo with the sources etc. as a conventional repo but put the data elsewhere, on a separate storage location. This would allow developing programs that need data to run (like textures in games) without making the repo slow, big or just messy. I faced this situation on TuxFamily, where the website, Git/SVN etc. live on one fast server and the download area on another. The first was limited to something like 100MB and the second to 1GB, extensible if needed. Along the same lines, on my home server with multiple OpenVZ containers, I host the repos for my projects in one freely accessible container which may be attacked, or even compromised, and which has a small disk partition. My data, on the other hand, sits on an ssh-only, secured, firewalled container with a big partition. It would be useful to keep the code on the first but fetch the data over ssh from the second. I suspect there are many other situations where a separation between code and data would help administrators and/or users.

To handle this I thought of a mechanism allowing a sort of branch (to make use of multiple remotes) to be checked out at the same time as the code... In addition we should use an extensible protocol layer to manage the data. git-annex and git-media, which already address some of the problems here, use various transports like ssh, http and s3. And I just saw that Debian's git package already recommends rsync. (A third sketch, of such a protocol layer, is at the bottom of this mail.)

What do you think about all this? Would it fit as a GSoC project? It is a genuinely interesting task, though it may sound too long for one summer. But of course, if the summer turns out to be too short, I will not drop the project on the floor as soon as the time limit is reached.
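Sketch 1: binary autodetection. This is only a toy illustration in Python of the kind of heuristic I have in mind; as far as I can tell git's own buffer_is_binary() already does something similar (a NUL byte in the first 8000 bytes means "binary"). The 32MB size threshold is purely my own assumption and would of course have to be configurable.

    import os

    FIRST_FEW_BYTES = 8000          # the window git's buffer_is_binary() looks at
    SIZE_THRESHOLD = 32 * 1024**2   # 32MB: arbitrary cut-off, my own assumption

    def looks_binary(path):
        """Heuristic: a NUL byte early in the file means 'binary'."""
        with open(path, 'rb') as f:
            return b'\0' in f.read(FIRST_FEW_BYTES)

    def should_skip_expensive_ops(path):
        """Skip delta/diff/compression for files that are binary or huge."""
        return looks_binary(path) or os.path.getsize(path) > SIZE_THRESHOLD

The real decision would of course live inside git itself (and stay overridable via gitattributes); the point is only that the check is cheap enough to run on every file.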
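Sketch 2: "track their hashes". Again just an illustration: the repo only carries a tiny pointer with the content hash, and the payload goes to an external store. This is roughly what git-media/git-annex do; the pointer format and the ~/.git-bigfiles location below are made up by me for the example.

    import hashlib, os, shutil

    STORE = os.path.expanduser('~/.git-bigfiles')   # hypothetical external store

    def add_big_file(path):
        """Replace a big file by a small pointer; the payload moves to STORE."""
        sha1 = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                sha1.update(chunk)
        digest = sha1.hexdigest()

        os.makedirs(STORE, exist_ok=True)
        stored = os.path.join(STORE, digest)
        shutil.copy2(path, stored)                  # payload lives outside the repo

        with open(path, 'w') as f:                  # git only ever sees this pointer
            f.write('bigfile-pointer\nsha1 %s\nsize %d\n'
                    % (digest, os.path.getsize(stored)))

A hypothetical 'git blob fetch <path>' (the command I mentioned above) would then just read the pointer back and ask the store for the matching content, which gives download-as-needed almost for free.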
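Sketch 3: an extensible protocol layer. The idea is a small backend interface so that the external store can sit behind rsync/ssh, plain http, s3 or whatever else. The class names and the rsync/curl invocations are mine, only to show the shape of the abstraction, not a proposed implementation.

    import subprocess

    class Backend(object):
        """Minimal interface every transport backend would implement."""
        def fetch(self, digest, dest):
            raise NotImplementedError
        def push(self, src, digest):
            raise NotImplementedError

    class RsyncBackend(Backend):
        def __init__(self, remote):         # e.g. 'user@data-host:/srv/bigfiles'
            self.remote = remote
        def fetch(self, digest, dest):
            subprocess.check_call(['rsync', '%s/%s' % (self.remote, digest), dest])
        def push(self, src, digest):
            subprocess.check_call(['rsync', src, '%s/%s' % (self.remote, digest)])

    class HttpBackend(Backend):
        def __init__(self, base_url):       # read-only mirror over plain http
            self.base_url = base_url
        def fetch(self, digest, dest):
            subprocess.check_call(['curl', '-sf', '-o', dest,
                                   '%s/%s' % (self.base_url, digest)])

Which backend to use could simply be picked from the repo config, the same way remotes are configured today.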
Best regards,
--
Jonathan Michalon