Johannes Schindelin came, saw, and said on 06.04.2011 11:25:
> Hi,
>
> On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:
>
>> Binary large objects. I know it has been discussed again and again,
>> but I'd like to know if there is something new.
>>
>> Some corporation hired the company I work for one year ago to develop
>> a large application. They imposed ClearCase as the VCS. I don't know
>> if you have used it, but it is a pain in the ass. We have lost weeks
>> of development to site-replication problems, funny merges, etc. We
>> are trying to migrate our project to git, which we have experience
>> with.
>>
>> One very important point in this project (which is Windows only) is
>> putting binaries in the repository. So far, we have succeeded in not
>> doing that in other projects, but we will need to do it in this
>> project.
>>
>> In the Windows world, it is not unusual to use third-party libraries
>> which are only available in binary form. Getting them as source is
>> not an option because the companies developing them are not selling
>> the source. Moving from those binary-only dependencies to something
>> else is not an option either, because what we are using has some
>> unique features, be it technical features or support features. In our
>> project, we have about a dozen such binaries, ranging from a few
>> hundred kilobytes to a couple hundred megabytes (proprietary database
>> and virtualization engine).
>>
>> The usual answer to the "I need to put binaries in the repository"
>> question has been "no, you do not". Well, we do. We are in heavy
>> development now, so today's version may depend on a certain version
>> of a third-party shared library (DLL) which we can only get in binary
>> form, and tomorrow's version may depend on the next version of that
>> library, and you cannot mix today's source with yesterday's
>> third-party DLL. I.e., to be able to use the code from 7 days ago at
>> 11.07 AM, you need "git checkout" to "return" our source AND the
>> binaries we were using back then. This is something ClearCase manages
>> satisfactorily.
>
> I understand. The problem in your case might not be too bad, after
> all. The problem only arises when you have big files that are
> compressed. If you check in multiple versions of an uncompressed .dll
> file, Git will usually do a very good job at compressing them.
>
> If they are compressed, what you probably need is something like a
> sparse clone, which is sort of available in the form of shallow
> clones, but that is still too limited.
>
> Having said that, in another company I work for, they have 20G
> repositories, and they will grow larger. That is something they
> incurred for historical reasons, and they are willing to pay the price
> in terms of disk space. Due to Git's distributed nature, they had no
> problems with cloning; they just use a local reference upon the
> initial clone.
>
>> I have read about:
>> - submodules + using different repositories once one "blob
>> repository" grows too much. This will probably be rejected because
>> it is quite contrived.
>
> I would also recommend against this, because submodules are a very
> weak part of Git.
>
>> - git-annex (does not get the files in when cloning, pulling or
>> checking out; you need to do it manually)
>> - git-media (same as git-annex)
>
> Yes, this is an option, but a bit clunky.
>
>> - boar (no, we do not want to use a VCS for binaries in addition to
>> git)
>
> I did not know about that.
>
>> - and a few more
>>
>> So far the only good solution seems to be git-bigfiles, but it's
>> still in development.
>
> It has stalled, apparently, but I wanted to have a look at it anyway.
> Will let you know of my findings!

I think in many applications the "download-on-demand" approach which
git-annex takes is very important. (I don't know how far our
sparse/shallow support covers this.) Also, their remote backends look
interesting. And no, I don't want Haskell as yet another language for
our code base.

Fedora handles big files (compressed tarballs) in git with a file
store, some scripting (fedpkg) and a text file with hash values
("sources"), which is the only thing tracked in git; something of a
baby version of git-annex.
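To make that concrete, here is a minimal sketch of such a scheme in
Python. This is made up for illustration only; the manifest format, the
store URL and the helper names are mine, not fedpkg's actual interface:

#!/usr/bin/env python3
# Sketch of a Fedora-style scheme: git tracks only a small "sources"
# manifest of "sha256 filename" lines; the big files themselves live
# in a plain HTTP file store, keyed by their hash.
import hashlib
import os
import urllib.request

STORE = "https://example.com/filestore"  # hypothetical big-file store

def sha256_of(path):
    # Hash in chunks so a multi-hundred-megabyte DLL does not fill RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_sources(manifest="sources"):
    # Download every listed file that is missing or has a stale hash.
    with open(manifest) as f:
        for line in f:
            if not line.strip():
                continue
            digest, name = line.split()
            if os.path.exists(name) and sha256_of(name) == digest:
                continue  # already present and verified
            urllib.request.urlretrieve(
                "%s/%s/%s" % (STORE, digest, name), name)
            if sha256_of(name) != digest:
                raise RuntimeError("checksum mismatch for " + name)

if __name__ == "__main__":
    fetch_sources()

Checking in a new version of a binary is then just uploading it to the
store and updating one line in "sources"; history stays small, and
"git checkout" of an old commit plus one fetch run gives you the
matching binaries.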
The symlink-based approach of annex (the big file is a symlink into the
"object store", which is indexed by the blob content's sha1) reminds me
very much of our notes trees and the way the textconv cache uses them.
It feels as if we already have all the pieces in place. (I don't think
we need to track big files' contents, only their hashes; this is fast
for read-only media, see annex's WORM backend.)

Another crazy idea would be to "git replace" big files by place-holders
(a blob with the big file's sha1 as content), or rather the other way
round, but I haven't thought this through. (See the PS below for a
rough sketch.)

Michael
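PS: To illustrate the place-holder idea, an untested sketch; the helper
names are made up, only "git hash-object" and "git replace" are real:

#!/usr/bin/env python3
# Commit a tiny place-holder blob that carries only the big file's
# sha1, and map it to the real content locally via "git replace".
import subprocess

def git(*args, **kwargs):
    out = subprocess.check_output(("git",) + args, **kwargs)
    return out.decode().strip()

def replace_with_placeholder(big_file):
    # Write the real content as a loose blob (it could equally live in
    # an annex-like store outside the repository).
    real = git("hash-object", "-w", big_file)
    # The place-holder blob contains nothing but the real blob's sha1;
    # it is the only thing that ever ends up in a tree/commit.
    placeholder = git("hash-object", "-w", "--stdin",
                      input=(real + "\n").encode())
    # refs/replace/<placeholder> -> <real>: wherever the place-holder
    # appears, this repository shows the real content instead.
    git("replace", placeholder, real)
    return placeholder

As far as I know, plain fetch/clone does not transport refs/replace/*
by default, so the replace refs (and with them the real blobs) would
have to be fetched separately -- which is exactly the download-on-demand
part again.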