Johannes Schindelin came, saw, and said on 06.04.2011 11:25:
> Hi,
>
> On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:
>
>> Binary large objects. I know it has been discussed again and again,
>> but I'd like to know if there is something new.
>>
>> Some corporation hired the company I work for one year ago to develop
>> a large application. They imposed ClearCase as the VCS. I don't know
>> if you have used it, but it is a pain in the ass. We have lost weeks
>> of development to site-replication problems, funny merges, etc. We
>> are trying to migrate our project to git, which we have experience
>> with.
>>
>> One very important point in this project (which is Windows only) is
>> putting binaries in the repository. So far, we have succeeded in not
>> doing that in other projects, but we will need to do it in this
>> project.
>>
>> In the Windows world, it is not unusual to use third-party libraries
>> which are only available in binary form. Getting them as source is
>> not an option because the companies developing them are not selling
>> the source. Moving from those binary-only dependencies to something
>> else is not an option either, because what we are using has some
>> unique features, be it technical features or support features. In our
>> project, we have about a dozen such binaries, ranging from a few
>> hundred kilobytes to a couple hundred megabytes (proprietary database
>> and virtualization engine).
>>
>> The usual answer to the "I need to put binaries in the repository"
>> question has been "no, you do not". Well, we do. We are in heavy
>> development now, so today's version may depend on a certain version
>> of a third-party shared library (DLL) which we can only get in binary
>> form, and tomorrow's version may depend on the next version of that
>> library, and you cannot mix today's source with yesterday's
>> third-party DLL. I.e., to be able to use the code from 7 days ago at
>> 11.07 AM, you need "git checkout" to "return" our source AND the
>> binaries we were using back then. This is something ClearCase manages
>> satisfactorily.
>
> I understand. The problem in your case might not be too bad, after
> all. The problem only arises when you have big files that are
> compressed. If you check in multiple versions of an uncompressed .dll
> file, Git will usually do a very good job at compressing them.
>
> If they are compressed, what you probably need is something like a
> sparse clone, which is sort of available in the form of shallow
> clones, but that is still too limited.
>
> Having said that, in another company I work for, they have 20G
> repositories, and they will grow larger. That is something they
> incurred for historical reasons, and they are willing to pay the price
> in terms of disk space. Due to Git's distributed nature, they had no
> problems with cloning; they just use a local reference upon the
> initial clone.
>
>> I have read about:
>> - submodules + using different repositories once one "blob
>> repository" grows too much. This will probably be rejected because
>> it is quite contrived.
>
> I would also recommend against this, because submodules are a very
> weak part of Git.
>
>> - git-annex (does not get the files in when cloning, pulling or
>> checking out; you need to do it manually)
>> - git-media (same as git-annex)
>
> Yes, this is an option, but a bit clunky.
>
>> - boar (no, we do not want to use a VCS for binaries in addition to
>> git)
>
> I did not know about that.
>
>> - and a few more
>>
>> So far the only good solution seems to be git-bigfiles, but it's
>> still in development.
>
> It has stalled, apparently, but I wanted to have a look at it anyway.
> Will let you know of my findings!

I think in many applications the "download-on-demand" approach which
git-annex takes is very important. (I don't know how far our
sparse/shallow support covers this.) Also, their remote backends look
interesting. And no, I don't want Haskell as yet another language for
our code base.

Fedora handles big files (compressed tarballs) in git with a file
store, some scripting (fedpkg) and a text file with hash values
("sources"), which is the only thing tracked in git; something of a
baby version of git-annex.
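To make that concrete, here is a minimal sketch of such a scheme in
Python. This is made up for illustration only; the manifest format, the
store URL and the helper names are mine, not fedpkg's actual interface:

#!/usr/bin/env python3
# Sketch of a Fedora-style scheme: git tracks only a small "sources"
# manifest of "sha256 filename" lines; the big files themselves live
# in a plain HTTP file store, keyed by their hash.
import hashlib
import os
import urllib.request

STORE = "https://example.com/filestore"  # hypothetical big-file store

def sha256_of(path):
    # Hash in chunks so a multi-hundred-megabyte DLL does not fill RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_sources(manifest="sources"):
    # Download every listed file that is missing or has a stale hash.
    with open(manifest) as f:
        for line in f:
            if not line.strip():
                continue
            digest, name = line.split()
            if os.path.exists(name) and sha256_of(name) == digest:
                continue  # already present and verified
            urllib.request.urlretrieve(
                "%s/%s/%s" % (STORE, digest, name), name)
            if sha256_of(name) != digest:
                raise RuntimeError("checksum mismatch for " + name)

if __name__ == "__main__":
    fetch_sources()

Checking in a new version of a binary is then just uploading it to the
store and updating one line in "sources"; history stays small, and
"git checkout" of an old commit plus one fetch run gives you the
matching binaries.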
The symlink-based approach of annex (the big file is a symlink into the
"object store", which is indexed by the blob content's sha1) reminds me
very much of our notes trees and the way the textconv cache uses them.
It feels as if we already have all the pieces in place. (I don't think
we need to track big files' contents, only their hashes; this is fast
for read-only media, see annex's WORM backend.)

Another crazy idea would be to "git replace" big files by place-holders
(a blob with the big file's sha1 as content), or rather the other way
round, but I haven't thought this through. (See the PS below for a
rough sketch.)

Michael
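PS: To illustrate the place-holder idea, an untested sketch; the helper
names are made up, only "git hash-object" and "git replace" are real:

#!/usr/bin/env python3
# Commit a tiny place-holder blob that carries only the big file's
# sha1, and map it to the real content locally via "git replace".
import subprocess

def git(*args, **kwargs):
    out = subprocess.check_output(("git",) + args, **kwargs)
    return out.decode().strip()

def replace_with_placeholder(big_file):
    # Write the real content as a loose blob (it could equally live in
    # an annex-like store outside the repository).
    real = git("hash-object", "-w", big_file)
    # The place-holder blob contains nothing but the real blob's sha1;
    # it is the only thing that ever ends up in a tree/commit.
    placeholder = git("hash-object", "-w", "--stdin",
                      input=(real + "\n").encode())
    # refs/replace/<placeholder> -> <real>: wherever the place-holder
    # appears, this repository shows the real content instead.
    git("replace", placeholder, real)
    return placeholder

As far as I know, plain fetch/clone does not transport refs/replace/*
by default, so the replace refs (and with them the real blobs) would
have to be fetched separately -- which is exactly the download-on-demand
part again.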