Better big file support & GSoC

Hi Git people,

I'm applying to GSoC with the Git project, and I would like to help build a
better big-file support mechanism.

I have read the latest threads on this topic:
http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852

Here's a compilation of what I read and what I think.

The issue that comes up most often is OOM errors. I think the underlying
problem is that git tries to work exactly the same way on binaries as on text.
If we managed, one way or another, to skip the tasks that make no sense on
binaries (what "intelligent" operations are possible on binary data? Almost
none...), we should be able to avoid them most of the time.
This means a first step would be to introduce an autodetection mechanism,
as sketched below.
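
For illustration, here is a minimal sketch (Python, names are mine) of the
kind of heuristic Git already uses when deciding whether to diff a file:
treat it as binary if a NUL byte shows up in the first few kilobytes.

  # Heuristic binary detection, similar in spirit to Git's own check
  # for diffs: look for a NUL byte in the first 8000 bytes.
  def looks_binary(path, probe_size=8000):
      with open(path, "rb") as f:
          chunk = f.read(probe_size)
      return b"\0" in chunk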

Jeff King argues that, for binaries, we get uninteresting diffs and compression
is often useless. I agree. We would be better off not compressing them at all
(okay, tons of zeroes would compress well, but who is going to track zeroes?).
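
A quick illustration of that point (Python; the random buffer just stands in
for already-compressed content like a JPEG or a zipped tarball):

  import os, zlib

  already_compressed = os.urandom(1 << 20)   # stands in for JPEG/zip content
  zeroes = b"\0" * (1 << 20)

  print(len(zlib.compress(already_compressed)))  # ~1 MB, essentially no gain
  print(len(zlib.compress(zeroes)))              # ~1 KB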

Eric Montellese says: "Don't track binaries in git. Track their hashes." I agree
here too. We should not treat opaque binary data like source code (or any other
text). He points out that he needs to handle repos containing source code plus
zipped tarballs plus large and/or numerous binaries. Users seem to really need
binary tracking, so git should support it; I have personally needed it a couple
of times.
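
To make the idea concrete, here is a small sketch (Python, layout of the
pointer file is my own invention, similar to what git-media does): the repo
would only contain a tiny pointer file holding the hash, and the real blob
would live in an external store keyed by that hash.

  # Sketch: replace a large binary by a small pointer file containing its hash.
  import hashlib

  def pointer_for(path):
      h = hashlib.sha1()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return "external-blob %s\n" % h.hexdigest()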

He also says that we may want download-as-needed and remove-unnecessary
operations, and I think it could be clean enough to add a git command like
'git blob' to handle the special operations on binaries. Perhaps in a second step.
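
Purely as a sketch of what such a 'git blob' command could do internally
(Python; the store URL, cache directory and function names are all
hypothetical), download-as-needed and remove-unnecessary could look like:

  # Fetch a blob from an external store only if it is not cached locally,
  # and allow dropping the local copy while keeping the pointer in the repo.
  import os, urllib.request

  def materialize(sha1, store_url, cache_dir=".git/external-blobs"):
      local = os.path.join(cache_dir, sha1)
      if not os.path.exists(local):          # download only when needed
          os.makedirs(cache_dir, exist_ok=True)
          urllib.request.urlretrieve("%s/%s" % (store_url, sha1), local)
      return local

  def drop(sha1, cache_dir=".git/external-blobs"):
      # "remove-unnecessary": the pointer stays, the data goes away
      path = os.path.join(cache_dir, sha1)
      if os.path.exists(path):
          os.remove(path)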

Another idea was to create "sparse" repos, considered leaves since they cannot
be cloned from because they lack the full data. But that may or may not be in
the spirit of Git...


What I personally would like as a feature is the ability to store the main
repo, with sources etc., in a conventional repo but put the data elsewhere,
on a separate storage location. This would make it possible to develop programs
which need data to run (like textures in games) without making the repo slow,
big or just messy.
I faced this situation on TuxFamily, where the website, Git/SVN etc. live on
one fast server and the download area on another. The first was limited to
something like 100MB and the second to 1GB, extensible if needed.
Along the same lines, on my home server with multiple OpenVZ containers I host
the repos for my projects on one freely accessible container, which may be
attacked or even compromised, and which has a small disk partition. On the
other side, my data sits on an ssh-only, secured, firewalled, large partition.
It would be useful to have the code on the first but the ssh'd data on the other.
I suspect there are many other situations where a separation between code and
data would help administrators and/or users.
To handle this I thought of a mechanism allowing a sort of branch (to make use
of multiple 'remote' entries) to be checked out at the same time as the code...
In addition we should use an extensible protocol layer to manage the data.
git-annex and git-media, which already address some of the problems here,
use various transports like ssh, http and s3. And I just saw that Debian's git
package already recommends rsync.
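
To show what I mean by an extensible protocol layer, here is a rough sketch
(Python; the interface and class names are hypothetical, this is just the
kind of pluggable transport that git-annex/git-media already have):

  import subprocess

  class Transport:
      # Abstract access to an external data store.
      def get(self, sha1, local_path):
          raise NotImplementedError
      def put(self, local_path, sha1):
          raise NotImplementedError

  class RsyncTransport(Transport):
      def __init__(self, remote):
          self.remote = remote  # e.g. "user@host:/data/blobs"
      def get(self, sha1, local_path):
          subprocess.check_call(["rsync", "%s/%s" % (self.remote, sha1), local_path])
      def put(self, local_path, sha1):
          subprocess.check_call(["rsync", local_path, "%s/%s" % (self.remote, sha1)])

  # An SshTransport, HttpTransport or S3Transport would plug in the same way.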


What do you think about all of this? Would it fit as a GSoC project? It is
certainly an interesting task. It may sound too long, but of course if the
summer turned out to be too short I would not drop the project on the floor
as soon as the time limit is reached.


Best regards,

--
Jonathan Michalon

