On Thu, Oct 21, 2010 at 5:52 AM, Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@xxxxxx> wrote:
> We are investigating the use of Git as a binary repository solution.
> Our larger files are near 800MB, and the total checked-out repo size
> is about 3GB; the repo size in SVN is more like 20-30GB. If we could
> prune the history prior to MR, we could get these sizes down
> considerably. This binary repo is really for our super-project build.
> From what I have read and learned, this is not a good fit for Git.
> Have there been performance improvements lately? Some of the posts I
> have read have been quite old.

Not really. Teams who need to store content like this are taking one of two approaches.

The first is to bite the bullet and use 64-bit systems with a lot of physical memory. Git allocates or memory-maps two blocks equal in size to the file it is working with, so an 800MB file needs ~1.6GB of physical memory just for the Git executable to touch it. For most modern desktops and servers this is easy to deal with; 4GB or 8GB of physical memory in a developer workstation is pretty inexpensive. If the files aren't delta-compressible, you can also speed up the delta compression that happens during `git gc` by marking the relevant paths with the "-delta" attribute in your .git/info/attributes file (there is a rough sketch of this at the end of this note).

Unfortunately this may mean that your Git repository is large (>20GB?), and each developer needs to make a full copy of it when they start to work on the project. That is a lot of data to move around, or to store locally. But again, when you look at the cost of disk on a developer workstation, this may not be an issue if your team can adopt a workflow where they don't clone the repository often. (E.g. the Android repository is about 7GB; developers clone it once and then don't need to again... so it is doable.)

The other option is to use a different repository for the binary files. Some teams are using a REST-enabled HTTP server like Amazon S3 (though you probably want something inside your corporate firewall) to store the large binary files. Instead of putting the files into Git, they put a small shell script and a pointer to each file into Git. The shell script downloads the large binary file when executed, and the build process (or the developer "start-up" instructions) executes the script to get the latest versions bootstrapped onto the local workstation (again, a sketch is at the end of this note).

> I also have some questions about how the workflow for getting all of
> the changes from several different teams merged into the one
> repository would operate. Do we set up a shared system for engineers
> to perform the merges on? Our teams are geographically dispersed.

Yes, this is the common approach. Actually, what I have started to see with Android is that each distributed office has a shared repository that the engineers in that office interact with on a daily basis, and someone in each office synchronizes that repository with a single central repository that lives somewhere else.

Because of the nature of Git, the central repository can be continuously pulled into the distributed office through a cron script. Engineers in the office therefore always have a "fairly current" version available, but can also fork off onto a side branch and defer merging with the other offices for a day or two.

Android teams are successfully using this approach by running Gerrit Code Review[1] as their central server, and using Gerrit's built-in replication feature to push updates to the distributed office servers.
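To make a few of the pieces above concrete, here are some rough sketches; the patterns, paths, and hostnames in them are made up for illustration. First, the attributes entries that turn off delta compression for the big files:

    # .git/info/attributes (or a tracked .gitattributes) -- skip delta
    # compression for large binaries; *.iso and *.img are just examples
    *.iso  -delta
    *.img  -delta

With that in place, `git gc` and `git repack` won't waste time trying to delta the big files against each other.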
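Next, the pointer-file idea. One hypothetical layout keeps a tiny blobs/*.ptr file per binary under version control, containing only the download URL, plus a script like this that the build (or the developer) runs:

    #!/bin/sh
    # fetch-blobs.sh -- sketch of the "pointer in Git, bytes on an
    # internal HTTP server" approach; the layout and URLs are invented.
    # Each tracked blobs/*.ptr file holds the URL of one large binary.
    set -e
    for ptr in blobs/*.ptr; do
        out=${ptr%.ptr}
        if [ ! -f "$out" ]; then
            echo "fetching $out"
            curl -f -o "$out" "$(cat "$ptr")"
        fi
    done

A real version would probably also store a checksum in the pointer file and verify it after the download.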
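For the office mirrors, the cron-driven pull can be as small as a mirror clone plus a periodic fetch (hostnames and paths invented):

    # one-time setup on the office server:
    git clone --mirror git://central.example.com/project.git /srv/git/project.git

    # /etc/cron.d/sync-project -- refresh from central every 15 minutes,
    # running as the local "git" user:
    */15 * * * *  git  cd /srv/git/project.git && git fetch --quiet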
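And on the Gerrit side, replication to the office servers is driven by a replication.config in the site's etc/ directory; roughly like this (the office name and host are made up, and you should check the documentation for your Gerrit version for the exact options):

    # $site_path/etc/replication.config on the central Gerrit server
    [remote "office-berlin"]
      url = gerrit@mirror.berlin.example.com:/srv/git/${name}.git
      push = +refs/heads/*:refs/heads/*
      push = +refs/tags/*:refs/tags/*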
In effect there is one central server for writes, but a lot of read operations are offloaded onto the distributed offices' local copies.

[1] http://code.google.com/p/gerrit/

-- 
Shawn.