Marcel M. Cary wrote:
> My company manages code in a similar way, except we avoid this kind of
> issue (with 100 gigabytes of user-uploaded images and other data) by not
> checking in the data.  We even went so far as to halve the size of our
> repository by removing 2GB of non-user-supplied images -- rounded
> corners, background gradients, logos, etc, etc.  This made Git
> noticeably faster.  Disk space is cheap.
>
> While I'd love to be able to handle your kind of use case and data size
> with Git in that way, it's a little beyond the intended usage to handle
> hundreds of gigabytes of binary data, I think.
>
> I imagine as your web site grows, which I'm assuming is your goal, your
> problems with scaling Git will continue to be a challenge.
>
> Maybe you can find a way to:
>
> * Get along with less data in your non-production environments; we're
>   hoping to be able to do this eventually

We do that by only cloning/checking out certain modules.  However, as is
always the case, sometimes a bug occurs with production data, and you need
the real data to track it down.

> * Find other ways to copy it; we use rsync even though it does take
>   forever to crawl over the file system
>
> * Put your data files in a separate Git repository, at least, assuming
>   you check in, update, and release code more often than your video
>   files.  That way you'll experience pain less often, and maybe even be
>   able to tune your repository differently.

As already mentioned, our sub-sites *are* in separate repos.  There's a
base repository that has just the event/backend code, and then 32 *other*
repositories where the actual websites live.

We want to use *some* kind of versioning system.  Having a history of
*all* changes is extremely useful, not to mention being able to track what
each separate user does as they modify their files through their browser.

Subversion is right out.  It's centralized, and it leaves its .svn poop
all over the working copy.

Mercurial is also right out.  If you make several *separate* commits of
*separate* files but don't push for some time, and then do a push/pull
where the sum total of the changes is larger than about 2GB, Mercurial
fails when it then tries to update the working directory.  That 2GB
ceiling is a hard-coded Python limit (even on a 64-bit host), because
Mercurial reads the entire set of changes into a single Python string.

git mmaps files and does window scanning of the pack files.  It *might*
read a single file entirely into memory for compression purposes; I'm not
certain about that.  We certainly haven't hit any limits that cause it to
fail outright.

I haven't tried any others.
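
Since the difference between the two approaches is the crux of it, here is
a toy sketch in Python of what I mean -- only an illustration, not either
tool's actual code, and the function names and 32MB window size are made
up.  Reading everything into one string is bounded by how large a string
you can hold; mmap'ing and walking the file a fixed-size window at a time
keeps memory use bounded no matter how large the pack is:

    import mmap
    import os

    def read_all(path):
        # Mercurial-style, as described above: the entire set of changes
        # ends up in one string, which is where the 2GB limit comes in.
        with open(path, 'rb') as f:
            return f.read()

    def scan_windows(path, window=32 * 1024 * 1024):
        # git-style: mmap the file and walk it one fixed-size window at a
        # time, so resident memory stays bounded regardless of file size.
        size = os.path.getsize(path)
        with open(path, 'rb') as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                for off in range(0, size, window):
                    yield m[off:off + window]

For what it's worth, git also exposes its windowing as configuration
(core.packedGitWindowSize, core.packedGitLimit, pack.windowMemory), so how
much it maps at once can be tuned down on machines with little RAM.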