Re: Performance issue: initial git clone causes massive repack

"Robin H. Johnson" <robbat2@xxxxxxxxxx> · Sat, 4 Apr 2009 17:37:53 -0700

On Sun, Apr 05, 2009 at 02:05:36AM +0200, Nicolas Sebrecht wrote:
> > Our full repository conversion is large, even after tuning the
> > repacking, the packed repository is between 1.4 and 1.6GiB. As of February
> > 4th, 2009, it contained 4886949 objects. It is not suitable for
> > splitting into submodules either unfortunately - we have a lot of
> > directory moves that would cause submodule bloat.
> Actually, I'm not sure that a full portage tree repository would be the
> best thing to do. It would not be suitable in the long term and working
> on the repository/history would be a big mess. Why provide such a repo?
> Or at least, why provide such a readable repo?
> 
> IMHO, you should provide a repository per upstream package on the main
> server.
That causes incredible bloat, unfortunately.

I'll summarize why here for the git mailing list. Most of our developers
have the entire tree checked out, and in informal surveys, would like to
continue to do so. There are ~13500 packages right now (I'm excluding
eclasses/, profiles/, scripts/), and growing by 15-25 new packages/week.
(~45% of packages also have a files/ directory).

For each package, the .git directory, assuming a single pack,
consumes at least 36 inodes.  Tail-packing is limited to Reiserfs3 and
JFS, and isn't widely used other than that, so assuming 4KiB per inode,
that's an overhead of at least 144KiB per package. Multiply by the
number of packages, and we get an overhead of roughly 2GiB, before we've
added ANY content.

Without tail packing, the Gentoo tree is presently around 520MiB (you
can fit it into ~190MiB with tail packing). This means that
repo-per-package would have an overhead in the range of 400%.
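
A rough sketch of the arithmetic above, in Python. The constant and
function names are illustrative only; the figures are the ones quoted
in this message, and the 4KiB-per-inode cost is the same assumption
made above (no tail packing).

```python
# Figures quoted in this message; names are illustrative only.
PACKAGES = 13_500        # approximate number of packages
INODES_PER_REPO = 36     # minimum inodes for a single-pack .git directory
KIB_PER_INODE = 4        # assumed on-disk cost per inode, no tail packing
TREE_MIB = 520           # size of the tree without tail packing


def per_repo_overhead_kib(inodes=INODES_PER_REPO, kib_per_inode=KIB_PER_INODE):
    """Minimum .git overhead for one package repository, in KiB."""
    return inodes * kib_per_inode


def total_overhead_gib(packages=PACKAGES):
    """Overhead summed over all package repositories, in GiB."""
    return packages * per_repo_overhead_kib() / (1024 * 1024)


def overhead_percent(packages=PACKAGES, tree_mib=TREE_MIB):
    """Overhead relative to the size of the tree itself, in percent."""
    return 100 * total_overhead_gib(packages) * 1024 / tree_mib
```

Unrounded, this comes to about 1.85GiB and about 365% of the tree size,
which is the same ballpark as the rounded 2GiB and 400% figures.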

Additionally, there's a lot of commonality between ebuilds and packages,
and having repo-per-package means that the compression algorithms can't
make use of it - dictionary algorithms are effective at compression for
a reason.
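
To illustrate the point about shared content, here is a minimal sketch
using Python's zlib. This is only an analogy: it is not git's pack
format or delta algorithm, and the function names are made up for the
example. It compares compressing similar blobs one by one against
compressing them as a single stream.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def separate_vs_shared(blobs):
    """Return (sum of per-blob compressed sizes, size of all blobs compressed together)."""
    separate = sum(compressed_size(b) for b in blobs)
    shared = compressed_size(b"".join(blobs))
    return separate, shared
```

Feeding it two near-identical ebuild-like texts should give a shared
size well below the separate total, since the second text is mostly
back-references to the first.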

Overhead is the reason that we refused to migrate to SVN as well.
- CVS, for each directory of data, has a constant overhead of 4 inodes
  (CVS/ CVS/Root CVS/Repository CVS/Entries)
- SVN, for each data directory, has another complete copy of the data,
  plus a minimum of 10 other inodes.
- Git costs a minimum of 36 inodes per repository. In a fully packed repo,
  the number of inodes tends to stay below 50 in all cases.
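
For anyone who wants to check these counts on their own checkout, a
small Python sketch follows. The function name is hypothetical, and the
count is approximate: it counts directory entries, so hard-linked files
would be counted once per link.

```python
import os


def count_inodes(root):
    """Approximate inode count: root itself plus every file and directory below it."""
    total = 1  # the root directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total
```

Calling count_inodes(".git") in a repository, or count_inodes("CVS") in
a CVS working directory, gives the per-directory cost discussed above.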

> PS: what about cc'ing gentoo-scm list ?
It's not an open-posting list, so anybody here on the git list simply
replying would not get their post on there. The issue has been raised
there, and this is mainly meant to find a resolution to that problem.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@xxxxxxxxxx
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85