Hi,

This is the first in a series of mails, over the next few days, on issues we've run into while planning a potential migration of Gentoo's repository to Git.

Our full repository conversion is large: even after tuning the repacking, the packed repository is between 1.4 and 1.6 GiB. As of February 4th, 2009, it contained 4886949 objects. Unfortunately, it is not suitable for splitting into submodules either - we have a lot of directory moves that would cause submodule bloat.

During an initial clone, I see that git-upload-pack invokes pack-objects, despite the ENTIRE repository already being packed - no loose objects whatsoever. git-upload-pack then seems to buffer the result in memory. In a small repository this wouldn't be a problem, as the whole repository fits in memory very easily. With our large repository, however, git-upload-pack and git-pack-objects grow to well more than the size of the packed repository, and are usually killed by the OOM killer.

During 'remote: Counting objects: 4886949, done.', git-upload-pack peaks at 2474216KB VSZ and 1143048KB RSS. Shortly thereafter, at 'remote: Compressing objects: 0% (1328/1994284)', git-pack-objects sits at ~2.8GB VSZ and ~1.8GB RSS; this is also where the CPU burn starts. On our test server machine (w/ git 1.6.0.6), it takes about 200 minutes of walltime to finish the pack, provided the OOM killer doesn't kick in first.

Given that the repo is entirely packed already, I see no point in doing this. For the initial clone, can the git-upload-pack algorithm please send the existing packs as-is, and only generate a pack containing the objects that aren't packed yet? Sketches of how we verified the packed state and how the memory figures can be reproduced follow below.
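For reference, here is a minimal sketch of how the fully-packed state can be confirmed. The repository path is hypothetical, and the repack flags are just one plausible tuning, not necessarily the exact ones we used:

    # Hypothetical location of the converted repository.
    $ cd /var/git/gentoo.git

    # Repack everything into a single pack: -a takes all reachable
    # objects, -d drops the now-redundant old packs and loose objects,
    # -f recomputes deltas. Depth/window values are illustrative.
    $ git repack -a -d -f --depth=50 --window=100

    # With no loose objects left, "count" and "size" should both
    # report 0, and every object should show up under "in-pack".
    $ git count-objects -v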
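Similarly, a sketch of how the VSZ/RSS figures quoted above can be sampled on the server side while a clone is in progress; the clone URL is hypothetical:

    # Client side: kick off the initial clone.
    $ git clone git://git.example.org/gentoo.git

    # Server side: sample the pack-generation processes every 10
    # seconds. ps reports VSZ/RSS in KB.
    $ watch -n 10 "ps axo pid,vsz,rss,etime,args | grep -E 'upload-pack|pack-objects' | grep -v grep"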
--
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85