Re: Performance issue: initial git clone causes massive repack

On Sat, Apr 04, 2009 at 03:07:43PM -0700, Robin H. Johnson wrote:

> During an initial clone, I see that git-upload-pack invokes
> pack-objects, despite the ENTIRE repository already being packed - no
> loose objects whatsoever. git-upload-pack then seems to buffer in
> memory.

We need to run pack-objects even if the repo is fully packed because we
don't know what's _in_ the existing pack (or packs). In particular we
want to:

  - combine multiple packs into a single pack; this is more efficient on
    the network, because you can find more deltas, and I believe is
    required because the protocol sends only a single pack.

  - cull any objects which are not actually part of the reachability
    chain from the refs we are sending (a rough sketch of that walk
    follows below)
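
To illustrate the second point, here is a rough sketch in Python of the
kind of reachability walk involved, over a made-up toy object graph
rather than a real repository; the names are invented:

  # Walk the object graph from the refs being sent and keep only what
  # they can reach; everything else is culled from the outgoing pack.
  def reachable(refs, children_of):
      seen = set()
      stack = list(refs)
      while stack:
          obj = stack.pop()
          if obj in seen:
              continue
          seen.add(obj)
          stack.extend(children_of.get(obj, ()))
      return seen

  # Toy graph: commits point at trees/parents, trees at blobs/subtrees.
  # "x" and "y" exist in the repository but hang off no ref, so they
  # never make it into the outgoing pack.
  graph = {
      "ref-commit":    ["tree1", "parent-commit"],
      "parent-commit": ["tree0"],
      "tree1":         ["blob-a", "blob-b"],
      "tree0":         ["blob-a"],
      "x":             ["y"],
  }
  print(sorted(reachable(["ref-commit"], graph)))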

If no work needs to be done for either case, then pack-objects should
basically just figure that out and then send the existing pack (the
expensive bit is doing deltas, and we don't consider objects in the same
pack for deltas, as we know we have already considered that during the
last repack). It does mmap the whole pack, so you will see your virtual
memory jump, but nothing should require the whole pack being in memory
at once.
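
If you want to see that effect in isolation, here is a small Linux-only
Python sketch (the path is a placeholder and the /proc parsing is only
for illustration) showing VmSize jumping at mmap time while VmRSS only
grows as pages are actually touched:

  import mmap, os

  def vm_stats():
      # Read VmSize/VmRSS (in kB) from /proc; Linux-specific.
      stats = {}
      with open("/proc/self/status") as f:
          for line in f:
              if line.startswith(("VmSize", "VmRSS")):
                  key, value = line.split(":")
                  stats[key] = int(value.split()[0])
      return stats["VmSize"], stats["VmRSS"]

  path = "big.pack"                      # placeholder for a large packfile
  fd = os.open(path, os.O_RDONLY)
  size = os.fstat(fd).st_size

  print("before map:", vm_stats())
  m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
  print("after map: ", vm_stats())       # VmSize jumps, VmRSS barely moves
  m[0:4096]                              # touching data faults pages in
  print("after read:", vm_stats())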

pack-objects streams the output to upload-pack, which should only ever
have an 8K buffer of it in memory at any given time.
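
As a rough model of that relay loop, here is a Python sketch that reads
pack-objects output in 8K chunks and writes it on; the exact options
upload-pack passes to pack-objects differ, so treat the command line as
illustrative only:

  import subprocess

  BUF = 8192   # the small relay buffer; never the whole pack

  proc = subprocess.Popen(
      ["git", "pack-objects", "--revs", "--stdout"],
      stdin=subprocess.PIPE,
      stdout=subprocess.PIPE,
  )
  proc.stdin.write(b"HEAD\n")            # revisions to pack, one per line
  proc.stdin.close()

  with open("clone.pack", "wb") as out:  # stand-in for the client socket
      while True:
          chunk = proc.stdout.read(BUF)
          if not chunk:
              break
          out.write(chunk)               # at most 8K of pack data in hand
  proc.wait()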

At least that is how it is all supposed to work, according to my
understanding. So if you are seeing very high memory usage, I wonder if
there is a bug in pack-objects or upload-pack that can be fixed.

Maybe somebody more knowledgeable than me about packing can comment.

> During 'remote: Counting objects: 4886949, done.', git-upload-pack peaks at
> 2474216KB VSZ and 1143048KB RSS. 
> Shortly thereafter, we get 'remote: Compressing objects:   0%
> (1328/1994284)', git-pack-objects with ~2.8GB VSZ and ~1.8GB RSS. Here,
> the CPU burn also starts. On our test server machine (w/ git 1.6.0.6),
> it takes about 200 minutes walltime to finish the pack, IFF the OOM
> doesn't kick in.

Have you tried with a more recent git to see if it is any better? There
have been a number of changes since 1.6.0.6, although they look like
they mostly deal with better recovery from corrupted packs.

> Given that the repo is entirely packed already, I see no point in doing
> this.
> 
> For the initial clone, can the git-upload-pack algorithm please send
> existing packs, and only generate a pack containing the non-packed
> items?

I believe that would require a change to the protocol to allow multiple
packs. However, it may be possible to munge the pack header in such a
way that you basically concatenate multiple packs. You would still want
to peek in the big pack to try deltas from the non-packed items, though.
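
For what it is worth, the header munging would hinge on the pack format:
a 12-byte header ("PACK", a 4-byte version, a 4-byte big-endian object
count), then the object entries, then a 20-byte SHA-1 over everything
before it. Here is a naive Python sketch of the concatenation idea,
ignoring the real complications (duplicate objects, delta handling),
purely to show where the count lives:

  import hashlib, struct

  def concat_packs(pack_a, pack_b):
      sig_a, ver_a, n_a = struct.unpack(">4sLL", pack_a[:12])
      sig_b, ver_b, n_b = struct.unpack(">4sLL", pack_b[:12])
      assert sig_a == sig_b == b"PACK" and ver_a == ver_b

      body_a = pack_a[12:-20]            # strip header and SHA-1 trailer
      body_b = pack_b[12:-20]

      header = struct.pack(">4sLL", b"PACK", ver_a, n_a + n_b)
      data = header + body_a + body_b
      return data + hashlib.sha1(data).digest()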

I think all of this falls into the realm of the GSOC pack caching project.
There have been other discussions on the list, so you might want to look
through those for something useful.

-Peff
