Re: [RFC] Packing large repositories

On Wed, 28 Mar 2007, Dana How wrote:
> 
> I just started experimenting with using git on
> a large engineering project which has used p4 so far.
> Part of a checkout is about 55GB;
> after an initial commit and packing I have a 20GB+ packfile.

Oh wow. You don't do half measures, do you ;)

> Of course this is unusable, since object_entry's in an .idx
> file have only 32 bits in their offset fields.  I conclude that
> for such large projects,  git-repack/git-pack-objects would need
> new options to control maximum packfile size.

Either that, or update the index file format. I think that your approach 
of having a size limiter is actually the *better* one, though. 
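
For reference, this is roughly what a v1 .idx entry looks like on disk
(an illustrative layout, not the literal structures in the git sources);
the offset is a single 32-bit word, which is the whole problem:

#include <stdint.h>

/* one per object, sorted by SHA-1, after the 256-entry fanout table */
struct idx_v1_entry {
	uint32_t offset;	/* offset into the .pack, network byte order */
	unsigned char sha1[20];
};

So anything past 4GB simply cannot be represented, and with signed
arithmetic anywhere in the path you likely hit trouble at 2GB already,
which is presumably why the 2^31-1 limit you mention below is the safe
choice.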

> [ I don't think this affects git-{fetch,receive,send}-pack
> since apparently only the pack is transferred and it only uses
> the variable-length size and delta base offset encodings
> (of course the accumulation of the 7 bit chunks in a 32b
> variable would need to be corrected, but at least the data
> format doesn't change).]
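
Side note on that bracketed point: widening the accumulator to 64 bits
is indeed all it takes there. Something along these lines -- sketched
from memory, so the exact names and where it lives in the pack-reading
code will differ:

#include <stdint.h>

/*
 * Decode the per-object header in a pack: 3 bits of type and a size
 * spread over 7-bit chunks, low bits first, high bit = "more follows".
 */
static uint64_t decode_pack_object_size(const unsigned char **bufp,
					unsigned *type)
{
	const unsigned char *buf = *bufp;
	unsigned char c = *buf++;
	uint64_t size = c & 0x0f;	/* low 4 bits of the size */
	int shift = 4;

	*type = (c >> 4) & 0x07;
	while (c & 0x80) {
		c = *buf++;
		size += (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	*bufp = buf;
	return size;
}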

Well, it does affect fetching, in that "git index-pack" obviously would 
also need to be taught how to take one large incoming pack and split it 
up into multiple smaller indexed ones. But that shouldn't be 
fundamentally hard either, apart from the inconvenience of having to 
rewrite the object count in the pack headers..
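
The header rewrite itself is trivial -- the pack header is just 12
bytes ("PACK", a version, the object count), so patching the count is
mechanical; the annoying part is that the trailing SHA-1 over the whole
pack has to be redone afterwards. A rough sketch, not the exact helpers
in the tree:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

struct pack_header {
	uint32_t hdr_signature;		/* "PACK" */
	uint32_t hdr_version;
	uint32_t hdr_entries;		/* object count, network byte order */
};

static void set_pack_object_count(unsigned char *pack, uint32_t nr_objects)
{
	uint32_t count = htonl(nr_objects);

	/* the count is the third 32-bit word of the header */
	memcpy(pack + 8, &count, sizeof(count));

	/*
	 * The caller still has to re-run SHA-1 over everything up to
	 * the trailer and rewrite the final 20 bytes.
	 */
}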

To avoid that issue, it may be that it's actually better to split things 
up at pack-generation time *even* for the case of --stdout, exactly so 
that "git index-pack" wouldn't have to split things up (we potentially 
know a lot more about object sizes up-front at pack-generation time than 
we know at re-indexing).

> So I am toying with adding a --limit <size> flag to
> git-repack/git-pack-objects.

Sounds very sane.

> This cannot be used with --stdout.  If specified, e.g.
>  git-repack --limit 2g
> then each packfile created could be at most 2^31-1 bytes in size.

Sounds good, apart from the caveat above about "--stdout" that needs some 
thinking about.
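
The mechanics of the limit itself should be pretty simple, though: it's
just a check in the object write loop -- close the current pack and
start a fresh one whenever the next entry would blow past the limit.
A throw-away sketch; the helper names here are made up for illustration,
this is not the real pack-objects code:

#include <stdint.h>
#include <stdio.h>

#define PACK_HDR_SIZE	12	/* "PACK" + version + object count */
#define PACK_TRAILER	20	/* SHA-1 over everything before it */

static void finish_pack(uint64_t size)
{
	printf("closing pack at %llu bytes\n", (unsigned long long)size);
}

static uint64_t begin_pack(void)
{
	printf("starting a new pack\n");
	return PACK_HDR_SIZE;
}

/* called before writing each object; returns the (possibly reset) offset */
static uint64_t maybe_split(uint64_t offset, uint64_t entry_size,
			    uint64_t limit)
{
	if (limit && offset + entry_size + PACK_TRAILER > limit) {
		finish_pack(offset);
		offset = begin_pack();
	}
	return offset;
}

int main(void)
{
	uint64_t limit = (uint64_t)2 << 30;	/* --limit 2g */
	uint64_t offset = begin_pack();
	int i;

	/* pretend to write three ~1GB objects */
	for (i = 0; i < 3; i++) {
		uint64_t sz = 1000u * 1024 * 1024;

		offset = maybe_split(offset, sz, limit);
		offset += sz;
	}
	finish_pack(offset);
	return 0;
}

One wrinkle: for entries that get freshly deflated you don't know the
compressed size up front, so the check either works off an estimate or
the writer has to be able to back out an entry and restart it in the
new pack.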

> It's possible that multiple packfiles would be created in one shot.
> Thus git-pack-objects could write multiple names to stdout
> and git-repack would need to be updated accordingly.

Yes. That seems to be the least of all problems.

> Finally, I wonder if having tree/commit/tag objects mixed into
> such large packfiles would be a performance hit.

My initial reaction is that it's best to start off without worrying about 
that, and just do everything in the order that we do now (ie "sort by 
type first, recency second") and just split when we hit the size limit.
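
Just to make "type first, recency second" concrete, something like this
comparator would express that write order (illustrative fields, not the
actual object_entry in pack-objects.c):

#include <stdlib.h>

/* ready to hand to qsort() over the list of objects to be written */
struct write_entry {
	int type;		/* commit, tag, tree, blob */
	unsigned long recency;	/* lower = seen earlier in the rev walk */
};

static int write_order_cmp(const void *a_, const void *b_)
{
	const struct write_entry *a = a_;
	const struct write_entry *b = b_;

	if (a->type != b->type)
		return a->type - b->type;
	if (a->recency != b->recency)
		return a->recency < b->recency ? -1 : 1;
	return 0;
}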

But if you actually want to experiment with different organizations, I 
don't think that's wrong. I would just personally start without it.

		Linus