Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size

On Wed, 4 Apr 2007, Dana How wrote:

> On 4/4/07, Nicolas Pitre <nico@xxxxxxx> wrote:
> > On Wed, 4 Apr 2007, Dana How wrote:
> > > The motivations are to better support portable media,
> > > older filesystems,  and larger repositories without
> > > awkward enormous packfiles.
> > 
> > I wouldn't qualify "enormous" pack files as "awkward".
> > 
> > It will always be more efficient to have only one pack to deal with
> > (when possible of course).
> Yes.  "(when possible of course)" refers to the remaining motivations
> I didn't explicitly mention: the 32b offset limit in .idx files,
> and keeping the mmap code working on a 32b system.
> I realize there are better solutions in the pipeline,
> but I'd like to address this now (for my own use) and hopefully
> also create something useful for 4GB-limited filesystems,
> USB sticks, etc.

I think this is a valid feature to have, no problem there.

> > > When --pack-limit[=N] is specified and --stdout is not,
> > > all bytes in the resulting packfile(s) appear at offsets
> > > less than N (which defaults to 1<<31).  The default
> > > guarantees mmap(2) on 32b systems never sees signed off_t's.
> > > The object stream may be broken into multiple packfiles
> > > as a result,  each properly and conventionally built.
> > >
> > 
> > This sounds fine.  *However*, how do you ensure that the second pack (or
> > subsequent packs) is self-contained with regard to delta base objects
> > when it is _not_ meant to be a thin pack?
> Good question.  Search for "int usable_delta" in the patch.
> With --pack-limit (offset_limit in C), you can use a delta if the base
> is in the same pack and already written out.  The first condition
> addresses your concern, and the second handles the case
> where the base object gets pushed to the next pack.
> These restrictions should be loosened for --thin-pack
> but I didn't do that yet.

OK.
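
Just so we're looking at the same thing, here is a minimal sketch of the
condition as I understand it -- not your patch; the field names written
and pack_id and the offset_limit parameter are only my illustration:

	/* Illustrative only: field names are assumptions, not the patch's. */
	struct object_entry {
		unsigned int written:1;	/* already written to the output pack? */
		unsigned int pack_id;	/* which split pack it went into */
	};

	/*
	 * With an offset limit in force, only use a delta when the base
	 * is in the same output pack and has already been written out,
	 * so that each split pack stays self-contained.
	 */
	static int delta_usable(const struct object_entry *entry,
	                        const struct object_entry *base,
	                        unsigned long offset_limit)
	{
		if (!offset_limit)
			return 1;	/* no limit: behave as before */
		if (!base->written)
			return 0;	/* base pushed to a later pack */
		return base->pack_id == entry->pack_id;
	}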

> Also, --pack-limit turns on --no-reuse-delta.
> This is not necessary, but not doing it would have meant
> hacking up even more conditions which I didn't want to do
> on a newbie submission.

The thing is, delta reuse is one of the most important features for good 
performance, so it has to work.

But let's take a moment to talk about your "newbie submission".  This 
patch is _way_ too large and covers too many things at once.  You'll 
have to split it into many smaller logical units, pretty please.  For 
example, if you have to borrow code from fast-import then make a patch 
that performs that code movement _only_.  Then you can have another 
patch that moves code around within builtin-pack-objects.c without 
changing any functionality, just to prepare the way for the next patch, 
which would concentrate on the new feature only.  Then make sure the 
addition of the new capability is separate from the patch that lets 
other parts of GIT use it.  Etc.

It would be much easier for us to comment on smaller patches, and 
especially easier for you to rework one of the smaller patches, rather 
than the whole big one, if need be.

I looked at your patch and there are things I like and other things I 
don't like at all.  Because it is a single large patch, I can only NAK 
the whole of it.

> > > When --stdout is also specified,  all objects in the
> > 
> > Please scrap that.  There is simply no point making --pack-limit and
> > --stdout work together.  If the amount of data to send over the GIT
> > protocol exceeds 4G (or whatever) it is the receiving end's business to
> > split it up _if_ it wants/has to.  The alternative is just too ugly.
> I have a similar but much weaker reaction, but Linus specifically asked for
> this combination to work.

Linus is a total incompetent who knows nothing about programming or good 
taste.  So never ever listen to what he says.  He is wrong, always.

And in this case he's more wrong than usual.

;-)

> So I made it work as well as possible
> given no seeking.

The fact is that there is no point in imposing split packs on the 
receiving end if it can accept a single pack just fine.  Pack layout is 
and must remain a local policy, not something that the remote end felt 
was good for the peer.  This is already true for the threshold value 
that decides whether fetched packs are kept as is or unpacked into 
loose objects.  The same applies to a pack size limit.

And as you discovered yourself, it is quite messy to implement in 
pack-objects when the output is a stream, because you don't know in 
advance how many objects will be sent, so you have to invent special 
markers to indicate the end of a pack.  This in turn prevents index-pack 
from knowing how much data to expect, and from providing any progress 
report at all.  The appropriate place to split packs in a fetch context 
is really within index-pack, not pack-objects.  And it is probably much 
easier to do in index-pack anyway.
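
To spell out why the streaming case is so awkward: the pack stream 
starts with a fixed header that declares the number of objects up 
front, so a producer that decides to split mid-stream has nothing 
honest to put there.  A rough sketch of that header (field names are 
mine, not git's; see the pack format documentation for the real 
layout):

	#include <stdint.h>

	/*
	 * A pack begins with:
	 *   4-byte signature "PACK"
	 *   4-byte version number (network byte order)
	 *   4-byte number of objects (network byte order)
	 * followed by the object entries and a trailing SHA-1.
	 *
	 * Because the object count comes before any object data, a
	 * producer streaming to stdout cannot split the pack halfway
	 * through without lying here -- hence the special end-of-pack
	 * markers, and hence why index-pack, which sees the whole
	 * stream anyway, is the better place to split.
	 */
	struct pack_header_sketch {
		char     signature[4];	/* "PACK" */
		uint32_t version;	/* htonl(2) */
		uint32_t nr_objects;	/* htonl(count), known in advance */
	};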

> > > When --blob-limit=N is specified,  blobs whose uncompressed
> > > size is greater than or equal to N are omitted from the pack(s).
> > > If --pack-limit is specified, --blob-limit is not, and
> > > --stdout is not,  then --blob-limit defaults to 1/4
> > > of the --pack-limit.
> > Is this really useful?
> > 
> > If you have a pack size limit and a blob cannot make it even in a 
> > pack of its own then you're screwed anyway.  It is much better to 
> > simply fail the operation than leaving some blobs behind.  IOW I 
> > don't see the usefulness of this feature.
> I agree if --stdout is specified.  This is why --pack-limit && --stdout
> DON'T turn on --blob-limit if not specified.

Let's forget about --stdout.  I hope I convinced you that it doesn't 
make sense.

> However, if I'm building packs inside a non-(web-)published
> repository, I find this useful. First of all, if there's some blob bigger
> than the --pack-limit I must drop it anyway -- it's not clear to me that
> the mmap window code works on 32b systems
> with >2GB-sized objects in packs.  An "all-or-nothing" limitation
> wouldn't be helpful to me.

But do you realize that if you drop even a single object you are 
screwed?  Fetching from a repo that is missing objects produces a 
corrupt fetch.  You cannot just send a bunch of objects and leave a 
couple behind in the hope that the peer will never need them.  For one, 
the peer would never be able to perform a successful git-fsck.

If you care only about your local usage, then this feature is bogus as 
well.  The blob _has_ to exist as a loose object before ever being 
packed.  If you have blobs larger than 2GB or 4GB and your filesystem is 
unable to cope with that, then you're screwed already.  You won't be 
able to check them out, etc.

> But blobs even close to the packfile limit don't seem all that useful
> to pack either (this of course is a weaker argument).

In the extreme case where you have a blob that is near the pack size 
limit, you'll end up with one pack per blob.  No problem there.  If 
you're lucky, you might have 10 big blobs whose sizes are near the 
pack size limit, but because they delta well against each other you 
might be able to stuff them all into a single pack.

And if a blob is larger than the pack limit, then either you should 
increase your pack size limit, or, if the pack limit is already near 
the filesystem's capacity for a single file, the blob cannot exist on 
that filesystem in the first place.

So you still have to convince me this is a useful feature.

> Packing plays two roles: archive storage (long life) and
> transmission (possibly short life).

Packing is also a _huge_ performance boost.  Don't underestimate that.


Nicolas