Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size

On 4/4/07, Nicolas Pitre <nico@xxxxxxx> wrote:
> On Wed, 4 Apr 2007, Dana How wrote:
> > The motivations are to better support portable media,
> > older filesystems, and larger repositories without
> > awkward enormous packfiles.

> I wouldn't qualify "enormous" pack files as "awkward".
>
> It will always be more efficient to have only one pack to deal with
> (when possible of course).
Yes.  "(when possible of course)" refers to the remaining motivations
I didn't explicitly mention: the 32b offset limit in .idx files,
and keeping the mmap code working on a 32b system.
I realize there are better solutions in the pipeline,
but I'd like to address this now (for my own use) and hopefully
also create something useful for 4GB-limited filesystems,
USB sticks, etc.
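To make that constraint concrete, here is a minimal sketch (illustrative
only, not code from the patch): a pack offset is safe for mmap(2) on a
32b system only while it still fits in a signed 32-bit off_t, i.e. while
it stays below 1<<31.

	/* Sketch: on a 32b system without _FILE_OFFSET_BITS=64, off_t
	 * is a signed 32-bit type, so any pack offset at or above 2^31
	 * would turn negative.  Hence the 1<<31 default. */
	static int offset_is_mmap_safe(unsigned long offset)
	{
		return offset < ((unsigned long)1 << 31);
	}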

> > When --pack-limit[=N] is specified and --stdout is not,
> > all bytes in the resulting packfile(s) appear at offsets
> > less than N (which defaults to 1<<31).  The default
> > guarantees mmap(2) on 32b systems never sees negative off_t's.
> > The object stream may be broken into multiple packfiles
> > as a result, each properly and conventionally built.

> This sounds fine.  *However* how do you ensure that the second pack (or
> subsequent packs) is self-contained with regard to delta base objects
> when it is _not_ meant to be a thin pack?
Good question.  Search for "int usable_delta" in the patch.
With --pack-limit (offset_limit in the C code), a delta can be used
only if its base is in the same pack and has already been written out.
The first condition addresses your concern, and the second handles the
case where the base object gets pushed into the next pack.
These restrictions should be loosened for --thin-pack,
but I haven't done that yet.
Also, --pack-limit turns on --no-reuse-delta.  This is not strictly
necessary, but avoiding it would have meant hacking up even more
conditions, which I didn't want to do in a newbie submission.
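In sketch form the test looks something like this (types and field
names are illustrative only; see the patch for the real code):

	/* Illustrative structures, not the patch's exact layout. */
	struct object_entry {
		unsigned long offset;	/* 0 until written to the output pack */
		unsigned int pack_id;	/* which output pack the object lands in */
	};

	static unsigned long offset_limit;	/* --pack-limit, 0 if unset */

	static int usable_delta(const struct object_entry *entry,
	                        const struct object_entry *base)
	{
		if (!base)
			return 0;	/* not a delta at all */
		if (!offset_limit)
			return 1;	/* no --pack-limit: previous behavior */
		if (base->pack_id != entry->pack_id)
			return 0;	/* base ends up in a different pack */
		return base->offset != 0;	/* base already written out */
	}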

> > When --stdout is also specified, all objects in the
> > resulting packfile(s) _start_ at offsets less than N.
> > All the packfiles appear concatenated on stdout,
> > and each has its object count set to 0.  The behavior
> > without --stdout cannot be duplicated here since
> > lseek(2) is not generally possible on stdout.

> Please scrap that.  There is simply no point making --pack-limit and
> --stdout work together.  If the amount of data to send over the GIT
> protocol exceeds 4G (or whatever) it is the receiving end's business to
> split it up _if_ it wants/has to.  The alternative is just too ugly.
I have a similar though much weaker reaction, but Linus specifically
asked for this combination to work.  So I made it work as well as
possible given that no seeking is available.
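The object count of 0 follows directly from that restriction.  As a
sketch (illustrative, not the patch itself, though pack_header matches
pack.h): a pack written to a file can be rewound at close time to record
the final per-pack count, but a pipe cannot, so with --stdout the 0
written up front simply stays.

	#include <stdint.h>
	#include <unistd.h>
	#include <arpa/inet.h>

	struct pack_header {
		uint32_t hdr_signature;	/* "PACK" */
		uint32_t hdr_version;
		uint32_t hdr_entries;	/* per-pack object count */
	};

	/* Sketch: fix up the count when closing a pack. */
	static void fixup_object_count(int fd, uint32_t nr_objects,
	                               int to_stdout)
	{
		uint32_t count = htonl(nr_objects);

		if (to_stdout)
			return;	/* no lseek(2) on a pipe; count stays 0 */
		lseek(fd, 2 * sizeof(uint32_t), SEEK_SET);	/* skip sig+version */
		write(fd, &count, sizeof(count));
	}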

> > When --blob-limit=N is specified, blobs whose uncompressed
> > size is greater than or equal to N are omitted from the pack(s).
> > If --pack-limit is specified, --blob-limit is not, and
> > --stdout is not, then --blob-limit defaults to 1/4
> > of the --pack-limit.
> Is this really useful?
>
> If you have a pack size limit and a blob cannot make it even in a pack
> of its own then you're screwed anyway.  It is much better to simply fail
> the operation than to leave some blobs behind.  IOW I don't see the
> usefulness of this feature.
I agree if --stdout is specified.  This is why --pack-limit && --stdout
do NOT turn on --blob-limit when it isn't given explicitly.
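The interplay, roughly (illustrative names, not the patch's exact code):

	#include <stdint.h>

	static uint64_t pack_limit;	/* --pack-limit, 0 if unset */
	static uint64_t blob_limit;	/* --blob-limit, 0 if unset */
	static int blob_limit_given;	/* --blob-limit on the command line? */
	static int pack_to_stdout;	/* --stdout? */

	static void set_blob_limit_default(void)
	{
		/* Only on-disk packs get the 1/4 default; with --stdout
		 * nothing is silently left behind. */
		if (pack_limit && !blob_limit_given && !pack_to_stdout)
			blob_limit = pack_limit / 4;
	}

	/* During object selection: omit blobs at or above the limit. */
	static int blob_too_big(uint64_t uncompressed_size)
	{
		return blob_limit && uncompressed_size >= blob_limit;
	}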

However, if I'm building packs inside a non-(web-)published
repository, I find this useful.  First, if some blob is bigger
than the --pack-limit, I must drop it anyway -- it's not clear to me
that the mmap window code works on 32b systems
with >2GB objects in packs.  An "all-or-nothing" limitation
wouldn't be helpful to me.
But blobs even close to the packfile limit don't seem all that useful
to pack either (this of course is a weaker argument).
In the sample (p4) checkout I'm testing on [i.e. no history],
I have 56K+ objects consuming ~55GB uncompressed;
there are 9 blobs over 500MB each uncompressed.
I'm guessing packing them is not a performance advantage,
and I certainly wouldn't want frequently-used objects to be
stuck between them.  [ I guess my repo stats are going to
be a bit strange ;-) ]

Packing plays two roles: archive storage (long life) and
transmission (possibly short life).
These seem to pull the packing code in different directions.

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell