On 5/8/07, Nicolas Pitre <nico@xxxxxxx> wrote:
> On Tue, 8 May 2007, Dana How wrote:
> > I think it should be straightforward for me to re-submit this
> > based on current master.
>
> Since this patch is simpler it could be merged much faster, before
> the pack limit series.
Yes, I will do the "custom compression" patch first.
> > > > + /* differing core & pack compression when loose object -> must
> > > > recompress */
> > >
> > > I am not sure if that is worth it, as you do not know if the
> > > loose object you are looking at were compressed with the current
> > > settings.
> >
> > You do not know for certain, that is correct.  However, config
> > settings specifying unequal compression levels signal that you care
> > differently about the two cases.  (For me, I want the compression
> > investment to correspond to the expected lifetime of the file.)
> > Also, *if* we have the knobs we want in the config file, I don't
> > think we're going to be changing these settings all that often.
> >
> > If I didn't have this check forcing recompression in the pack, then
> > in the absence of deltification each object would enter the pack by
> > being copied (in the preceding code block) and pack.compression
> > would have little effect.  I actually experienced this the very
> > first time I imported a large dataset into git (I was trying to
> > achieve the effect of this patch by changing core.compression
> > dynamically, and was a bit mystified for a while by the result).
> >
> > Thus, if core.loosecompression is set to speed up git-add, I should
> > take the time to recompress the object when packing if
> > pack.compression is different (of course the hit of not doing so
> > will be lessened by deltification, which forces a new compression).
>
> Right.  And this also depends on whether or not you have
> core.legacyheaders set to false.  And the whole purpose of setting
> core.legacyheaders to false is exactly to allow loose objects to be
> copied straight into the pack.  This should have priority over
> mismatched compression levels, IMHO.
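[Aside, to make the intent above concrete: the setup I have in mind --
using the option names from this patch series, so treat the exact
spelling as an assumption -- would look something like this in
.git/config:

    [core]
            loosecompression = 1    # cheap deflate at git-add time
    [pack]
            compression = 9         # more effort on long-lived pack data

i.e. loose objects are written quickly, and the heavier compression is
only paid for when objects get repacked.]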
OK, I got really confused here, so I looked over the code and figured
out 2 causes for my confusion.

(1) core.legacyheaders controls use_legacy_headers, which defaults to 1.
So currently all loose objects are in legacy format and the code block I
spoke of doesn't trigger [without a config setting].  I didn't realize
legacy headers were still being produced (misled by the name!).

(2) I read your "setting core.legacyheaders" as followed by TRUE, but
you meant FALSE, which is not the default.  I also read that one year
after 1.4.2 the default for core.legacyheaders is going to change to
FALSE.  I think our discussion should assume this has happened, so let's
assume FALSE in the following.  The point of that is that such a FALSE
setting can't be assumed to carry any special intent; it will simply be
the default.

[Everything I write here boils down to only one question, which I repeat
at the end.]

Data gets into a pack in these ways:
1. Loose object copied in;
2. Loose object newly deltified;
3. Packed object to be copied;
4. Packed object to be newly deltified;
5. Packed deltified object we can't re-use;
6. Packed deltified object we can re-use.
["copied" includes recompressed.]

In (2), (4), and (5), pack.compression will always be newly used.  If
pack.compression doesn't change, this means (6) will be using
pack.compression as well, since it comes from (2) or (4).  So if I
"guarantee" that (1) uses pack.compression, (3) will as well, meaning
everything in the pack will be at pack.compression.

Thus, if pack.compression != core.loosecompression takes precedence over
core.legacyheaders = false, then for constant pack.compression we get
all 6 cases at level pack.compression.  If core.legacyheaders = false
takes precedence as you suggest, then all undeltified objects (20%?)
will be stuck at core.loosecompression [since I see no way to sensibly
re-apply compression to something copied pack-to-pack].  I think this is
inconsistent with what a pack.compression != core.loosecompression
setting is telling us.

My arguments have 2 biases.

(a) I assume pack.compression stays constant, and if it changes, I see
little value in worrying about "forcing" all objects in a new pack, some
of which might be copied from an old pack, to be at the same compression
level.

(b) I focused first on packing to disk, where a constant
pack.compression makes sense.  For packing to stdout (pack transfer),
especially on a popular, hence loaded, repository server, we may want to
use --compress=N to crank down the effort spent compressing new deltas
[2+4] (or loose objects [1]) (or unusable deltas [5]).  This still lets
better-compressed objects flow through in the other cases [3+6].  So for
one-use packs a mix of compression levels could be desirable, and it is
achievable with --compression=N if/when so desired.
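[To make the precedence question concrete, here is a minimal C-style
sketch of the check I have in mind for case (1).  It is illustrative
only -- not the actual pack-objects code -- and the function and
parameter names are made up; loose_level would in practice have to be
approximated by core.loosecompression, since the level actually used is
not recorded anywhere:

    /*
     * Decide whether a loose object's deflated data may be copied
     * straight into the pack, or must be recompressed at
     * pack.compression.  Sketch only; names are hypothetical.
     */
    static int can_copy_loose_data(int loose_is_new_format,
                                   int pack_level, int loose_level)
    {
            if (!loose_is_new_format)
                    return 0;  /* legacy header: rewritten (and recompressed) anyway */
            if (pack_level != loose_level)
                    return 0;  /* mismatched levels: recompress at pack.compression */
            return 1;          /* levels already match: copy the deflated stream as-is */
    }

The alternative you suggest would skip the middle test whenever the
loose object is in the new format, leaving case (1) -- and hence case
(3) on later repacks -- stuck at core.loosecompression.]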
> Also, when repacking, delta reuse does not recompress objects for the
> same reason, regardless of the compression level used when they were
> compressed initially.  Same argument goes for delta depth.
Delta reuse doesn't need recompression: for constant pack.compression,
the reused data will already be at the correct level.  I think we agree
on the behavior here, if for different reasons.
> So if you really want to ensure a compression level on the whole pack,
> you'll have to use -f with git-repack.  Or leave core.legacyheaders
> unset.
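[For the archives, that would be something like

    git repack -a -d -f

where -f disables delta reuse, so deltas are recomputed and therefore
recompressed under the current settings.]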
-f would be needed if you were in the practice of changing
pack.compression, yes.

So, after covering all these cases, have I convinced you that
recompression takes precedence over legacyheaders = FALSE?

Thanks,
-- 
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell