Re: [PATCH v3] Prevent megablobs from gunking up git packs

On 5/26/07, Nicolas Pitre <nico@xxxxxxx> wrote:
> On Sat, 26 May 2007, Dana How wrote:
> > Extremely large blobs distort general-purpose git packfiles.
> > These megablobs can either be stored in separate "kept" packfiles
> > or left as loose objects.  Here we add some features to help
> > with either approach.
> >
> > This patch implements the following:
> > 1. git pack-objects accepts --max-blob-size=N,  with the effect that
> >    only loose blobs smaller than N KB are written to the packfile(s).
> >    If an already-packed blob violates this limit (perhaps these are
> >    fast-import packs or max-blob-size was reduced),  it _is_ passed
> >    through if it comes from a local pack and no loose copy exists.
>
> I'm still not convinced by this feature.  Is it really necessary?
>
> Wouldn't it be better if the --max-blob-size=N were instead a
> --trailing-blob-size=N to specify which blobs are considered "naughty"
> per our previous discussion?  That way there is no incoherency with
> already-packed blobs larger than the threshold that you have to pass
> through.
>
> This, combined with the option to disable deltification of large blobs
> (both options can be specified with the same size), and possibly the
> pack size limit, would solve your large blob issue, wouldn't it?

Unfortunately, it doesn't.

There are at least three reasonable ways to handle large blobs:
(1) git-repack -a repacks everything.  Naughty blobs get pushed to
    the end as discussed (possibly dominating later split packs).
(2) Naughty blobs accumulate in separate "kept" packs.
    git-repack -a only repacks nice blobs.  Separate scripts,
    or new options to git-repack,  are needed to repack the "kept"
    packs (see the sketch after this list).  A number of people
    have discussed ideas like this.
(3) Naughty blobs are kept loose.
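
Half of (2) already works today: a pack marked with a .keep file is
skipped by git-repack -a.  What's still missing is a way to route the
naughty blobs into such a pack in the first place.  A rough sketch
(the pack name is only a placeholder):

  # Mark an existing pack of naughty blobs as "kept" so that
  # "git repack -a -d" leaves it alone; producing that pack in
  # the first place is the part that still needs new options.
  touch .git/objects/pack/pack-<sha1>.keep
  git repack -a -d    # repacks everything _except_ kept packs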

We have 255GB compressed in our Perforce repository and
it grows by 2GB+ per week.  Although I'm only considering bringing ~10%
of this into git,  it would be good for me to be able to argue that
I could bring more.  Every day the equivalent of 1K+ blobs is committed.
How often should I repack the shared repository [that replaces Perforce]?
With this level of traffic I believe I should do it every night.

I've been discussing these plans with IT here,  since they maintain
everything else.  They would like any part of the database that is
going to be reorganized and replaced to be backed up first.  If only
(1) is available and I repack every night,  then I need to back up the
entire repository every night as well.  If I use (2) or (3),  then I
back up just the repacked portion each night,  back up the kept packs
only when they are repacked (on a slower schedule),  and/or back up
the loose blobs on a similar schedule.
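
Concretely, the nightly job under (2) or (3) could be as simple as
something like this (the paths and the weekly schedule are only
illustrations, not part of the patch):

  # nightly: pick up only packs rewritten since the last backup;
  # under (2)/(3) that is just the freshly repacked "nice" pack
  find .git/objects/pack -name 'pack-*' -newer backup/.stamp \
      -exec cp -p {} backup/nightly/ \;
  touch backup/.stamp
  # kept packs and loose megablobs change rarely, so they can be
  # copied to backup/weekly/ on a slower schedule instead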

Besides this backup issue,  I simply don't want to have to repack _all_
of such a large repository each night.  With (1), nightly repacks get longer
and longer, and harder to schedule.

I think the minimum features needed to support (2) and (3) are the same:
(a) An easy way to prevent loose blobs exceeding some size limit
    from migrating into "nice" packs;
(b) A way to prevent packed objects from being copied when
    (i) they no longer meet the (new or reduced) size limit AND
    (ii) they exist in some other safe form in the repository.
The --max-blob-size=N behavior in this patch provides both of these,
while the other behavior people didn't like has been dropped.
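
For example, with this patch applied the nightly repack could run
something like the following (the 512 KB threshold and the pack path
are arbitrary illustrations):

  # Pack everything reachable, except that loose blobs of 512 KB
  # or more stay loose (3), ready to be swept into kept packs (2)
  git rev-list --objects --all |
      git pack-objects --max-blob-size=512 .git/objects/pack/pack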

You mentioned "incoherency" above;
I'm not too sure how to proceed on that.
If you have a more coherent way to provide (a) and (b) above,
please let me know.

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell
