Re: [PATCH v3] Prevent megablobs from gunking up git packs

On 5/26/07, Junio C Hamano <junkio@xxxxxxx> wrote:
Dana How <danahow@xxxxxxxxx> writes:
> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> @@ -371,8 +372,6 @@ static unsigned long write_object(struct sha1file *f,
>                               /* no if no delta */
>       int usable_delta =      !entry->delta ? 0 :
> -                             /* yes if unlimited packfile */
> -                             !pack_size_limit ? 1 :
>                               /* no if base written to previous pack */
>                               entry->delta->offset == (off_t)-1 ? 0 :
>                               /* otherwise double-check written to this
> @@ -408,7 +407,7 @@ static unsigned long write_object(struct sha1file *f,
>               buf = read_sha1_file(entry->sha1, &type, &size);
>               if (!buf)
>                       die("unable to read %s", sha1_to_hex(entry->sha1));
> -             if (size != entry->size)
> +             if (size != entry->size && type == obj_type)
>                       die("object %s size inconsistency (%lu vs %lu)",
>                           sha1_to_hex(entry->sha1), size, entry->size);

I do not quite get how these two hunks relate to the topic of
this patch.  Care to enlighten?

No problem.

When the code decides that a blob should not be written to the output file,
I must make sure it is not used as a delta base.  A large blob that
triggered the size test and _was_ a delta base can arise when maxblobsize
decreases or is newly specified, in both cases without -f/--no-object-reuse;
we need to tolerate the user forgetting that option.

To make sure it is not so used, I re-use the trick from maxpacksize
which ensures that a delta base is not in the previous split pack:
I set the offset field to -1.  Unfortunately, I only checked for this
magic value when computing usable_delta if pack_size_limit was set.
It turns out the test doesn't need to be conditional on pack_size_limit;
it works in all cases.  So, since I need the test when maxblobsize is
specified and maxpacksize isn't, I deleted the pack_size_limit check.
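
To illustrate the sentinel, here is a minimal standalone sketch of the
pattern after this change (the struct and helper names are illustrative,
not git's actual definitions):

    #include <stdio.h>
    #include <sys/types.h>

    struct entry {
        struct entry *delta;   /* delta base, if any */
        off_t offset;          /* (off_t)-1 means "not written to this pack" */
    };

    /* A delta base is usable only if it exists and lands in this pack. */
    static int usable_delta(const struct entry *e)
    {
        if (!e->delta)
            return 0;                      /* no delta at all */
        if (e->delta->offset == (off_t)-1)
            return 0;                      /* base dropped or in previous pack */
        return 1;
    }

    int main(void)
    {
        struct entry base = { NULL, (off_t)-1 };    /* dropped megablob */
        struct entry obj  = { &base, 0 };
        printf("usable: %d\n", usable_delta(&obj)); /* prints 0 */
        return 0;
    }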

Now for the second hunk.  The facts above mean we could have marked
this entry as a reused delta, yet we are unable to reuse the delta
because its base is not being written to this pack.  So we fall into
the !to_reuse case even though the size field in the object_entry is
the size of the delta, not of the object.  We can detect this because
the type returned by read_sha1_file differs from the type set from the
pack (which is one of OBJ_{REF,OFS}_DELTA), so I disable the
size-matching test in that case.
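
A standalone sketch of that mismatch (the enum values and sizes below are
made up for illustration; only the comparison mirrors the hunk):

    #include <stdio.h>

    /* Illustrative types only; values don't match git's enum. */
    enum object_type { OBJ_BLOB, OBJ_OFS_DELTA, OBJ_REF_DELTA };

    int main(void)
    {
        /* What the object list recorded for a reused delta: */
        enum object_type obj_type = OBJ_OFS_DELTA; /* type from the pack */
        unsigned long entry_size = 123;            /* size of the *delta* */

        /* What read_sha1_file() reports for the same sha1: */
        enum object_type type = OBJ_BLOB;          /* canonical type */
        unsigned long size = 4096;                 /* full object size */

        /* The sizes differ, but so do the types, so it is not an
         * inconsistency -- the entry merely carries delta metadata: */
        if (size != entry_size && type == obj_type)
            puts("size inconsistency");            /* not reached here */
        else
            puts("ok: sizes describe different representations");
        return 0;
    }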

> @@ -564,6 +563,17 @@ static off_t write_one(struct sha1file *f,
> +     /* refuse to include as many megablobs as possible */
> +     if (max_blob_size && e->size >= max_blob_size) {
> +             struct stat st;
> +             /* skip if unpacked, remotely packed, or loose anywhere */
> +             if (!e->in_pack || !e->in_pack->pack_local || find_sha1_file(e->sha1, &st)) {
> +                     e->offset = (off_t)-1;  /* might drop reused delta base if mbs less */
> +                     written++;
> +                     return offset;
> +             }
> +     }
> +

I thought that you are simply ignoring the "naughty blobs"---why
should it be done this late in the call sequence?  I haven't
followed the existing code or your patch closely, but I wonder
why the filtering is not simply done inside (or by the caller of)
add_object_entry().  You would need to do sha1_object_info()
much earlier than the current code does, though.

Recently Nicolas Pitre improved the code as follows:
(1) Tree-walking etc., which calls add_object_entry.
    We learn sha1, type, name(path), pack & offset, and no_try_delta
    during this step.
(2) NEW: sort a table of pointers to these objects by pack offset.
(3) Now call check_object on each object, but in the order
    determined in (2).  We learn each object's size during this
    step, which requires inspecting each object's header in the
    pack(s).

The result is that we smoothly scan through the pack(s),
instead of jumping all over the place.
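
Step (2) is just an offset sort; something like this (illustrative
struct and function names, not Nicolas's actual code):

    #include <stdlib.h>
    #include <sys/types.h>

    struct object_entry {
        off_t in_pack_offset;
        /* ... sha1, type, etc. ... */
    };

    static int cmp_offset(const void *a, const void *b)
    {
        const struct object_entry *ea = *(const struct object_entry *const *)a;
        const struct object_entry *eb = *(const struct object_entry *const *)b;
        if (ea->in_pack_offset < eb->in_pack_offset) return -1;
        if (ea->in_pack_offset > eb->in_pack_offset) return 1;
        return 0;
    }

    static void sort_by_pack_offset(struct object_entry **tbl, size_t n)
    {
        qsort(tbl, n, sizeof(*tbl), cmp_offset);
        /* step (3) then calls check_object on tbl[0..n-1] in this order,
         * so the pack is read front to back */
    }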

If I move sha1_object_info earlier, before (2), then I undo his
optimization.  That fact ultimately justifies the first two hunks you
commented on: it means objects must appear in the object list _before_
we can decide not to write them, so we need to handle unwritten
objects and all their consequences (which didn't seem too strange to
me, since you already have preferred bases).

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell
