Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 10 Aug 2016 13:17:22 -0700

Jeff King <peff@xxxxxxxx> writes:

> ...
> We could do analysis on any cycles that we find to
> distinguish the two cases (i.e., it is a bogus pack if and
> only if every delta in the cycle is in the same pack), but
> we don't need to. If there is a cycle inside a pack, we'll
> run into problems not only reusing the delta, but accessing
> the object data at all. So when we try to dig up the actual
> size of the object, we'll hit that same cycle and kick in
> our usual complain-and-try-another-source code.

I agree with all of the above reasoning.

> Actually, skimming the sha1_file code, I am not 100% sure that we detect
> cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since
> they always point backwards in the pack). But if that is the case, then
> I think we should fix that, not worry about special-casing it here.

Yes, but sha1_file.c?  It is the reading side and it is too late if
we notice a problem, I would think.

> +/*
> + * Drop an on-disk delta we were planning to reuse. Naively, this would
> + * just involve blanking out the "delta" field, but we have to deal
> + * with two extra pieces of book-keeping:
> + *
> + *   1. Removing ourselves from the delta_sibling linked list.
> + *
> + *   2. Updating our size; check_object() will have filled in the size of our
> + *      delta, but a non-delta object needs it true size.

Excellent point.

> +/*
> + * Follow the chain of deltas from this entry onward, throwing away any links
> + * that cause us to hit a cycle (as determined by the DFS state flags in
> + * the entries).
> + */
> +static void break_delta_cycles(struct object_entry *entry)
> +{
> +	/* If it's not a delta, it can't be part of a cycle. */
> +	if (!entry->delta) {
> +		entry->dfs_state = DFS_DONE;
> +		return;
> +	}
> +
> +	switch (entry->dfs_state) {
> +	case DFS_NONE:
> +		/*
> +		 * This is the first time we've seen the object. We mark it as
> +		 * part of the active potential cycle and recurse.
> +		 */
> +		entry->dfs_state = DFS_ACTIVE;
> +		break_delta_cycles(entry->delta);
> +		entry->dfs_state = DFS_DONE;
> +		break;
> +
> +	case DFS_DONE:
> +		/* object already examined, and not part of a cycle */
> +		break;
> +
> +	case DFS_ACTIVE:
> +		/*
> +		 * We found a cycle that needs broken. It would be correct to
> +		 * break any link in the chain, but it's convenient to
> +		 * break this one.
> +		 */
> +		drop_reused_delta(entry);
> +		break;
> +	}
> +}

Do we need to do anything to the DFS state of an entry when
drop_reused_delta() resets its other fields?  If we had this and
started from A (read "A--->B" as "A is based on B"):

    A--->B--->C--->A

we paint A as ACTIVE, visit B and then C and paint them as active,
and when we visit A for the second time, we drop it (i.e. break the
link between A and B), return and paint C as DONE, return and paint
B as DONE, and leaving A painted as ACTIVE, while the chain is now

         B--->C--->A

If we later find D that is directly based on A, wouldn't we end up
visiting A and attempt to drop it again?  drop_reused_delta() is
idempotent so there will be no data structure corruption, I think,
but we can safely declare that the entry is now DONE after calling
drop_reused_delta() on it (either in the function or in the caller
after it calls the function), no?

> + 2. Picking the next pack to examine based on locality (i.e., where we found
> +    something else recently).
> +
> +    In this case, we want to make sure that we find the delta versions of A and
> +    B and not their base versions. We can do this by putting two blobs in each
> +    pack. The first is a "dummy" blob that can only be found in the pack in
> +    question.  And then the second is the actual delta we want to find.
> +
> +    The two blobs must be present in the same tree, not present in other trees,
> +    and the dummy pathname must sort before the delta path.

> +# Create a pack containing the the tree $1 and blob $1:file, with
> +# the latter stored as a delta against $2:file.
> +#
> +# We convince pack-objects to make the delta in the direction of our choosing
> +# by marking $2 as a preferred-base edge. That results in $1:file as a thin
> +# delta, and index-pack completes it by adding $2:file as a base.

Tricky but clever and correct ;-)

> +make_pack () {
> +	 {
> +		 echo "-$(git rev-parse $2)"

Is everybody's 'echo' happy with dash followed by unknown string?

> +		 echo "$(git rev-parse $1:dummy) dummy"
> +		 echo "$(git rev-parse $1:file) file"
> +	 } |
> +	 git pack-objects --stdout |
> +	 git index-pack --stdin --fix-thin

An alternative

	git pack-objects --stdout <<-EOF |
	-$(git rev-parse $2)
        $(git rev-parse $1:dummy) dummy
        $(git rev-parse $1:file) file
	EOF
        git index-pack --stdin --fix-thin

looks somewhat ugly, though.

> +}
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html