Re: [PATCH v3] repack: enable bitmaps by default on bare repos

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, May 23 2019, Jeff King wrote:

> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
>
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
>
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

No, sorry. I think we could make an artificial test to emulate it, which
would be something like:

 * ~1 million commits
 * local clone setup to fetch all branches/tags (default 'git clone')
 * local a bit ahead/behind
 * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that
 * push try to push the local change upstream (or to a topic branch)

I tried briefly to emulate this with git.git with:

    (
        rm -rf /tmp/git /tmp/git.old &&
        git init /tmp/git.old &&
        git clone --bare https://github.com/git/git.git /tmp/git &&
        cd /tmp/git &&
        old_commit=$(git rev-parse v2.20.0^{}) &&
        git rev-list v2.12.0..v2.21.0|parallel 'git branch topic-{} {}' &&
        cd /tmp/git.old &&
        echo /tmp/git/objects >.git/objects/info/alternates &&
        git branch master $old_commit &&
        git reset --hard master &&
        git repack -Adb &&
        rm .git/objects/info/alternates &&
        for c in {1..10}
        do
            >$c &&
            git add $c &&
            git commit -m"bump $c"
        done
    )

But didn't get anywhere, probably because here my topics are all stuff I
have already, and I just have 2x packs.

It would be really cool to have some test tool that could produce
various "shape of repo" states like that. I.e. given an end-state of a
public repo emulate various plausible local client states like that
given some assumptions about how often the local client fetches, what
the GC settings are etc.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.
>
> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
>
>   A--B--C
>       \
>        D
>
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not. When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).
>
> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
>
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
>
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
>
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
>
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).
>
> There are a lot of rough edges in the patch; I was just trying to see if
> the idea was even viable. So don't bother reviewing it as a real
> proposal for inclusion. If you do read it, I'd recommend just looking at
> the post-image, since it's essentially a rewrite and the diff is a bit
> messy.
>
> -Peff
>
> ---
>  pack-bitmap.c | 398 ++++++++++++++++++++++++--------------------------
>  1 file changed, 193 insertions(+), 205 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6069b2fe55..4a40f62a38 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -12,6 +12,8 @@
>  #include "packfile.h"
>  #include "repository.h"
>  #include "object-store.h"
> +#include "blob.h"
> +#include "prio-queue.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
>  	return bitmap_pos + bitmap_git->pack->num_objects;
>  }
>
> -struct bitmap_show_data {
> -	struct bitmap_index *bitmap_git;
> -	struct bitmap *base;
> -};
> -
> -static void show_object(struct object *object, const char *name, void *data_)
> -{
> -	struct bitmap_show_data *data = data_;
> -	int bitmap_pos;
> -
> -	bitmap_pos = bitmap_position(data->bitmap_git, &object->oid);
> -
> -	if (bitmap_pos < 0)
> -		bitmap_pos = ext_index_add_object(data->bitmap_git, object,
> -						  name);
> -
> -	bitmap_set(data->base, bitmap_pos);
> -}
> -
> -static void show_commit(struct commit *commit, void *data)
> -{
> -}
> -
> -static int add_to_include_set(struct bitmap_index *bitmap_git,
> -			      struct include_data *data,
> -			      const struct object_id *oid,
> -			      int bitmap_pos)
> -{
> -	khiter_t hash_pos;
> -
> -	if (data->seen && bitmap_get(data->seen, bitmap_pos))
> -		return 0;
> -
> -	if (bitmap_get(data->base, bitmap_pos))
> -		return 0;
> -
> -	hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid);
> -	if (hash_pos < kh_end(bitmap_git->bitmaps)) {
> -		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, hash_pos);
> -		bitmap_or_ewah(data->base, lookup_stored_bitmap(st));
> -		return 0;
> -	}
> -
> -	bitmap_set(data->base, bitmap_pos);
> -	return 1;
> -}
> -
> -static int should_include(struct commit *commit, void *_data)
> -{
> -	struct include_data *data = _data;
> -	int bitmap_pos;
> -
> -	bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid);
> -	if (bitmap_pos < 0)
> -		bitmap_pos = ext_index_add_object(data->bitmap_git,
> -						  (struct object *)commit,
> -						  NULL);
> -
> -	if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid,
> -				bitmap_pos)) {
> -		struct commit_list *parent = commit->parents;
> -
> -		while (parent) {
> -			parent->item->object.flags |= SEEN;
> -			parent = parent->next;
> -		}
> -
> -		return 0;
> -	}
> -
> -	return 1;
> -}
> -
> -static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
> -				   struct rev_info *revs,
> -				   struct object_list *roots,
> -				   struct bitmap *seen)
> +static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
> +				struct bitmap **base, struct commit *commit)
>  {
> -	struct bitmap *base = NULL;
> -	int needs_walk = 0;
> -
> -	struct object_list *not_mapped = NULL;
> -
> -	/*
> -	 * Go through all the roots for the walk. The ones that have bitmaps
> -	 * on the bitmap index will be `or`ed together to form an initial
> -	 * global reachability analysis.
> -	 *
> -	 * The ones without bitmaps in the index will be stored in the
> -	 * `not_mapped_list` for further processing.
> -	 */
> -	while (roots) {
> -		struct object *object = roots->item;
> -		roots = roots->next;
> -
> -		if (object->type == OBJ_COMMIT) {
> -			khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, object->oid);
> -
> -			if (pos < kh_end(bitmap_git->bitmaps)) {
> -				struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
> -				struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
> -
> -				if (base == NULL)
> -					base = ewah_to_bitmap(or_with);
> -				else
> -					bitmap_or_ewah(base, or_with);
> -
> -				object->flags |= SEEN;
> -				continue;
> -			}
> -		}
> -
> -		object_list_insert(object, &not_mapped);
> -	}
> -
> -	/*
> -	 * Best case scenario: We found bitmaps for all the roots,
> -	 * so the resulting `or` bitmap has the full reachability analysis
> -	 */
> -	if (not_mapped == NULL)
> -		return base;
> -
> -	roots = not_mapped;
> -
> -	/*
> -	 * Let's iterate through all the roots that don't have bitmaps to
> -	 * check if we can determine them to be reachable from the existing
> -	 * global bitmap.
> -	 *
> -	 * If we cannot find them in the existing global bitmap, we'll need
> -	 * to push them to an actual walk and run it until we can confirm
> -	 * they are reachable
> -	 */
> -	while (roots) {
> -		struct object *object = roots->item;
> -		int pos;
> -
> -		roots = roots->next;
> -		pos = bitmap_position(bitmap_git, &object->oid);
> -
> -		if (pos < 0 || base == NULL || !bitmap_get(base, pos)) {
> -			object->flags &= ~UNINTERESTING;
> -			add_pending_object(revs, object, "");
> -			needs_walk = 1;
> -		} else {
> -			object->flags |= SEEN;
> -		}
> -	}
> -
> -	if (needs_walk) {
> -		struct include_data incdata;
> -		struct bitmap_show_data show_data;
> -
> -		if (base == NULL)
> -			base = bitmap_new();
> -
> -		incdata.bitmap_git = bitmap_git;
> -		incdata.base = base;
> -		incdata.seen = seen;
> -
> -		revs->include_check = should_include;
> -		revs->include_check_data = &incdata;
> -
> -		if (prepare_revision_walk(revs))
> -			die("revision walk setup failed");
> +	khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid);
> +	if (pos < kh_end(bitmap_git->bitmaps)) {
> +		struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
> +		struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
>
> -		show_data.bitmap_git = bitmap_git;
> -		show_data.base = base;
> +		if (*base == NULL)
> +			*base = ewah_to_bitmap(or_with);
> +		else
> +			bitmap_or_ewah(*base, or_with);
>
> -		traverse_commit_list(revs, show_commit, show_object,
> -				     &show_data);
> +		return 1;
>  	}
> -
> -	return base;
> +	return 0;
>  }
>
>  static void show_extended_objects(struct bitmap_index *bitmap_git,
> @@ -665,24 +509,122 @@ static void show_objects_for_type(
>  	}
>  }
>
> -static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
> -			     struct object_list *roots)
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +				 struct object *obj,
> +				 struct bitmap *bitmap,
> +				 struct bitmap *seen,
> +				 int ignore_missing);
> +
> +static void add_tag_to_bitmap(struct bitmap_index *bitmap_git,
> +			      struct tag *tag,
> +			      struct bitmap *bitmap,
> +			      struct bitmap *seen,
> +			      int ignore_missing)
> +{
> +	if (parse_tag(tag) < 0 || !tag->tagged) {
> +		if (ignore_missing)
> +			return;
> +		die("unable to parse tag %s", oid_to_hex(&tag->object.oid));
> +	}
> +	add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen,
> +			     ignore_missing);
> +}
> +
> +static void add_tree_to_bitmap(struct bitmap_index *bitmap_git,
> +			       struct tree *tree,
> +			       struct bitmap *bitmap,
> +			       struct bitmap *seen,
> +			       int ignore_missing)
>  {
> -	while (roots) {
> -		struct object *object = roots->item;
> -		roots = roots->next;
> +	struct tree_desc desc;
> +	struct name_entry entry;
>
> -		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> -			return 1;
> +	if (parse_tree(tree) < 0) {
> +		if (ignore_missing)
> +			return;
> +		die("unable to parse tree %s", oid_to_hex(&tree->object.oid));
>  	}
>
> -	return 0;
> +	init_tree_desc(&desc, tree->buffer, tree->size);
> +	while (tree_entry(&desc, &entry)) {
> +		if (S_ISDIR(entry.mode)) {
> +			struct tree *t = lookup_tree(the_repository, &entry.oid);
> +			if (!t) {
> +				die(_("entry '%s' in tree %s has tree mode, "
> +				      "but is not a tree"),
> +				    entry.path, oid_to_hex(&tree->object.oid));
> +			}
> +			add_object_to_bitmap(bitmap_git, &t->object,
> +					     bitmap, seen, ignore_missing);
> +		} else if (!S_ISGITLINK(entry.mode)) {
> +			struct blob *b = lookup_blob(the_repository, &entry.oid);
> +			if (!b) {
> +				die(_("entry '%s' in tree %s has blob mode, "
> +				      "but is not a blob"),
> +				    entry.path, oid_to_hex(&tree->object.oid));
> +			}
> +			add_object_to_bitmap(bitmap_git, &b->object,
> +					     bitmap, seen, ignore_missing);
> +		}
> +	}
> +
> +	free_tree_buffer(tree);
> +}
> +
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +				 struct object *obj,
> +				 struct bitmap *bitmap,
> +				 struct bitmap *seen,
> +				 int ignore_missing)
> +{
> +	int pos = bitmap_position(bitmap_git, &obj->oid);
> +
> +	if (pos < 0)
> +		pos = ext_index_add_object(bitmap_git, obj, NULL);
> +
> +	if (bitmap_get(bitmap, pos))
> +		return; /* already there */
> +
> +	if (seen && bitmap_get(seen, pos))
> +		return; /* seen in other bitmap; not worth digging further */
> +
> +	bitmap_set(bitmap, pos);
> +
> +	if (obj->type == OBJ_BLOB)
> +		return;
> +	else if (obj->type == OBJ_TAG)
> +		add_tag_to_bitmap(bitmap_git, (struct tag *)obj,
> +				  bitmap, seen, ignore_missing);
> +	else if (obj->type == OBJ_TREE)
> +		add_tree_to_bitmap(bitmap_git, (struct tree *)obj,
> +				   bitmap, seen, ignore_missing);
> +	else
> +		BUG("unexpected object type: %d", obj->type);
> +}
> +
> +static void add_objects_to_bitmap(struct bitmap_index *bitmap_git,
> +				  struct object_list **list,
> +				  struct bitmap *bitmap,
> +				  struct bitmap *seen,
> +				  int ignore_missing)
> +{
> +	while (*list) {
> +		struct object_list *cur = *list;
> +
> +		add_object_to_bitmap(bitmap_git, cur->item,
> +				     bitmap, seen, ignore_missing);
> +
> +		*list = cur->next;
> +		free(cur);
> +	}
>  }
>
>  struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>  {
>  	unsigned int i;
>
> +	struct prio_queue commits = { compare_commits_by_commit_date };
> +
>  	struct object_list *wants = NULL;
>  	struct object_list *haves = NULL;
>
> @@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>  			object = parse_object_or_die(&tag->tagged->oid, NULL);
>  		}
>
> -		if (object->flags & UNINTERESTING)
> -			object_list_insert(object, &haves);
> -		else
> -			object_list_insert(object, &wants);
> +		if (object->type == OBJ_COMMIT)
> +			prio_queue_put(&commits, object);
> +		else {
> +			if (object->flags & UNINTERESTING)
> +				object_list_insert(object, &haves);
> +			else
> +				object_list_insert(object, &wants);
> +		}
>  	}
>
> -	/*
> -	 * if we have a HAVES list, but none of those haves is contained
> -	 * in the packfile that has a bitmap, we don't have anything to
> -	 * optimize here
> -	 */
> -	if (haves && !in_bitmapped_pack(bitmap_git, haves))
> -		goto cleanup;
> -
> -	/* if we don't want anything, we're done here */
> -	if (!wants)
> -		goto cleanup;
> -
>  	/*
>  	 * now we're going to use bitmaps, so load the actual bitmap entries
>  	 * from disk. this is the point of no return; after this the rev_list
> @@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>
>  	object_array_clear(&revs->pending);
>
> -	if (haves) {
> -		revs->ignore_missing_links = 1;
> -		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
> -		reset_revision_walk();
> -		revs->ignore_missing_links = 0;
> +	haves_bitmap = bitmap_new();
> +	wants_bitmap = bitmap_new();
>
> -		if (haves_bitmap == NULL)
> -			BUG("failed to perform bitmap walk");
> -	}
> +	/*
> +	 * First traverse the relevant commits as we would for a normal
> +	 * traversal.
> +	 */
> +	while (commits.nr) {
> +		struct commit *commit = prio_queue_get(&commits);
> +		struct bitmap **dst_bitmap;
> +		int pos;
> +		struct commit_list *parent;
> +
> +		if (commit->object.flags & UNINTERESTING)
> +			dst_bitmap = &haves_bitmap;
> +		else
> +			dst_bitmap = &wants_bitmap;
>
> -	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
> +		pos = bitmap_position(bitmap_git, &commit->object.oid);
> +		if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos))
> +			continue; /* already covered */
>
> -	if (!wants_bitmap)
> -		BUG("failed to perform bitmap walk");
> +		if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit))
> +			continue; /* skip adding parents, they're covered */
> +
> +
> +		/* otherwise mark ourselves and queue dependent objects */
> +		if (pos < 0)
> +			pos = ext_index_add_object(bitmap_git, &commit->object, NULL);
> +		bitmap_set(*dst_bitmap, pos);
> +		if (parse_commit(commit)) {
> +			if (commit->object.flags & UNINTERESTING)
> +				continue;
> +			die("unable to parse commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +		}
> +		if (commit->object.flags & UNINTERESTING) {
> +			/*
> +			 * If an uninteresting commit is in the "wants" bitmap,
> +			 * then our tree is worth exploring. This means we may
> +			 * miss some trees in the "haves" that are not
> +			 * ancestors of "wants" (or that are, but we missed
> +			 * because of out-of-order timestamps).
> +			 */
> +			if (wants_bitmap && bitmap_get(wants_bitmap, pos))
> +				object_list_insert(&get_commit_tree(commit)->object,
> +						   &haves);
> +		} else {
> +			/*
> +			 * "wants" must always be explored
> +			 */
> +			object_list_insert(&get_commit_tree(commit)->object,
> +					   &wants);
> +		}
> +
> +		for (parent = commit->parents; parent; parent = parent->next) {
> +			if (commit->object.flags & UNINTERESTING)
> +				parent->item->object.flags |= UNINTERESTING;
> +			prio_queue_put(&commits, parent->item);
> +		}
> +	}
> +
> +	/*
> +	 * At this point we've processed all of the commits, and queued items
> +	 * in "haves" and "wants" that need to be marked.
> +	 */
> +	add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1);
> +	add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 0);
>
>  	if (haves_bitmap)
>  		bitmap_and_not(wants_bitmap, haves_bitmap);




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux