[PATCH v4 0/6] Speed up connectivity checks

Patrick Steinhardt <ps@xxxxxx> · Thu, 5 Aug 2021 13:25:19 +0200

Hi,

this is version 4 of my series to speed up connectivity checks. Given
that v3 has received positive feedback, I finally stuck to the approach
and only have a bunch of iterative changes based on your feedback.

Changes compared to v3:

    - Patch 1/6 is new and splits up revs->no_walk into revs->no_walk
      which now only indicates whether to walk and revs->unserted_input,
      which indicates whether to sort input.

    - Patch 2/6 got some documentation for the new `--unsorted-input`
      option. Furthermore, we refuse `--no-walk`/`--no-walk=sorted` and
      `--unsorted-input` if used together. I've also added some tests
      for the new option.

    - Patch 3/6 has an updated commit message, detailing that
      `add_pending_oid()` only is a thin wrapper around
      `add_pending_object()`.

    - Patch 4/6 has an update commit message, stating that it's a
      prerequisite for 6/6.

    - Patch 5/6 is new, splitting out the logic to find a commit ID in
      the commit graph as a prerequisite for 6/6.

    - Patch 6/6 now also verifies that commits parsed only via the
      commit-graph exist in the ODB. I've also renamed
      `find_object_in_graph()` to `parse_commit_in_graph_gently()` to
      better reflect what the function does.

With the added existence check in 6/6, the speedup is not as big as
before (1.47x faster instead of 1.55x). But it's still very much worth
it. In total, this patch series decreases `git rev-list --objects
--unsorted --not --all --not $newrev` from 7.6s to 3.0s in my test
repository.

Thanks for all your feedback!

Patrick

Patrick Steinhardt (6):
  revision: separate walk and unsorted flags
  connected: do not sort input revisions
  revision: stop retrieving reference twice
  revision: avoid loading object headers multiple times
  commit-graph: split out function to search commit position
  revision: avoid hitting packfiles when commits are in commit-graph

 Documentation/rev-list-options.txt |  8 +++-
 builtin/log.c                      |  2 +-
 builtin/revert.c                   |  3 +-
 commit-graph.c                     | 75 +++++++++++++++++++++---------
 commit-graph.h                     |  7 +++
 connected.c                        |  1 +
 revision.c                         | 52 +++++++++++++++++----
 revision.h                         |  7 +--
 t/t6000-rev-list-misc.sh           | 38 +++++++++++++++
 9 files changed, 153 insertions(+), 40 deletions(-)

Range-diff against v3:
-:  ---------- > 1:  67232910ac revision: separate walk and unsorted flags
1:  1fd83f726a ! 2:  9d7f484907 connected: do not sort input revisions
    @@ Commit message

         Signed-off-by: Patrick Steinhardt <ps@xxxxxx>

    + ## Documentation/rev-list-options.txt ##
    +@@ Documentation/rev-list-options.txt: list of the missing objects.  Object IDs are prefixed with a ``?'' character.
    + 	objects.
    + endif::git-rev-list[]
    + 
    ++--unsorted-input::
    ++	Show commits in the order they were given on the command line instead
    ++	of sorting them in reverse chronological order by commit time. Cannot
    ++	be combined with `--no-walk` or `--no-walk=sorted`.
    ++
    + --no-walk[=(sorted|unsorted)]::
    + 	Only show the given commits, but do not traverse their ancestors.
    + 	This has no effect if a range is specified. If the argument
    +@@ Documentation/rev-list-options.txt: endif::git-rev-list[]
    + 	given on the command line. Otherwise (if `sorted` or no argument
    + 	was given), the commits are shown in reverse chronological order
    + 	by commit time.
    +-	Cannot be combined with `--graph`.
    ++	Cannot be combined with `--graph`. Cannot be combined with
    ++	`--unsorted-input` if `sorted` or no argument was given.
    + 
    + --do-walk::
    + 	Overrides a previous `--no-walk`.
    +
      ## connected.c ##
     @@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
      	if (opt->progress)
    @@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, cons
      		revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
      		revs->topo_order = 1;
     +	} else if (!strcmp(arg, "--unsorted-input")) {
    ++		if (revs->no_walk && !revs->unsorted_input)
    ++			die(_("--unsorted-input is incompatible with --no-walk and --no-walk=sorted"));
     +		revs->unsorted_input = 1;
      	} else if (!strcmp(arg, "--early-output")) {
      		revs->early_output = 100;
      		revs->topo_order = 1;
    -@@ revision.c: int prepare_revision_walk(struct rev_info *revs)
    - 
    - 	if (!revs->reflog_info)
    - 		prepare_to_use_bloom_filter(revs);
    --	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
    -+	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
    - 		commit_list_sort_by_date(&revs->commits);
    - 	if (revs->no_walk)
    - 		return 0;
    +@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
    + 	} else if (!strcmp(arg, "--not")) {
    + 		*flags ^= UNINTERESTING | BOTTOM;
    + 	} else if (!strcmp(arg, "--no-walk")) {
    ++		if (revs->unsorted_input)
    ++			die(_("--no-walk is incompatible with --no-walk=unsorted and --unsorted-input"));
    + 		revs->no_walk = 1;
    + 	} else if (skip_prefix(arg, "--no-walk=", &optarg)) {
    + 		/*
    +@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
    + 		 * not allowed, since the argument is optional.
    + 		 */
    + 		revs->no_walk = 1;
    +-		if (!strcmp(optarg, "sorted"))
    ++		if (!strcmp(optarg, "sorted")) {
    ++			if (revs->unsorted_input)
    ++				die(_("--no-walk=sorted is incompatible with --no-walk=unsorted "
    ++				    "and --unsorted-input"));
    + 			revs->unsorted_input = 0;
    +-		else if (!strcmp(optarg, "unsorted"))
    ++		} else if (!strcmp(optarg, "unsorted"))
    + 			revs->unsorted_input = 1;
    + 		else
    + 			return error("invalid argument to --no-walk");

    - ## revision.h ##
    -@@ revision.h: struct rev_info {
    - 			simplify_history:1,
    - 			show_pulls:1,
    - 			topo_order:1,
    -+			unsorted_input:1,
    - 			simplify_merges:1,
    - 			simplify_by_decoration:1,
    - 			single_worktree:1,
    + ## t/t6000-rev-list-misc.sh ##
    +@@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' '
    + 	test_line_count = $count actual
    + '
    + 
    ++test_expect_success 'rev-list --unsorted-input results in different sorting' '
    ++	git rev-list --unsorted-input HEAD HEAD~ >first &&
    ++	git rev-list --unsorted-input HEAD~ HEAD >second &&
    ++	! test_cmp first second &&
    ++	sort first >first.sorted &&
    ++	sort second >second.sorted &&
    ++	test_cmp first.sorted second.sorted
    ++'
    ++
    ++test_expect_success 'rev-list --unsorted-input compatible with --no-walk=unsorted' '
    ++	git rev-list --unsorted-input --no-walk=unsorted HEAD HEAD~ >actual &&
    ++	git rev-parse HEAD >expect &&
    ++	git rev-parse HEAD~ >>expect &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'rev-list --unsorted-input incompatible with --no-walk=sorted' '
    ++	cat >expect <<-EOF &&
    ++		fatal: --no-walk is incompatible with --no-walk=unsorted and --unsorted-input
    ++	EOF
    ++	test_must_fail git rev-list --unsorted-input --no-walk HEAD 2>error &&
    ++	test_cmp expect error &&
    ++
    ++	cat >expect <<-EOF &&
    ++		fatal: --no-walk=sorted is incompatible with --no-walk=unsorted and --unsorted-input
    ++	EOF
    ++	test_must_fail git rev-list --unsorted-input --no-walk=sorted HEAD 2>error &&
    ++	test_cmp expect error &&
    ++
    ++	cat >expect <<-EOF &&
    ++		fatal: --unsorted-input is incompatible with --no-walk and --no-walk=sorted
    ++	EOF
    ++	test_must_fail git rev-list --no-walk --unsorted-input HEAD 2>error &&
    ++	test_cmp expect error &&
    ++	test_must_fail git rev-list --no-walk=sorted --unsorted-input HEAD 2>error &&
    ++	test_cmp expect error
    ++'
    ++
    + test_done
2:  db85480649 ! 3:  d8e63d0943 revision: stop retrieving reference twice
    @@ Commit message
         revision: stop retrieving reference twice

         When queueing up references for the revision walk, `handle_one_ref()`
    -    will resolve the reference's object ID and then queue the ID as pending
    -    object via `add_pending_oid()`. But `add_pending_oid()` will again try
    -    to resolve the object ID to an object, effectively duplicating the work
    -    its caller already did before.
    +    will resolve the reference's object ID via `get_reference()` and then
    +    queue the ID as pending object via `add_pending_oid()`. But given that
    +    `add_pending_oid()` is only a thin wrapper around `add_pending_object()`
    +    which fist calls `get_reference()`, we effectively resolve the reference
    +    twice and thus duplicate some of the work.

    -    Fix the issue by instead calling `add_pending_object()`, which takes the
    -    already-resolved object as input. In a repository with lots of refs,
    -    this translates in a nearly 10% speedup:
    +    Fix the issue by instead calling `add_pending_object()` directly, which
    +    takes the already-resolved object as input. In a repository with lots of
    +    refs, this translates into a near 10% speedup:

             Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
               Time (mean ± σ):      5.015 s ±  0.038 s    [User: 4.698 s, System: 0.316 s]
3:  b9897e102a ! 4:  ba8df5cad0 revision: avoid loading object headers multiple times
    @@ Commit message
         either a tag or a commit, we'd have a modest increase in memory
         consumption of about 12.5% here.

    +    Note that on its own, this patch may not seem like a clear win. But it
    +    is a prerequisite for the following patch, which will result in another
    +    37% speedup.
    +
         Signed-off-by: Patrick Steinhardt <ps@xxxxxx>

      ## revision.c ##
-:  ---------- > 5:  e33cd51ebf commit-graph: split out function to search commit position
4:  f6fc2a5e6d ! 6:  900c5a9c60 revision: avoid hitting packfiles when commits are in commit-graph
    @@ Commit message
         directly fill in the commit object, otherwise we can still hit the disk
         to determine the object's type.

    -    Expose a new function `find_object_in_graph()`, which fills in an object
    -    of unknown type in case we find its object ID in the graph. This
    -    provides a big performance win in cases where there is a commit-graph
    -    available in the repository in case we load lots of references. The
    -    following has been executed in a real-world repository with about 2.2
    -    million refs:
    +    Expose a new function `parse_commit_in_graph_gently()`, which fills in
    +    an object of unknown type in case we find its object ID in the graph.
    +    This provides a big performance win in cases where there is a
    +    commit-graph available in the repository in case we load lots of
    +    references. The following has been executed in a real-world repository
    +    with about 2.2 million refs:

             Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
    -          Time (mean ± σ):      4.465 s ±  0.037 s    [User: 4.144 s, System: 0.320 s]
    -          Range (min … max):    4.411 s …  4.514 s    10 runs
    +          Time (mean ± σ):      4.508 s ±  0.039 s    [User: 4.131 s, System: 0.377 s]
    +          Range (min … max):    4.455 s …  4.576 s    10 runs

             Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
    -          Time (mean ± σ):      2.886 s ±  0.032 s    [User: 2.570 s, System: 0.316 s]
    -          Range (min … max):    2.826 s …  2.933 s    10 runs
    +          Time (mean ± σ):      3.072 s ±  0.031 s    [User: 2.707 s, System: 0.365 s]
    +          Range (min … max):    3.040 s …  3.144 s    10 runs

             Summary
               'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
    -            1.55 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
    +            1.47 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

         Signed-off-by: Patrick Steinhardt <ps@xxxxxx>

      ## commit-graph.c ##
    -@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
    - 	return 1;
    +@@ commit-graph.c: static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
    + 	}
      }

    -+static int find_object_id_in_graph(const struct object_id *id, struct commit_graph *g, uint32_t *pos)
    -+{
    -+	struct commit_graph *cur_g = g;
    -+	uint32_t lex_index;
    -+
    -+	while (cur_g && !bsearch_graph(cur_g, (struct object_id *)id, &lex_index))
    -+		cur_g = cur_g->base_graph;
    -+
    -+	if (cur_g) {
    -+		*pos = lex_index + cur_g->num_commits_in_base;
    -+		return 1;
    -+	}
    -+
    -+	return 0;
    -+}
    -+
    -+int find_object_in_graph(struct repository *repo, struct object *object)
    ++int parse_commit_in_graph_gently(struct repository *repo, struct object *object)
     +{
     +	struct commit *commit;
     +	uint32_t pos;
    @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
     +	if (!repo->objects->commit_graph)
     +		return -1;
     +
    -+	if (!find_object_id_in_graph(&object->oid, repo->objects->commit_graph, &pos))
    ++	if (!search_commit_pos_in_graph(&object->oid, repo->objects->commit_graph, &pos))
     +		return -1;
     +
     +	commit = object_as_type(object, OBJ_COMMIT, 1);
    @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
     +	return 0;
     +}
     +
    - static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
    - {
    - 	uint32_t graph_pos = commit_graph_position(item);
    -@@ commit-graph.c: static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
    - 		*pos = graph_pos;
    - 		return 1;
    - 	} else {
    --		struct commit_graph *cur_g = g;
    --		uint32_t lex_index;
    --
    --		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
    --			cur_g = cur_g->base_graph;
    --
    --		if (cur_g) {
    --			*pos = lex_index + cur_g->num_commits_in_base;
    --			return 1;
    --		}
    --
    --		return 0;
    -+		return find_object_id_in_graph(&item->object.oid, g, pos);
    - 	}
    - }
    - 
    + static int parse_commit_in_graph_one(struct repository *r,
    + 				     struct commit_graph *g,
    + 				     struct commit *item)

      ## commit-graph.h ##
    -@@ commit-graph.h: int write_commit_graph(struct object_directory *odb,
    - 		       enum commit_graph_write_flags flags,
    - 		       const struct commit_graph_opts *opts);
    +@@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
    +  */
    + int parse_commit_in_graph(struct repository *r, struct commit *item);

    -+int find_object_in_graph(struct repository *repo, struct object *object);
    ++/*
    ++ * Given an object of unknown type, try to fill in the object in case it is a
    ++ * commit part of the commit-graph. Returns 0 if the object is a parsed commit
    ++ * or if it could be filled in via the commit graph, otherwise it returns -1.
    ++ */
    ++int parse_commit_in_graph_gently(struct repository *repo, struct object *object);
     +
    - #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
    - 
    - int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
    + /*
    +  * It is possible that we loaded commit contents from the commit buffer,
    +  * but we also want to ensure the commit-graph content is correctly

      ## revision.c ##
     @@ revision.c: static struct object *get_reference(struct rev_info *revs, const char *name,
    @@ revision.c: static struct object *get_reference(struct rev_info *revs, const cha
      	if (object->type == OBJ_NONE) {
     -		int type = oid_object_info(revs->repo, oid, NULL);
     -		if (type < 0 || !object_as_type(object, type, 1)) {
    --			object = NULL;
    --			goto out;
    -+		if (find_object_in_graph(revs->repo, object) < 0) {
    ++		/*
    ++		 * It's likely that the reference points to a commit, so we
    ++		 * first try to look it up via the commit-graph. If successful,
    ++		 * then we know it's a commit and don't have to unpack the
    ++		 * object header. We still need to assert that the object
    ++		 * exists, but given that we don't request any info about the
    ++		 * object this is a lot faster than `oid_object_info()`.
    ++		 */
    ++		if (parse_commit_in_graph_gently(revs->repo, object) < 0) {
     +			int type = oid_object_info(revs->repo, oid, NULL);
     +			if (type < 0 || !object_as_type(object, type, 1)) {
     +				object = NULL;
     +				goto out;
     +			}
    ++		} else if (!repo_has_object_file(revs->repo, oid)) {
    + 			object = NULL;
    + 			goto out;
      		}
    - 	}
    - 
-- 
2.32.0

Attachment:
signature.asc

Description: PGP signature