Re: [PATCH v7 21/22] gc: automatically write commit-graph files

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 27 Jun 2018 14:24:09 -0400




On 6/27/2018 2:09 PM, Junio C Hamano wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:

@@ -40,6 +41,7 @@ static int aggressive_depth = 50;
  static int aggressive_window = 250;
  static int gc_auto_threshold = 6700;
  static int gc_auto_pack_limit = 50;
+static int gc_write_commit_graph = 0;
Please avoid unnecessary (and undesirable) explicit initialization
to 0.  Instead, let BSS handle it by leaving " = 0" out.
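
For reference, a minimal sketch of the point above: in C, an object with
static storage duration and no initializer is zero-initialized, so the
" = 0" adds nothing. The variable names mirror the hunk; check_gc_defaults()
is a hypothetical helper added only so the values can be inspected, and is
not part of the patch.

```c
#include <assert.h>

/* Non-zero default: the initializer is required. */
static int gc_auto_pack_limit = 50;

/* No initializer: static storage is zero-initialized by the C standard,
 * and such objects are typically placed in BSS rather than in the
 * initialized data section. */
static int gc_write_commit_graph;

/* Hypothetical helper (not in the patch): returns 1 if both defaults
 * hold the values we expect before any config is read. */
int check_gc_defaults(void)
{
	assert(gc_auto_pack_limit == 50);
	assert(gc_write_commit_graph == 0);
	return gc_auto_pack_limit == 50 && gc_write_commit_graph == 0;
}
```
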

+test_expect_success 'check that gc computes commit-graph' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit --allow-empty -m "blank" &&
+	git commit-graph write --reachable &&
+	cp $objdir/info/commit-graph commit-graph-before-gc &&
+	git reset --hard HEAD~1 &&
+	git config gc.writeCommitGraph true &&
+	git gc &&
+	cp $objdir/info/commit-graph commit-graph-after-gc &&
+	! test_cmp commit-graph-before-gc commit-graph-after-gc &&
The set of commits in the commit graph will change by discarding the
(old) tip commit, so the fact that the contents of the file changed
across gc proves that "commit-graph write" was triggered during gc.

Come to think of it, do we promise to end-users (in docs etc.) that
commit-graph covers *ONLY* commits reachable from refs and HEAD?  I
am wondering what should happen if "git gc" here does not prune the
reflog for HEAD---wouldn't we want to reuse the properties of the
commit we are discarding when it comes back (e.g. you push, then
reset back, and the next pull makes it reappear in your repository)?

Today I learned that 'gc' keeps some of the reflog around. That makes 
sense, but I wouldn't optimize the commit-graph file for this scenario.

I guess what I am really questioning is if it is sensible to define
"--reachable" as "starting at all refs", unlike the usual connectivity
rules "gc" uses, especially when this is run from inside "gc".

It is sensible to me, especially because we only lose performance if we 
visit those other commits that are still in the object database. By 
writing the commit-graph on 'gc' and not during 'fetch', we are already 
assuming the commit-graph will usually be behind the set of commits that 
the user cares about, by design.

There is another angle on this decision that I need help with from
others who know more than I do: in fetch negotiation, does the client
report commits in the reflog as 'have's, or do they get re-downloaded
on a subsequent 'git pull'?


+	git commit-graph write --reachable &&
+	test_cmp commit-graph-after-gc $objdir/info/commit-graph
This says that running "commit-graph write" twice without changing
the topology MUST yield byte-for-byte identical commit-graph file.

Is that a reasonable requirement on the future implementation?  I am
wondering if there will arise a situation where you need to store
records in "some" fixed order but two records compare "the same" and
tie-breaking them to give stable sort is expensive, or something
like that, which would benefit if you leave an escape hatch to allow
two logically identical graphs expressed bitwise differently.

Since the file format allows flexibility in the order of the chunks, it 
is possible to have bitwise-different files that represent the same set 
of data. However, I would not want git to provide inconsistent output 
given the same set of commits covered by the file.
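
To illustrate what I mean by consistent output, here is a rough sketch of
one way a writer could stay deterministic even though the format permits
any chunk order: put the chunks into a single canonical order before
serializing. This is only an assumption for illustration, not a description
of how the current writer is implemented. struct chunk_desc and
canonicalize_chunks() are hypothetical names; the IDs in the comment are
the four-byte chunk IDs ('OIDF', 'OIDL', 'CDAT') packed big-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory descriptor for one chunk; not git's struct. */
struct chunk_desc {
	uint32_t id;   /* e.g. 0x43444154 for 'CDAT' */
	uint64_t size; /* payload size in bytes */
};

/* Order chunks by numeric ID. Chunk IDs are unique within a file,
 * so no tie-breaking is needed for the result to be deterministic. */
static int chunk_desc_cmp(const void *a, const void *b)
{
	const struct chunk_desc *x = a;
	const struct chunk_desc *y = b;

	if (x->id != y->id)
		return x->id < y->id ? -1 : 1;
	return 0;
}

/* Sort the descriptors in place so that the same set of chunks is
 * always written in the same order, whatever order they arrived in. */
void canonicalize_chunks(struct chunk_desc *chunks, size_t nr)
{
	assert(chunks || !nr);
	if (nr > 1)
		qsort(chunks, nr, sizeof(*chunks), chunk_desc_cmp);
}
```

With something like this, two writers that collect the same chunks in
different orders would still emit them in the same sequence, which is the
property the test_cmp in the test relies on.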

Thanks,
-Stolee