Re: [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 21 Feb 2020 23:45:51 +0100

"Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

This is a second part of my response, focusing solely on tests of the
Bloom filters feature.

[...]
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" commit_metadata");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
> +	if (graph->chunk_bloom_indexes)
> +		printf(" bloom_indexes");
> +	if (graph->chunk_bloom_data)
> +		printf(" bloom_data");
>  	printf("\n");
>

All right, that is simple extension of 'test-helper read-graph'.

>  	UNLEAK(graph);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..19eca1864b
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='git log for a path with bloom filters'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> +	git init &&
> +	mkdir A A/B A/B/C &&
> +	test_commit c1 A/file1 &&
> +	test_commit c2 A/B/file2 &&
> +	test_commit c3 A/B/C/file3 &&
> +	test_commit c4 A/file1 &&
> +	test_commit c5 A/B/file2 &&
> +	test_commit c6 A/B/C/file3 &&
> +	test_commit c7 A/file1 &&
> +	test_commit c8 A/B/file2 &&
> +	test_commit c9 A/B/C/file3 &&
> +	git checkout -b side HEAD~4 &&
> +	test_commit side-1 file4 &&
> +	git checkout master &&
> +	git merge side &&
> +	test_commit c10 file5 &&

Unfortunately this might be not enough for Git's heuristic similarity
based rename detection, as it creates 'file5' file with content 'c10'.

[Checking something].  Well, actually it looks like it works, even with
not much contents.  I thought you would need to use something like

  +	test_write_lines 1 2 3 4 5 6 7 8 9 >file5 &&
  +	git add file5 &&
  +	git commit -m c10 &&

But it turns out that it is, s far as I have checked, not necessary.

> +	mv file5 file5_renamed &&
> +	git add file5_renamed &&
> +	git commit -m "rename" &&
> +	git commit-graph write --reachable --changed-paths
> +'

Hmmm... there is no test for file that was present in history but got
deleted.  Might be important (because of pre-image vs post-image name
issues).

Very minor issue: following the style used in t/test-lib-functions.sh
and the style guide in CodingGuidelines, it should be

  +graph_read_expect () {

and the same for the following functions.

https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L144

 - We prefer a space between the function name and the parentheses,
   and no space inside the parentheses. The opening "{" should also
   be on the same line.

	(incorrect)
	my_function(){
		...

	(correct)
	my_function () {
		...

> +graph_read_expect() {
> +	OPTIONAL=""
> +	NUM_CHUNKS=5
> +	cat >expect <<- EOF
> +	header: 43475048 1 1 $NUM_CHUNKS 0
> +	num_commits: $1
> +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data

Either OPTIONAL remains unused, and should be removed, or we leave it
for possible future extension, and we write

  +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL

like in t/t5318-commit-graph.sh.

> +	EOF
> +	test-tool read-graph >output &&
> +	test_cmp expect output

Why 'output', and not 'actual'?

> +}
> +
> +test_expect_success 'commit-graph write wrote out the bloom chunks' '
> +	graph_read_expect 13
> +'

All right, that is sanity-checking 'git commit-graph write --changed-paths'.

> +
> +setup() {

I wonder if we can come up with a better name... setup_log(),
setup_log_bloom(), log_compare()?

> +	rm output

This shouldn't be here, in this function.  Or perhaps it shouldn't even
be used at all; having 'output' doesn't hinder anything.

> +	rm "$TRASH_DIRECTORY/trace.perf"

All right, this cleanup is needed.

> +	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
> +	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom

All right, we prepare for comparing version without Bloom filters
(reference) and with Bloom filters, and for checking if Bloom filters
were used.

> +}

This setup() function above is missing the && chain.

It should then in my opinion read:

  +setup () {
  +	rm "$TRASH_DIRECTORY/trace.perf" &&
  +	git -c core.commitGraph=false log --format="%s" $1 >log_wo_bloom &&
  +	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" \
  +	git -c core.commitGraph=true log --format="%s" $1 >log_w_bloom
  +}

Also, perhaps we should add at the beginning of this test file, outside
anu test_expect_success block, the following (see t/*trace2*.sh files):

  # Turn off any inherited trace2 settings for this test.
  sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT
  sane_unset GIT_TRACE2_PERF_BRIEF
  sane_unset GIT_TRACE2_CONFIG_PARAMS

> +
> +test_bloom_filters_used() {
> +	log_args=$1
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
> +	setup "$log_args"

Missing && chain.

> +	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom

Why no line break after &&?

> +}

Ugh, examining JSON output with regexp is in my opinion quite fragile.
Though I am not sure if requiring Perl and JSON module installed like
t/t0212-trace2-event.sh is any better.

> +
> +test_bloom_filters_not_used() {
> +	log_args=$1
> +	setup "$log_args"
> +	!(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom

We should also check that "$TRASH_DIRECTORY/trace.perf" file exist with
test_path_is_file.

Also, testing that something was not found is a bit fragile, but I don't
have any better idea on how to do this test without negating grep exit
value.

> +}
> +
> +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed

NOTE: file5 is missing from this list!

I suspect that adding it might cause the test to fail.

> +do
> +	for option in "" \
> +		      "--full-history" \
> +		      "--full-history --simplify-merges" \
> +		      "--simplify-merges" \
> +		      "--simplify-by-decoration" \
> +		      "--follow" \
> +		      "--first-parent" \
> +		      "--topo-order" \
> +		      "--date-order" \
> +		      "--author-date-order" \
> +		      "--ancestry-path side..master"
> +	do
> +		test_expect_success "git log option: $option for path: $path" '
> +			test_bloom_filters_used "$option -- $path"

All right, this tests that Bloom filters were used *and* that the
command run with Bloom filters and without Bloom filters (without
commit-graph) produces the same output.

> +		'
> +	done
> +done
> +
> +test_expect_success 'git log -- folder works with and without the trailing slash' '
> +	test_bloom_filters_used "-- A" &&
> +	test_bloom_filters_used "-- A/"
> +'

All right.

I wonder if we should test for insane test case, namely pathname to an
ordinary file that ends with slash:

  +	test_bloom_filters_used "-- file4" &&
  +	test_bloom_filters_used "-- file4/"

The latter should produce no output, being treated as not existing file.

> +
> +test_expect_success 'git log for path that does not exist. ' '
> +	test_bloom_filters_used "-- path_does_not_exist"
> +'

All right.

> +
> +test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
> +	test_bloom_filters_not_used "--walk-reflogs -- A"
> +'

All right, but why is it so?

> +
> +test_expect_success 'git log -- multiple path specs does not use bloom filters' '
> +	test_bloom_filters_not_used "-- file4 A/file1"
> +'

All right, though this is limitation of current code, not limitation of
technique, so _maybe_ it would be better to test_expect_failure that for
multiple pathspecs bloom_filters_used...

> +
> +test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
> +	test_bloom_filters_used "-- *4" &&
> +	test_bloom_filters_used "-- *renamed"
> +'
> +
> +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
> +	test_bloom_filters_not_used "-- *" &&
> +	test_bloom_filters_not_used "-- file*"
> +'

Same here.

> +
> +test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
> +	test_commit c14 A/anotherFile2 &&
> +	test_commit c15 A/B/anotherFile2 &&
> +	test_commit c16 A/B/C/anotherFile2 &&
> +	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
> +	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
> +	test_bloom_filters_not_used "-- A/B"
> +'

All right... though I would try to come up with a shorter test name :-)

> +
> +test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
> +	test_commit c17 A/anotherFile3 &&
> +	git commit-graph write --reachable --changed-paths --split &&
> +	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_bloom_filters_used_when_some_filters_are_missing() {
> +	log_args=$1
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"

Perhaps a better solution would be to use (enhanced) 'test-tool bloom'
to check which commits have Bloom filters and which do not.

> +	setup "$log_args"
> +	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
> +}

Why broken && chain between setup() and the resr, and why && is not
followed by line break (as before)?

> +
> +test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
> +	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
> +'
> +
> +test_done

All right... though the description of this test is a bit long.

Thank you for your work on this series.

Best,
-- 
Jakub Narębski