Re: [PATCH 1/3] revision: complicated pathspecs disable filters

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 15 Apr 2020 14:37:40 -0400

On 4/15/2020 9:27 AM, Derrick Stolee wrote:
>
> I'm not against the idea. Logically, collapsing case before hashing
> the Bloom keys should not increase the probabilities of false
> positives except in the situations where we have case conflicts.
> There is a small cost in the pre-hashing step to change the case of
> the paths, but that should be much lower than the cost of the hash
> itself and the tree parsing to find the changed paths.
> 
>> It all depends on how often people want :(icase) pathspec matches in
>> the history, I suspect.  My point was that we need to declare that
>> :(icase) won't matter in real life (hence we won't optimize our data
>> to support that use case), before the way in which the data stored
>> in the graph file is computed is cast in stone.
> 
> My earlier statement can be summarized as "we could make this happen"
> and you ask here "is it worth doing?"
> 
> I will play around with how complicated the change would be while
> the community considers the "is it worth doing?" question.

I played a bit, and it wasn't terrible to cover everything by
adjusting fill_bloom_key(). I also added a test that we will want
anyways.

There is more work to be done if this is the direction we want to
go, including double-checking that this doesn't cause significant
performance degradation.

The point of me sending this patch is to make the proposed
direction very clear how it would work. I'm still not sure how I
feel about it.

Thanks,
-Stolee

-->8--
>From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
Date: Wed, 15 Apr 2020 18:04:25 +0000
Subject: [PATCH] bloom: compute all Bloom hashes from lowercase

The changed-path Bloom filters currently hash path strings using
the exact string for the path. This makes it difficult* to use the
filters when restricting to case-insensitive pathspecs.

* I say "difficult" because it is possible to generate all 2^n
  options for the case of a path and test them all, but this is
  a bad idea and should not be done. "Impossible" is an appropriate
  alternative.

THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
Bloom filters computed by a previous commit will not be compatible
with the filters computed in this commit, nor will we get correct
results when testing across these incompatible versions. Normally,
this would be a completely unacceptable change, but the filters
have not been released and hence are still possible to update
before release.

TODO: If we decide to move in this direction, then the following
steps should be done (and some of them should be done anyway):

* We need to document the Bloom filter format to specify exactly
  how we compute the filter data. The details should be careful
  enough that someone can reproduce the exact file format without
  looking at the C code.

* That document would include the tolower() transformation that is
  being done here.

Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
---
 bloom.c                   | 16 ++++++++++++++--
 revision.c                |  5 +++--
 t/t6131-pathspec-icase.sh | 18 ++++++++++++++++++
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index dd9bab9bbd..c919c60f12 100644
--- a/bloom.c
+++ b/bloom.c
@@ -130,8 +130,20 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	uint32_t hash0, hash1;
+	static struct strbuf icase_data = STRBUF_INIT;
+	char *cur;
+
+	strbuf_setlen(&icase_data, 0);
+	strbuf_addstr(&icase_data, data);
+
+	for (cur = icase_data.buf; cur && *cur; cur++) {
+		if (isupper(*cur))
+			*cur = tolower(*cur);
+	}
+
+	hash0 = murmur3_seeded(seed0, icase_data.buf, len);
+	hash1 = murmur3_seeded(seed1, icase_data.buf, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+	int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
 	if (spec->has_wildcard)
 		return 1;
 	if (spec->nr > 1)
 		return 1;
-	if (spec->magic & ~PATHSPEC_LITERAL)
+	if (spec->magic & ~allowed_flags)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+	if (spec->nr && (spec->items[0].magic & ~allowed_flags))
 		return 1;
 
 	return 0;
diff --git a/t/t6131-pathspec-icase.sh b/t/t6131-pathspec-icase.sh
index 39fc3f6769..ee0766bd91 100755
--- a/t/t6131-pathspec-icase.sh
+++ b/t/t6131-pathspec-icase.sh
@@ -106,4 +106,22 @@ test_expect_success '"git diff" can take magic :(icase) pathspec' '
 	test_cmp expect actual
 '
 
+test_expect_success '"git log" takes the magic :(icase) pathspec' '
+	cat >expect <<-\EOF &&
+	FOO/BAR
+	FOO/bAr
+	FOO/bar
+	fOo/BAR
+	fOo/bAr
+	fOo/bar
+	foo/BAR
+	foo/bAr
+	foo/bar
+	EOF
+	git log --pretty=%s HEAD -- ":(icase)foo/bar" >actual &&
+	test_cmp expect actual &&
+	git log --pretty=%s HEAD -- ":(icase)foo" >actual &&
+	test_cmp expect actual
+'
+
 test_done