On 4/15/2020 9:27 AM, Derrick Stolee wrote: > > I'm not against the idea. Logically, collapsing case before hashing > the Bloom keys should not increase the probabilities of false > positives except in the situations where we have case conflicts. > There is a small cost in the pre-hashing step to change the case of > the paths, but that should be much lower than the cost of the hash > itself and the tree parsing to find the changed paths. > >> It all depends on how often people want :(icase) pathspec matches in >> the history, I suspect. My point was that we need to declare that >> :(icase) won't matter in real life (hence we won't optimize our data >> to support that use case), before the way in which the data stored >> in the graph file is computed is cast in stone. > > My earlier statement can be summarized as "we could make this happen" > and you ask here "is it worth doing?" > > I will play around with how complicated the change would be while > the community considers the "is it worth doing?" question. I played a bit, and it wasn't terrible to cover everything by adjusting fill_bloom_key(). I also added a test that we will want anyways. There is more work to be done if this is the direction we want to go, including double-checking that this doesn't cause significant performance degradation. The point of me sending this patch is to make the proposed direction very clear how it would work. I'm still not sure how I feel about it. Thanks, -Stolee -->8-- >From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> Date: Wed, 15 Apr 2020 18:04:25 +0000 Subject: [PATCH] bloom: compute all Bloom hashes from lowercase The changed-path Bloom filters currently hash path strings using the exact string for the path. This makes it difficult* to use the filters when restricting to case-insensitive pathspecs. * I say "difficult" because it is possible to generate all 2^n options for the case of a path and test them all, but this is a bad idea and should not be done. "Impossible" is an appropriate alternative. THIS IS A BREAKING CHANGE. Commit-graph files with changed-path Bloom filters computed by a previous commit will not be compatible with the filters computed in this commit, nor will we get correct results when testing across these incompatible versions. Normally, this would be a completely unacceptable change, but the filters have not been released and hence are still possible to update before release. TODO: If we decide to move in this direction, then the following steps should be done (and some of them should be done anyway): * We need to document the Bloom filter format to specify exactly how we compute the filter data. The details should be careful enough that someone can reproduce the exact file format without looking at the C code. * That document would include the tolower() transformation that is being done here. Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> --- bloom.c | 16 ++++++++++++++-- revision.c | 5 +++-- t/t6131-pathspec-icase.sh | 18 ++++++++++++++++++ 3 files changed, 35 insertions(+), 4 deletions(-) diff --git a/bloom.c b/bloom.c index dd9bab9bbd..c919c60f12 100644 --- a/bloom.c +++ b/bloom.c @@ -130,8 +130,20 @@ void fill_bloom_key(const char *data, int i; const uint32_t seed0 = 0x293ae76f; const uint32_t seed1 = 0x7e646e2c; - const uint32_t hash0 = murmur3_seeded(seed0, data, len); - const uint32_t hash1 = murmur3_seeded(seed1, data, len); + uint32_t hash0, hash1; + static struct strbuf icase_data = STRBUF_INIT; + char *cur; + + strbuf_setlen(&icase_data, 0); + strbuf_addstr(&icase_data, data); + + for (cur = icase_data.buf; cur && *cur; cur++) { + if (isupper(*cur)) + *cur = tolower(*cur); + } + + hash0 = murmur3_seeded(seed0, icase_data.buf, len); + hash1 = murmur3_seeded(seed1, icase_data.buf, len); key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); for (i = 0; i < settings->num_hashes; i++) diff --git a/revision.c b/revision.c index f78c636e4d..a02be25feb 100644 --- a/revision.c +++ b/revision.c @@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { + int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE; if (spec->has_wildcard) return 1; if (spec->nr > 1) return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + if (spec->magic & ~allowed_flags) return 1; - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL)) + if (spec->nr && (spec->items[0].magic & ~allowed_flags)) return 1; return 0; diff --git a/t/t6131-pathspec-icase.sh b/t/t6131-pathspec-icase.sh index 39fc3f6769..ee0766bd91 100755 --- a/t/t6131-pathspec-icase.sh +++ b/t/t6131-pathspec-icase.sh @@ -106,4 +106,22 @@ test_expect_success '"git diff" can take magic :(icase) pathspec' ' test_cmp expect actual ' +test_expect_success '"git log" takes the magic :(icase) pathspec' ' + cat >expect <<-\EOF && + FOO/BAR + FOO/bAr + FOO/bar + fOo/BAR + fOo/bAr + fOo/bar + foo/BAR + foo/bAr + foo/bar + EOF + git log --pretty=%s HEAD -- ":(icase)foo/bar" >actual && + test_cmp expect actual && + git log --pretty=%s HEAD -- ":(icase)foo" >actual && + test_cmp expect actual +' + test_done