Re: [PATCH 1/3] revision: complicated pathspecs disable filters

On 4/14/2020 2:25 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
> 
>>> But if users may use icase pathspec very often, it may be worth
>>> considering to build the bloom filter after downcasing the paths,
>>> perhaps?  Given that many projects extract their source code to a
>>> case insensitive filesystem, I would imagine that downcasing paths
>>> would map two originally different paths into the same thing only
>>> rarely, if ever, so there may not be much downside to do so.
>>
>> This behavior could be extended later, and carefully. My initial
>> thought was that the case check would happen on every commit. If
>> the :(icase) check only happens at the walk tip(s), then we could
>> compute a single Bloom key at the start.
> 
> Sorry, I am not sure what you mean.

That's my fault. There are a couple of things I misunderstood here.

1. Thinking about "git blame", we would need to "collapse" a pathspec
   to a specific file before starting the history walk. But blame
   doesn't allow ":(icase)" anyway.

2. With that context of "git blame" in my head, I was thinking
   (incorrectly) that "git log" would collapse the pathspec based on
   what file(s) match the pattern at HEAD. The tests in
   t6131-pathspec-icase.sh clearly show that this is wrong. In fact,
   if we apply the following diff to this patch, then we can get failures
   with the changed-path filters:

diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+       int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
        if (spec->has_wildcard)
                return 1;
        if (spec->nr > 1)
                return 1;
-       if (spec->magic & ~PATHSPEC_LITERAL)
+       if (spec->magic & ~allowed_flags)
                return 1;
-       if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+       if (spec->nr && (spec->items[0].magic & ~allowed_flags))
                return 1;
 
        return 0;
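
To make the effect of that mask change concrete, here is a minimal standalone
sketch of the same check. It is not Git code: the DEMO_* flag values and
struct demo_pathspec are hypothetical stand-ins for the PATHSPEC_* constants
and the struct pathspec fields the real function reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
enum {
	DEMO_LITERAL = 1 << 0,
	DEMO_ICASE   = 1 << 1,
	DEMO_GLOB    = 1 << 2
};

/* Simplified stand-in for the fields of struct pathspec used by the check. */
struct demo_pathspec {
	int nr;               /* number of pathspec items */
	int has_wildcard;     /* non-zero if any item contains a wildcard */
	unsigned magic;       /* magic bits set on the pathspec as a whole */
	unsigned item0_magic; /* magic bits set on the first item */
};

/* Returns 1 when the filters must not be used, 0 when they may be. */
static int demo_forbid_bloom_filters(const struct demo_pathspec *spec,
				     unsigned allowed_flags)
{
	assert(spec != NULL);
	if (spec->has_wildcard)
		return 1;
	if (spec->nr > 1)
		return 1;
	/* Any magic bit outside the allowed set disables the filters. */
	if (spec->magic & ~allowed_flags)
		return 1;
	if (spec->nr && (spec->item0_magic & ~allowed_flags))
		return 1;
	return 0;
}
```

With allowed_flags set to DEMO_LITERAL alone, a single icase pathspec is
rejected. Adding DEMO_ICASE to the mask lets it through, so the filters get
queried with a case-sensitive key for a case-insensitive match.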

> Do you mean that we notice that the user wants to match 'foo' case
> insensitively, and tell the logic that uses changed-path records in
> the graph file that commits that cannot possibly have touched any of
> the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not
> interesting?
> 
> I guess that would work, but I was wondering if it is simpler
> without much downside if the changed-path records in the graph file
> are prepared on paths after they are normalized to a single case.
> That would lose information (e.g. you no longer can say "commits
> that touch the path 'foo' are interesting, but those that touch the
> path 'Foo' are not"), but makes the side that queries much simpler
> (i.e. you do not have to prepare all 8 case permutations---you only
> ask about 'foo').
> 
> And because the Bloom filter is used only for performance to cull
> commits that can never possibly match, allowing a false positive
> that would be discarded by actually running tree-diff anyway, the
> only potential downside happens when the project has too many paths
> that are different only in cases by increased collisions and by
> reducing our chances to skip running tree-diff (and never affects
> correctness).  
> 
> But this is not the "could be extended later" kind of behaviour, I
> am afraid.  It is baked in the data stored in the graph file.

Since the feature is not released, we still have time to update the
format if we so desire. With the current format, we would need to
disable the filters when using an :(icase) pathspec, as the current
patch does.
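
For a sense of what the query-side alternative (asking about every case
permutation) would involve with the current format, here is a rough sketch.
It is not Git code: the demo_* helpers are hypothetical, handle ASCII letters
only, and assume the path has fewer letters than an unsigned long has bits.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Count ASCII letters in path; each one doubles the number of variants. */
static size_t demo_count_letters(const char *path)
{
	size_t n = 0;

	assert(path != NULL);
	for (; *path; path++)
		if (isalpha((unsigned char)*path))
			n++;
	return n;
}

/*
 * Write one case variant of path into out, which must hold at least
 * strlen(path) + 1 bytes. Bit i of mask selects upper case for the
 * i-th letter; letters beyond the width of mask stay lower case.
 */
static void demo_case_variant(const char *path, unsigned long mask, char *out)
{
	size_t letter = 0;

	assert(path != NULL && out != NULL);
	for (; *path; path++, out++) {
		unsigned char c = (unsigned char)*path;

		if (isalpha(c)) {
			int upper = letter < sizeof(mask) * CHAR_BIT &&
				    ((mask >> letter) & 1UL);
			*out = (char)(upper ? toupper(c) : tolower(c));
			letter++;
		} else {
			*out = (char)c;
		}
	}
	*out = '\0';
}
```

For 'foo' there are 3 letters, so masks 0 through 7 yield the 8 variants.
The count doubles with every additional letter, which is what makes this
approach unattractive for realistic paths.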

I'm not against the idea. Logically, collapsing case before hashing
the Bloom keys should not increase the probability of false
positives except in the situations where we have case conflicts.
There is a small cost in the pre-hashing step to change the case of
the paths, but that should be much lower than the cost of the hash
itself and the tree parsing to find the changed paths.
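
As a minimal sketch of what "collapsing case before hashing" could look like:
the helper below folds each byte to lower case as it is fed to the hash.
FNV-1a is used here only as a simple stand-in and is not meant to represent
the hash in Git's Bloom filter code; demo_hash_path is a hypothetical name,
and tolower() only covers ASCII in the default C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hash a path with 32-bit FNV-1a. When fold_case is non-zero, each byte
 * is lowercased first, so paths differing only in case share one key.
 */
static uint32_t demo_hash_path(const char *path, int fold_case)
{
	uint32_t h = 2166136261u; /* FNV-1a 32-bit offset basis */

	assert(path != NULL);
	for (; *path; path++) {
		unsigned char c = (unsigned char)*path;

		if (fold_case)
			c = (unsigned char)tolower(c);
		h ^= c;
		h *= 16777619u; /* FNV-1a 32-bit prime */
	}
	return h;
}
```

With fold_case set, "Foo/Bar.c" and "foo/bar.c" produce the same key, so an
:(icase) query would only need to ask about the lowercased path. Without it,
the two paths hash differently, which is why the current format cannot
answer case-insensitive queries from a single key.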

> It all depends on how often people want :(icase) pathspec matches in
> the history, I suspect.  My point was that we need to declare that
> :(icase) won't matter in real life (hence we won't optimize our data
> to support that use case), before the way in which the data stored
> in the graph file is computed is cast in stone.

My earlier statement can be summarized as "we could make this happen"
and you ask here "is it worth doing?"

I will play around with how complicated the change would be while
the community considers the "is it worth doing?" question.

Thanks,
-Stolee


