Re: [PATCH 1/3] revision: complicated pathspecs disable filters

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 14 Apr 2020 11:25:40 -0700

Derrick Stolee <stolee@xxxxxxxxx> writes:

>> But if users may use icase pathspec very often, it may be worth
>> considering to build the bloom filter after downcasing the paths,
>> perhaps?  Given that many projects extract their source code to a
>> case insensitive filesystem, I would imagine that downcasing paths
>> would map two originally different paths into the same thing only
>> rarely, if ever, so there may not be much downside to do so.
>
> This behavior could be extended later, and carefully. My initial
> thought was that the case check would happen on every commit. If
> the :(icase) check only happens at the walk tip(s), then we could
> compute a single Bloom key at the start.

Sorry, I am not sure what you mean.

Do you mean that we notice that the user wants to match 'foo' case
insensitively, and tell the logic that uses changed-path records in
the graph file that commits that cannot possibly have touched any or
the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not
interesting?

I guess that would work, but I was wondering if it is simpler
without much downside if the changed-path records in the graph file
are prepared on paths after they are normalized to a single case.
That would lose information (e.g. you no longer can say "commits
that touch the path 'foo' is interesting, but those that touch the
path 'Foo' are not"), but makes the side that queries much simpler
(i.e. you do not have to prepare all 8 case permutations---you only
ask about 'foo').

And because the Bloom filter is used only for performance to cull
commits that can never possibly match, allowing a false positive
that would be discarded by actually running tree-diff anyway, the
only potential downside happens when the project has too many paths
that are different only in cases by increased collisions and by
reducing our chances to skip running tree-diff (and never affects
correctness).  

But this is not the "could be extended later" kind of behaviour, I
am afraid.  It is baked in the data stored in the graph file.

It all depends on how often people want :(icase) pathspec matches in
the history, I suspect.  My point was that we need to declare that
:(icase) won't matter in real life (hence we won't optimize our data
to support that use case), before the way in which the data stored
in the graph file is computed is cast in stone.

Thanks.