Re: [PATCH 1/1] sparse-checkout: respect core.ignoreCase in cone mode

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 11 Dec 2019 15:29:38 -0500

On 12/11/2019 3:00 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
> 
>> I'm trying to find a way around these two ideas:
>>
>> 1. The index is case-sensitive, and the sparse-checkout patterns are
>>    case-sensitive.
> 
> OK.  The latter is local to the repository and not shared to the
> world where people with case sensitive systems would live, right?

Yes, this information is local to a specific local copy. Even worktrees
have distinct sparse-checkout files.

>> 2. If a filesystem is case-insensitive, most index-change operations
>>    already succeed with incorrect case, especially with core.ignoreCase
>>    enabled.
> 
> I am not sure about "incorrect", though.  
> 
> My understanding is that modern case-insensitive systems are not
> information-destroying [*1*].  A new file you create as "ReadMe.txt"
> on your filesystem would be seen in that mixed case spelling via
> readdir() or equivalent, so adding it to the index as-is would
> likely be in the "correct" case, no?  If, after adding that path to
> the index, you did "rm ReadMe.txt" and created "README.TXT", surely
> we won't have both ReadMe.txt and README.TXT in the index with
> ignoreCase set, and keep using ReadMe.txt that matches what you
> added to the index.  I am not sure which one is the "incorrect" case
> in such a scenario.

The specific example I have in my mind is that the filesystem can have
"README.TXT" as what it would report by readdir(), but a user can type

	git add readme.txt

Without core.ignoreCase, the index will store the path as "readme.txt",
possibly adding a new entry in the index next to "README.TXT". If
core.ignoreCase is enabled, then the case reported by readdir() is used.
(This works both for a tracked and untracked path.)

>> The approach I have is to allow a user to provide a case that does not
>> match the index, and then we store the pattern in the sparse-checkout
>> that matches the case in the index.
> 
> Yes, I understood that from your proposed log message and cover
> letter.  They were very clearly written.
> 
> But imagine that your user writes ABC in the sparse pattern file,
> and there is no abc anything in the index in any case permutations.
> 
> When you check out a branch that has Abc, shouldn't the pattern ABC
> affect the operation just like a pattern Abc would on a case
> insensitive system?
> 
> Or are end users perfectly aware that the thing in that branch is
> spelled "Abc" and not "ABC" (the data in Git does---it comes from a
> tree object that is case sensitive)?  If so, then the pattern ABC
> should not affect the subtree whose name is "Abc" even on a case
> insensitive system.

This is a case that is difficult to control for. We have no source
of truth (filesystem or index) at the time of the 'git sparse-checkout
set' command. I cannot think of a solution to this specific issue
without doing the more-costly approach on every unpack_trees() call.

I believe this case may be narrow enough to "document and don't-fix",
but please speak up if anyone thinks this _must_ be fixed.

The other thing I can say is that my current approach _could_ be
replaced in the future by the more invasive check in unpack_trees().
The behavior change would be invisible to users who don't inspect
their sparse-checkout files other than this narrow case.

Specifically, our internal customer is planning to update the
sparse-checkout cone based on files in the workdir, which means that
the paths are expected to be in the index at the time of the 'set'
call.

> I am not sure what the design of this part of the system expects out
> of the end user.  Perhaps keeping the patterns case insensitive and
> tell the end users to spell them correctly is the right way to go, I
> guess, if it is just the filesystem that cannot represente the cases
> correctly at times and the users are perfectly capable of telling
> the right and wrong cases apart.

My first reaction to this internal request was "just clean up your
data." The issue is keeping it clean as users are changing the data
often and the only current checks are whether the build passes (and
the build "sees" the files because the filesystem accepts the wrong
case).

The "git add" behavior made me think there was sufficient precedent
in Git to try this change.

I'm happy to follow the community's opinion in this matter, including
a hard "we will not correct for case in this feature" or "if you want
this then you'll pay for it in performance." In the latter case I'd
prefer a new config option to toggle the sparse-checkout case
insensitivity. That way users could have core.ignoreCase behavior without
the perf hit if they use clean input to the sparse-checkout feature.

> But then I am not sure why correcting misspelled names by comparing
> them with the index entries is a good idea, either.

Right, that seems like a line too far.

>> It sounds like you are preferring this second option, despite the
>> performance costs. It is possible to use a case-insensitive hashing
>> algorithm when in the cone mode, but I'm not sure about how to do
>> a similar concept for the normal mode.
> 
> I have no strong preference, other than that I prefer to see things
> being done consistently.  Don't we already use case folding hash
> function to ensure that we won't add duplicate to the index on
> case-insensitive system?  I suspect that we would need to do the
> same, whether in cone mode or just a normal sparse-checkout mode
> consistently.

Since the cone mode uses a hashset to match the paths to the patterns,
that conversion is possible. I'm not sure how to start making arbitrary
regexes be case-insensitive in the non-cone-mode case. Suggestions?

Thank you for the discussion. I was hoping to get feedback on this
approach, which is why this patch is isolated to only this issue.

Thanks,
-Stolee