Re: [PATCH RFC 5/5] cache: Use ce_norm_sha1().

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 20 Apr 2010 12:12:48 -0700

Henrik Grubbström <grubba@xxxxxxxxx> writes:

>> 3. "git update-index --refresh" does not improve the situation, as it
>>    (thinks) it knows the blob and the working tree file are different.
>
> False. "git update-index --refresh" uses
> read-cache.c:ce_compare_data() to compare the content of the blob with
> the normalized content of the working tree,...

I don't think you tried it yourself.  Here is what should happen with the
current code.

	# step 0 & 1. a project with LF ending
	$ git init two && cd two
        $ echo a quick brown fox >kuzu
        $ git add kuzu && git commit -m kuzu

        # step 2. you want CRLF in your work area
        $ echo -e "a quick brown fox\015" >kuzu
        $ git config core.autocrlf true
	$ git diff
        diff --git a/kuzu b/kuzu

        # step 3. oops, refresh
        $ git update-index --refresh
        kuzu: needs update

And it is a common thing people wish to do.  Admittedly, this is an one-off
event, so it is not _that_ urgent to fix.  You can for example do:

	# step 4. you haven't changed anything really, so you can add
        $ git add kuzu
        $ git diff
        $ git diff --cached ;# empty!

to force re-index.

> Let's take the reverse case instead:
>
>  0. For some reason a file using CRLF line endings has entered the
>     repository.

	# step 0. a blob with CRLF
        $ git init one && cd one
        $ echo -e "a quick brown fox\015" >kuzu
        $ git add kuzu && git commit -m kuzu

>  1. The user notices the mistake, and sets crlf. The index is still
>     clean, but the user wants the file with LF line endings, so the
>     user does a "git checkout -- the_file".

	# step 1. you want CRLF in work area, LF in repository
        $ git config core.autocrlf true
        $ git diff ;# clean!
        $ git checkout -- kuzu
        $ git diff ;# clean!
        $ cat -v kuzu
        a quick brown fox^M

One glitch is that this "checkout" becomes a no-op because the file is
stat-clean.  This is something your "record in the index entry what
normalization was used when we checked it out (or when we read and
hashed)" approach should be able to fix.  It however does not need the
re-indexed object name.

	Side note: if you want to have LF in both work tree and in
        repository, then you wouldn't use core.autocrlf.  Instead you
        would do:

	# step 1 (irrelevant alternative). you want LF in both
        $ dos2unix kuzu
        $ git diff ;# clean!

>  2. The index is now dirty, so the user performs a "git update-index
>     --refresh".

I think you see exactly the same behaviour in the example sequence I gave
you (blobs with LF with working files with CRLF, core.autocrlf set) and in
your example sequence (blobs with CRLF with working files with LF,
core.autocrlf set) in this case.  What happens to my example are already
shown above.  Continuing your example, because in reality the index is not
dirty, we would need to make it stat-dirty first.

	# step 2. you try to refresh
        $ touch kuzu
        $ git update-index --refresh
        kuzu: needs update
        $ git checkout -- kuzu
        $ cat -v kuzu
        a quick brown fox^M
        $ git diff ;# shows changes!
        diff --git a/kuzu b/kuzu
        index ....

If you are trying to somehow make this last "git diff" silent, then I
think you are solving a _wrong_ problem.  By setting retroactively the
CRLF setting, you are saying that you do not want to have CRLF in the
blobs recorded in the repository, and it is a _good thing_ that there are
differences (tons of them) between what is recorded currently and what you
are going to commit to fix the earlier mistake.

>> I was hoping to see a solution where you will add a stronger version of
>> "refresh" without having to do anything else other than recording "how did
>> I munge the file in the working tree to produce the blob".  The third step
>> would change to:
>>
>> 3. "git update-index --refresh" notices that the conversion parameters
>>    are different since the last time the files in the working tree were
>>    looked at (i.e. immediately after a "clone", working tree files are
>>    what git wrote out using convert_to_working_tree() and you know what
>>    conversion you used; after the user modified files in the working tree
>>    and said "git add", you know you what conversion parameters you ran
>>    convert_to_git() with to produce blobs).  The paths that has different
>>    conversion parameters are re-indexed to see if they hash to the same
>>    sha1 as recorded in the index.  If they have changed, their index
>>    entries are left intact (i.e. you will still show the differences);
>>    otherwise you update the cached stat information for their index
>>    entries.
>
> I believe that most people that have edited a file that has changed
> CRLF convention aren't interested in that all lines have changed, but
> in only the lines that have actually been edited.

I think that is not just solving a wrong problem but also is encouraging a
bad workflow at the same time, which is even worse.  If you recorded CRLF
blobs in a project that you do not want to have CRLF blob, fixing that
mistake is one logical change that people who later browse the history
would not want to see intermixed with the "lines that actually have been
edited".

> Storing the normalized sha1 is needed to reduce the amount of double
> work (eg having "git update-index --refresh" reperform
> convert_to_git() for
> the repository blob every time a file is dirty, instead of (as now)
> just comparing the sha1 values).

As I already said, I agree that it would be beneficial to store what
normalization settings were used and comparing that with what settings are
in effect to detect the possible phamtom difference caused by the change
of the settings.  But once we know that the result of a re-normalization
is different from what is recorded in the index (or tree), then the
difference should be shown.  The actual difference would change every time
the work tree file is edited, so I don't see the benefit of contaminate
the object database with intermediate "blobs" that is not "added".
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html