On Tue, 20 Apr 2010, Junio C Hamano wrote:
"Henrik Grubbström (Grubba)" <grubba@xxxxxxxxxx> writes:
When the conversion filter for a file is changed, the file may get
listed as modified even though the user has not made any changes to it.
This patch makes the index ignore such changes. It also makes git-diff
compare with the normalized content rather than the original content.
Hmm, I am not happy with this. A typical use case I am imagining goes
like this:
0. You have a project with LF line ending. You clone to a filesystem
that needs autocrlf but somehow it is not set, and end up with files
with LF line ending in your working tree.
Ok, but note that LF is the canonical line ending in git.
1. You notice the mistake, and set autocrlf. "git diff" does not say
anything, as the index is clean.
Ok.
2. Once you fixed the line endings in the working tree files, however,
"git diff" will say the files are different, but there is no actual
change (i.e. you see "diff --git a/file b/file" and nothing else).
Ok.
3. "git update-index --refresh" does not improve the situation, as it
(thinks) it knows the blob and the working tree file are different.
False. "git update-index --refresh" uses read-cache.c:ce_compare_data() to
compare the content of the blob with the normalized content of the working
tree, which both now and with my patch will return true (since the blob in
the repository already is normalized with respect to CRLF).
Let's take the reverse case instead:
0. For some reason a file using CRLF line endings has entered the
repository.
1. The user notices the mistake, and sets crlf. The index is still
clean, but the user wants the file with LF line endings, so the
user does a "git checkout -- the_file".
2. The index is now dirty, so the user performs a "git update-index
--refresh".
Unpatched:
update-index had no effect, since the normalized working tree
file is compared with the unnormalized repository file. The
user will have a dirty working tree until either the normalized
file is committed, or the attribute change is reverted.
Patched:
update-index will compare with the normalized repository file,
which will be equal to the normalized working tree file. The
user now has a clean working tree.
I was hoping to see a solution where you will add a stronger version of
"refresh" without having to do anything else other than recording "how did
I munge the file in the working tree to produce the blob". The third step
would change to:
3. "git update-index --refresh" notices that the conversion parameters
are different since the last time the files in the working tree were
looked at (i.e. immediately after a "clone", working tree files are
what git wrote out using convert_to_working_tree() and you know what
conversion you used; after the user modified files in the working tree
and said "git add", you know you what conversion parameters you ran
convert_to_git() with to produce blobs). The paths that has different
conversion parameters are re-indexed to see if they hash to the same
sha1 as recorded in the index. If they have changed, their index
entries are left intact (i.e. you will still show the differences);
otherwise you update the cached stat information for their index
entries.
I believe that most people that have edited a file that has changed CRLF
convention aren't interested in that all lines have changed, but in only
the lines that have actually been edited.
The above example scenario is about crlf conversion, but the same idea
should apply to other types of conversions (e.g. smudge/clear filter
pair), no?
Yes, the same should apply to other conversions.
I can see that it would be benefitial to store what conversions were used
to turn the input into the canonical version that resulted in the object
store and registered in the index, but I am not sure why the re-indexed
versions need to be even stored in the index (either in-core, let alone
on-disk) nor produce new blob objects. What am I missing?
Storing the normalized sha1 is needed to reduce the amount of double work
(eg having "git update-index --refresh" reperform convert_to_git() for
the repository blob every time a file is dirty, instead of (as now) just
comparing the sha1 values).
Storing the normalized blob data helps reduce (or rather not increase)
code complexity in eg "git diff", since the only change to the code is
that a different blob sha1 is used.
The stored normalized blobs should take minimal space in the repository,
since:
1). The most common case is most likely that the normalized blob is the
same as the original blob (otherwise somebody else would have done
this before).
2). The normalized blob will probably often be stored as a delta, either
against the original blob (eg ident), or backward against the next
commit of the file (eg crlf).
3). If the normalized blob isn't used in deltas, it will sooner or later
be cleaned up by the gc.
--
Henrik Grubbström grubba@xxxxxxxxxx
Roxen Internet Software AB grubba@xxxxxxxxx