Re: whitespace ignoring during diff -M

[this is a bit of an old message, but I am way behind on git mail,
 and nobody else seems to have responded, so...]

On Sun, May 31, 2009 at 10:28:50PM +0200, Daniel Mierswa wrote:

> I was told to try it here after visiting #git/Freenode
> I want git to think that the diff of two branches where filenames and
> whitespace amount differ are the same.
> The following is a snippet from my terminal with output, is there a
> chance to make git think that those are equal?

Rename detection in git does not respect the "-w" option at all. It
hashes each line of a text file, and then compares the hashes to see how
"similar" the files are.

It already makes some effort to ignore the CR in a CRLF sequence when
calculating the hash. So just running "unix2dos" (or vice versa) on a
file should still allow it to find renames.
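
To illustrate the idea, here is a minimal sketch of a per-line hash that
drops the CR of a CRLF pair. It is not the code from diffcore-delta.c:
the name line_hash is hypothetical and the djb2-style hash is swapped in
for brevity; git's real hash works differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not git's implementation: hash one line of
 * `len` bytes, skipping a CR that is immediately followed by LF.
 */
static unsigned long long line_hash(const char *line, size_t len)
{
	unsigned long long h = 5381; /* djb2 seed, an assumption of this sketch */
	size_t i;

	assert(line != NULL);
	for (i = 0; i < len; i++) {
		/* Drop the CR of a CRLF pair; a lone CR is still hashed. */
		if (line[i] == '\r' && i + 1 < len && line[i + 1] == '\n')
			continue;
		h = h * 33 + (unsigned char)line[i];
	}
	return h;
}
```

With this, "foo\r\n" and "foo\n" hash to the same value, so a file and
its unix2dos'd copy look identical line by line, while a line that
differs only in spaces or tabs still hashes differently.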

This could probably be extended fairly trivially to ignore arbitrary
whitespace when generating the hash. I'm not sure whether the feature
should be triggered by "-w". That makes sense to me, but there may be
cases where people want diff generation to follow different rules than
rename detection. We might even want rename detection to ignore
whitespace _always_, as it already does with the CR in CRLF. Somebody
would need to check the results of the two approaches against a number
of cases.

If you are interested, the relevant code is in hash_chars in
diffcore-delta.c. A trivial implementation would probably look something
like the patch below. I tested it with:

  git init
  cp /usr/share/dict/words words && git add words && git commit -m one
  sed 's/^/  /' <words >munged
  git add munged && git rm words
  git diff --cached --summary

which curiously reports 82% similarity. So maybe there is more
investigation to be done. Anyway, patch below.
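
One possible explanation, stated as an assumption and not something
verified here: if the score is the number of matched bytes divided by
the size of the larger file, then the skipped whitespace still counts
toward the file size but never toward the matched bytes. A sketch of
that model (similarity_percent is a hypothetical name, and the numbers
in the note after it are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical scoring model, not git's code: percentage of matched
 * bytes relative to the larger of the two file sizes.
 */
static unsigned similarity_percent(size_t matched, size_t src_size,
				   size_t dst_size)
{
	size_t max_size = src_size > dst_size ? src_size : dst_size;

	assert(matched <= max_size);
	if (!max_size)
		return 100; /* two empty files are trivially identical */
	return (unsigned)(matched * 100 / max_size);
}
```

Under that model, 1000 lines averaging 9 bytes each give a 9000-byte
source; prepending two spaces per line gives an 11000-byte destination,
and even if all 9000 bytes match the score is only 81%.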

---

diff --git a/diffcore-delta.c b/diffcore-delta.c
index e670f85..63704da 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -145,6 +145,8 @@ static struct spanhash_top *hash_chars(struct diff_filespec *one)
 		/* Ignore CR in CRLF sequence if text */
 		if (is_text && c == '\r' && sz && *buf == '\n')
 			continue;
+		if (is_text && (c == ' ' || c == '\t'))
+			continue;
 
 		accum1 = (accum1 << 7) ^ (accum2 >> 25);
 		accum2 = (accum2 << 7) ^ (old_1 >> 25);