[RFC/PATCH] Fix "git diff" producing broken UTF-8 text

Hello,

I ran into a problem with UTF-8 text in the patch output of "git diff".
The text that follows "@@" in a hunk header is sometimes truncated to a
maximum of 80 bytes. In UTF-8 text that mixes Japanese and English, most
Japanese characters take 3 bytes each, while an ASCII character takes a
single byte.
So truncating the string at 80 bytes can cut off the last 1 or 2 bytes
of the final character, which leaves a broken byte sequence in the
output of "git diff".

The patch command seems to have no trouble reading such patch text, but
the text is not readable for me: Emacs cannot handle the encoding of
such a file and shows me octal numbers instead.
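
To make the failure concrete, here is a minimal sketch (not part of the
patch; cut_splits_char is a hypothetical helper name) that reports
whether cutting a UTF-8 string after n bytes lands in the middle of a
character. It relies only on the fact that UTF-8 continuation bytes have
the bit pattern 10xxxxxx, and it assumes n is not larger than the string
length.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: returns 1 if keeping only
 * the first n bytes of the NUL-terminated UTF-8 string s would cut a
 * multibyte character in two, and 0 otherwise.
 * Assumption: n <= strlen(s), so s[n] is a valid read.
 */
int cut_splits_char(const char *s, size_t n)
{
	unsigned char next = (unsigned char)s[n];

	/* A continuation byte (10xxxxxx) just past the cut means the
	 * character that owns it started before the cut. */
	return (next & 0xc0) == 0x80;
}
```

For example, take "ab" followed by one 3-byte character (bytes e3 81
82): cutting after 3 or 4 bytes splits the character, while cutting
after 2 or 5 bytes does not.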

The patch below is my quick and dirty solution (but it works fine!).
I tested this patch using a Linux kernel document
(Documentation/ja_JP/HOWTO).
I believe it should also work for other languages written in UTF-8 and
solve this issue for them.
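
The patch walks forward from the start of the string. A possible
alternative, shown here only as a sketch and not what the patch does, is
to start at the byte limit and step backward while the byte just past
the cut is a continuation byte (10xxxxxx). The name utf8_trim_len and
its parameters are assumptions for illustration, and the input is
assumed to be valid UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical alternative to the forward scan in the patch.
 * s   : UTF-8 text, len bytes long (need not be NUL-terminated)
 * max : byte limit to truncate to
 * Returns the largest byte count <= max (and <= len) that does not
 * end in the middle of a multibyte character.
 */
size_t utf8_trim_len(const char *s, size_t len, size_t max)
{
	size_t n = len < max ? len : max;

	if (n == len)
		return n;	/* nothing is cut off */

	/* s[n] is the first dropped byte; if it is a continuation byte,
	 * the cut is inside a character, so back up to its lead byte. */
	while (n > 0 && (((unsigned char)s[n]) & 0xc0) == 0x80)
		n--;
	return n;
}
```

With the 6-byte string "ab", one 3-byte character, then "c", a limit of
3 or 4 gives 2, and a limit of 5 gives 5. This avoids the lead-byte
table, at the cost of not checking that the sequence is well formed.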

Please note that this patch handles only UTF-8; we may need to
support other encodings as well.
So, how should we turn on this UTF-8 processing?
Any suggestions are welcome.

Thanks,

Signed-off-by: Tsugikazu Shibata <tshibata@xxxxxxxxxxxxx>
---

diff -upr git-1.5.3.7/xdiff/xutils.c git-1.5.3.7-dev/xdiff/xutils.c
--- git-1.5.3.7/xdiff/xutils.c	2007-12-02 06:21:12.000000000 +0900
+++ git-1.5.3.7-dev/xdiff/xutils.c	2007-12-31 01:30:51.000000000 +0900
@@ -332,6 +332,32 @@ long xdl_atol(char const *str, char cons
 }


+/* return the byte length of the UTF-8 sequence that starts with lead byte c */
+int utf8charsize(const unsigned char c) {
+	int l;
+	if (c < 0x80) l = 1;
+	else if ((c >= 0xc0) && (c <= 0xdf)) l = 2;
+	else if ((c >= 0xe0) && (c <= 0xef)) l = 3;
+	else if ((c >= 0xf0) && (c <= 0xf7)) l = 4;
+	else if ((c >= 0xf8) && (c <= 0xfb)) l = 5;
+	else if ((c >= 0xfc) && (c <= 0xfd)) l = 6;
+	else l = 1; /* fail safe: continuation or invalid byte */
+	return l;
+}
+
+int utf8width(const char *up, int len) {
+	int cs;
+	int l = len;
+	const char *p = up;
+	while ((l > 0) && (p[0] != '\0')) {
+		cs = utf8charsize((unsigned char)p[0]);
+		if (l >= cs) {
+			l -= cs; p += cs;
+		} else l = 0; /* do not split a multibyte character */
+	}
+	return p - up;
+}
+
 int xdl_emit_hunk_hdr(long s1, long c1, long s2, long c2,
 		      const char *func, long funclen, xdemitcb_t *ecb) {
 	int nb = 0;
@@ -368,6 +394,7 @@ int xdl_emit_hunk_hdr(long s1, long c1,
 		buf[nb++] = ' ';
 		if (funclen > sizeof(buf) - nb - 1)
 			funclen = sizeof(buf) - nb - 1;
+		funclen = utf8width(func, funclen);
 		memcpy(buf + nb, func, funclen);
 		nb += funclen;
 	}
