Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names

Hi Richard,

Richard MICHAEL wrote:
>>Richard MICHAEL wrote:

>>> I am filtering our repo with git-filter-branch, but as the sed
>>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>>> commits authored by our team members with accented names.
[...]
> What about special casing the bad sed (or whitelisting good sed)?
> Surely a hack, but those of us with GNU or BSD would be happy.
> Which was the troublesome sed?

Sorry for the slow response.  The problematic sed is GNU sed from
MacPorts (I think).  Even with LC_ALL=C, its .* no longer matches
arbitrary sequences of bytes.  You can check yours with

 $ echo 'étale' | LC_ALL=C sed 's/.*//'

Unfortunately I have not been able to reproduce it on Linux.  Debian
sed 4.2.1-7 and GNU sed v4.2.1-21-gc6d32f0 both produce the expected
result:

 $ echo 'étale' | LC_ALL=C sed 's/.*//'
 $
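
For scripting the same check, something like the following might do.
It is only a sketch: sed_matches_bytes is a made-up helper name, not
anything in git, and it feeds the UTF-8 bytes of 'é' through printf so
the result does not depend on the terminal's encoding.

```shell
# Hypothetical helper (not part of git): succeed if '.*' in the given
# sed consumes a line containing bytes outside 7-bit ASCII when run
# with LC_ALL=C.
sed_matches_bytes () {
	sed_cmd=${1:-sed}
	# \303\251 is the UTF-8 encoding of 'é'; the input line is "étale".
	out=$(printf '\303\251tale\n' | LC_ALL=C "$sed_cmd" 's/.*//') ||
		return 2	# sed itself reported an error
	# An empty result means .* matched the whole line.
	test -z "$out"
}

if sed_matches_bytes sed
then
	echo "sed: .* matches arbitrary bytes under LC_ALL=C"
else
	echo "sed: .* does not match arbitrary bytes under LC_ALL=C"
fi
```

A return status of 1 means some bytes were left unmatched, and 2 means
sed exited with an error.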

> Unfortunately, it
> doesn't "die" well either; the 'export' shell var fails but it keeps
> processing commits.

Hmm, that sounds like a bug indeed.  Here is what the start of a fix
might look like, but I stopped early because there's quite a lot of
sed usage in git that expects to be able to process arbitrary data
with short, newline-terminated lines (regardless of encoding).

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 962a93b..34a5fa3 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -68,8 +68,8 @@ eval "$functions"
 # "author" or "committer
 
 set_ident () {
-	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")"
-	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")"
+	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")" &&
+	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")" &&
 	pick_id_script='
 		/^'$lid' /{
 			s/'\''/'\''\\'\'\''/g
@@ -90,9 +90,9 @@ set_ident () {
 
 			q
 		}
-	'
+	' &&
 
-	LANG=C LC_ALL=C sed -ne "$pick_id_script"
+	LANG=C LC_ALL=C sed -ne "$pick_id_script" &&
 	# Ensure non-empty id name.
 	echo "case \"\$GIT_${uid}_NAME\" in \"\") GIT_${uid}_NAME=\"\${GIT_${uid}_EMAIL%%@*}\" && export GIT_${uid}_NAME;; esac"
 }
@@ -322,9 +322,11 @@ while read commit parents; do
 	git cat-file commit "$commit" >../commit ||
 		die "Cannot read commit $commit"
 
-	eval "$(set_ident AUTHOR <../commit)" ||
+	set_author=$(set_ident AUTHOR <../commit) &&
+	eval "$set_author" ||
 		die "setting author failed for commit $commit"
-	eval "$(set_ident COMMITTER <../commit)" ||
+	set_committer=$(set_ident COMMITTER <../commit) &&
+	eval "$set_committer" ||
 		die "setting committer failed for commit $commit"
 	eval "$filter_env" < /dev/null ||
 		die "env filter failed: $filter_env"
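
To illustrate why the last hunk splits the eval in two: with
eval "$(cmd)", a failing cmd that prints nothing leaves eval with an
empty string, and POSIX says eval returns zero in that case, so the
"|| die" never fires.  Assigning first keeps the exit status of the
command substitution.  A small standalone sketch, where failing_ident
is a hypothetical stand-in for a set_ident whose sed step failed:

```shell
# Hypothetical stand-in for set_ident after the && changes above:
# the first step fails, so nothing is printed and the status is nonzero.
failing_ident () {
	# 'false' plays the role of a sed invocation that exits with an error.
	false &&
	echo "GIT_AUTHOR_NAME=unused"
}

# Old pattern: eval sees only an empty string and reports success,
# so the failure inside the command substitution is lost.
if eval "$(failing_ident)"
then
	old_result="failure hidden"
else
	old_result="failure detected"
fi

# New pattern: the assignment takes the exit status of the command
# substitution, so && stops before eval is ever run.
if script=$(failing_ident) && eval "$script"
then
	new_result="failure hidden"
else
	new_result="failure detected"
fi

echo "eval \"\$(...)\":      $old_result"
echo "assign, then eval: $new_result"
```

Both halves of the patch are needed for this to work: without the &&
chain inside set_ident, the trailing echo would make the function
succeed even when sed had failed.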