Re: Bug report : bad filter-branch (OSX only)

Jeff King <peff@xxxxxxxx> · Wed, 29 Apr 2015 11:48:58 -0400

On Tue, Apr 28, 2015 at 10:39:44PM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > I'm not sure of a solution short of replacing the use of sed here with
> > something else. perl would be a simple choice, but filter-branch does
> > not otherwise depend on it. We could use a shell "read" loop, but those
> > are quite slow (and filter-branch is slow enough as it is!).
> 
> You need to only skip the header part, right?
> I would imagine that
> 
> (
> 	while read x && test -n "$x"
>         do
>         	:;
> 	done
> 	cat
> ) <../commit | eval "$filter_msg"
> 
> would not spin too much in shell loop, perhaps?

Yeah, that is not too bad. Probably we want "read -r", just in case of
weirdness in the header lines (and that's in POSIX, and we use it
in other scripts, so it should be portable enough). And we can save a
subshell if we don't mind the potential variable-name conflict.

Here's what I came up with.

-- >8 --
Subject: filter-branch: avoid passing commit message through sed

On some systems (like OS X), if sed encounters input without
a trailing newline, it will silently add it. As a result,
"git filter-branch" on such systems may silently rewrite
commit messages that omit a trailing newline. Even though
this is not something we generate ourselves with "git
commit", it's better for filter-branch to preserve the
original data as closely as possible.

We're using sed here only to strip the header fields from
the commit object. We can accomplish the same thing with a
shell loop. Since shell "read" calls are slow (usually one
syscall per byte), we use "cat" once we've skipped past the
header. Depending on the size of your commit messages, this
is probably faster (you pay the cost to fork, but then read
the data in saner-sized chunks). This idea is shamelessly
stolen from Junio.

Signed-off-by: Jeff King <peff@xxxxxxxx>
---
I confirmed the test fixes things on the OS X box I have access to.

The "probably faster" above is of course hand-waving. On my system
starting "cat" takes only about 40 syscalls, so that would naively imply
it's a win in for all but the shortest messages. But of course "fork()"
is a much more expensive syscall than "read()".

 git-filter-branch.sh     | 10 +++++++++-
 t/t7003-filter-branch.sh | 10 ++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index e6e99f5..5b3f63d 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -346,7 +346,15 @@ while read commit parents; do
 				die "parent filter failed: $filter_parent"
 	fi
 
-	sed -e '1,/^$/d' <../commit | \
+	{
+		while read -r header_line && test -n "$header_line"
+		do
+			# skip header lines...
+			:;
+		done
+		# and output the actual commit message
+		cat
+	} <../commit |
 		eval "$filter_msg" > ../message ||
 			die "msg filter failed: $filter_msg"
 	workdir=$workdir @SHELL_PATH@ -c "$filter_commit" "git commit-tree" \
diff --git a/t/t7003-filter-branch.sh b/t/t7003-filter-branch.sh
index 66643e4..855afda 100755
--- a/t/t7003-filter-branch.sh
+++ b/t/t7003-filter-branch.sh
@@ -394,4 +394,14 @@ test_expect_success 'replace submodule revision' '
 	test $orig_head != `git show-ref --hash --head HEAD`
 '
 
+test_expect_success 'filter commit message without trailing newline' '
+	git reset --hard original &&
+	commit=$(printf "no newline" | git commit-tree HEAD^{tree}) &&
+	git update-ref refs/heads/no-newline $commit &&
+	git filter-branch -f refs/heads/no-newline &&
+	echo $commit >expect &&
+	git rev-parse refs/heads/no-newline >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.4.0.rc3.477.gc25258d

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html