Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 26 Oct 2015 11:31:39 -0700

Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx> writes:

> Johannes Schindelin <johannes.schindelin@xxxxxx> writes:
>
>> This is the correct thing to do, really: we already specify LF as
>> field separator.
>
> I'm almost convinced that this is the right thing to do in the long run
> ("almost" because I'm not sure, not because I have arguments against). I
> agree with Junio that the commit message should be more convincing, but
> indeed, accepting LF and not CR is strange.

If there were a single character that denotes CRLF, I'd say that
including such a character in IFS would make sense on a system with
CRLF EOL convention.  But that is not the case.

On a platform with LF EOL convention, having LF in IFS makes sense,
due to two reasons.

 * read is not the only user of IFS.  Expressing "list of things"
   (pre bashism "shell array" days) by concatenating elements into a
   single string variable, separated with LF, and later iterating
   over them is a very common use case, e.g.

	LF='
        '
	list="$thing1"
	list="$list$LF$thing2"
	list="$list$LF$thing3"
        ...

	IFS=$LF
	for thing in $list
        do
        	...

   And including LF by default in IFS, especially when "things" can
   contain SP/HT, is handy.

 * It does not hurt on a platform with LF EOL convention to include
   LF in IFS, because you cannot have a "lone LF" in the middle of a
   line.  Presence of a single LF alone will terminate the current
   line and the bytes that follow it will be a next line.

No similarity argument can be made for a lone CR on a platform that
uses CRLF EOL convention.  A "lone CR" can appear in the middle of a
line without terminating the current line, as only a CR that is
immediately followed by a LF is the end of line, so you cannot make
the "It does not hurt" argument for including CR to IFS.  If you
have a variable in which "A^MB" is there, "set $variable" would
split them into two and assign B to $2, which is not what the
scripts would expect.

> However, is this the right thing to do in the maintainance branch? It
> does fix the issue, but does so in a rather intrusive way, so I'd need
> more arguments to be convinced that this is safe to merge in maint.

If this were the "right" thing in general for shell scripts on
systems with CRLF EOL convention, the implementation of the shell
language on such a platform would be doing that upon startup (or
upon "unset IFS"), and we wouldn't be having this discussion.  Don't
you think it is strange that individual applications like Git have
to set IFS to include CR?  I see it as a strong sign that this is
not a correct thing to do.

Intrusive does not begin to describe it.

I think the "right" thing for a shell implementation on a CRLF
platform to do is twofold.

 * Read
   http://pubs.opengroup.org/onlinepubs/9699919799/utilities/read.html
   and notice that the operation is defined as a two-step process.
   It is supposed to read an input line and then "The terminating
   <newline> (if any) shall be removed from the input".  And then
   "the result shall be split into fields" (at IFS boundaries).

   As POSIX does not say <newline> has to be a single octet with
   value 10 (the same way you are allowed to show CRLF when you call
   printf("\n")), a shell implementation would read CRLF terminated
   lines, and remove the terminating CRLF before doing the field
   splitting using IFS.

   This is why I said in my message to Dscho that IFS does not get
   into the picture.

 * Implement the "field splitting" (at IFS boundaries) with a small
   twist.  When the script/user included '\n' in IFS, split at LF
   but remove CR if there is one that immediately precedes the LF.

The latter would help if you did the previous "read is not the only
user of IFS" example like this instead:

        LF=$(printf '\n.')
        LF=${LF%.}
	list="$thing1"
	list="$list$LF$thing2"
	list="$list$LF$thing3"
        ...

	IFS=$LF
	for thing in $list
        do
        	...

and the platform 'printf' gave CRLF for "\n" (newline), resulting
the list to be concatenated with CRLF, not LF.

And this is why I asked Dscho if the shell is done correctly _for_
the platform.

> Sorry for being negative, and especially sorry since I'm partly guilty
> for the breakage. I just want to be sure that we don't break anything
> while repairing it (we already introduced this breakage while repairing
> another one...).

Something along the line of the following would be tolerable, even
though I think in the longer term, not just in Git land but in the
larger ecosystem to use POSIXy tools on Windows, the best solution
is to fix the shell so that it matches the expectation of the users
of its platform.

I say "something along the line of" here because I do not know how
the problematic shell behaves on the assignment command that stuffs
a lone CR into a variable.  It _might_ need a similar protection
against the shell feature "the last EOL is removed from the result
of command expansion", as I did in the above example, depending on
how incoherent the implementation is.  The implementation seems to
accept CRLF and LF in shell scripts proper just fine, but apparently
its implementation of "read" does not honor such platform EOL
convention, which caused this issue, and I don't know what it does
in the codepath that implements command expansion.

 git-rebase--interactive.sh | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index c42ba34..8f13a35 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -881,6 +881,7 @@ check_commit_sha () {
 check_bad_cmd_and_sha () {
 	retval=0
 	lineno=0
+	cr=$(printf "%c" 13)
 	while read -r command rest
 	do
 		lineno=$(( $lineno + 1 ))
@@ -888,6 +889,10 @@ check_bad_cmd_and_sha () {
 		"$comment_char"*|''|noop|x|exec)
 			# Doesn't expect a SHA-1
 			;;
+		"$cr")
+			# Windows port of shell not stripping the newline
+			# at the end of an empty line correctly.
+			;;
 		pick|p|drop|d|reword|r|edit|e|squash|s|fixup|f)
 			if ! check_commit_sha "${rest%%[ 	]*}" "$lineno" "$1"
 			then
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html