Re: [PATCH v2 0/3] Teach ref-filter API to correctly handle CRLF in messages

"Philippe Blain via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> The function find_subpos in ref-filter.c looks for two consecutive '\n' to
> find the end of the subject line, a sequence which is absent in messages
> using CRLF. This results in the whole message being parsed as the subject
> line (%(contents:subject)), and the body of the message (%(contents:body))
> being empty.

To be honest, I suspect that it is not a bug in the parser that
parsed out %(contents:subject), but a user error that left the log
message in CRLF endings ;-).

So "correctly handle CRLF" is probably a tad unfair to those who
wrote the current ref-filter code; a description that is more fair
to them is probably along the lines of "handle malformed log
messages more gracefully", I would think.
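
To make the symptom concrete, here is a toy Python model of the
splitting rule described in the quoted text.  It is only a sketch,
not the C code in ref-filter.c, and the name split_on_blank_line is
made up for illustration.

```python
def split_on_blank_line(message):
    """Toy model: the subject ends at the first pair of consecutive LFs."""
    subject, _sep, body = message.partition("\n\n")
    return subject, body


lf_message = "Subject\n\nBody first line\nBody second line\n"
# The same message as it would look with CRLF line endings.
crlf_message = lf_message.replace("\n", "\r\n")

# With LF endings the blank line is found and the message splits in two.
print(split_on_blank_line(lf_message))
# With CRLF endings there is no "\n\n" anywhere, so everything ends up
# in the subject and the body is empty.
print(split_on_blank_line(crlf_message))
```

In the CRLF case the paragraph break is the byte sequence CR LF CR LF,
which never contains two adjacent LFs, so the model reports the whole
message as the subject.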

> Moreover, in copy_subject, '\n' is replaced by space, but '\r' is untouched,
> resulting in the escape sequence '^M' being output verbatim in most terminal
> emulators:
>
> $ git branch --verbose
> * crlf    2113b0e Subject first line^M ^M Body first line^M Body second line
>
> This bug is a regression for git branch --verbose, which bisects down to
> 949af06 (branch: use ref-filter printing APIs, 2017-01-10).

I am not sure where you want to go with this.  Whether it is shown
in the ^X notation (and some terminals even reverse color to
highlight them), or it is shown literally (i.e. causing the next
byte to overwrite the same line starting from the left-edge), you
would be annoyed either way, no?  I suspect that the latter would
annoy you even more.  Isn't what "most terminal emulators" do,
i.e. to show it in the ^X notation instead of emitting it literally,
a good thing?  IOW, "resulting in ..." is not correctly telling us
what you think is wrong---you don't have to blame terminals.

It is not limited to CR, and is not limited to control characters at
the end of the lines, no?  If you had "\a" (or "\r") in the middle
of the title, either the current or the old code would ring a bell
(or cause the next character to appear at the left edge of the same
line), or when piped to "less" you'd see "^G" (or "^M") in the middle
of the line.
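
For reference, the ^X notation simply maps a control byte to "^"
followed by the character whose code is that byte XOR 0x40.  A small
Python sketch of that mapping follows; the helper name caret_notation
is hypothetical, and leaving LF and TAB untouched is an assumption
made to keep the output readable, not a claim about what any
particular pager or terminal does.

```python
def caret_notation(text):
    """Render C0 control characters (except LF and TAB) and DEL as ^X."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\n\t":
            out.append(ch)  # assumption: keep line breaks and tabs as-is
        elif code < 0x20:
            out.append("^" + chr(code ^ 0x40))  # 0x07 -> ^G, 0x0D -> ^M
        elif code == 0x7F:
            out.append("^?")
        else:
            out.append(ch)
    return "".join(out)


# A bell and a CR in the middle of a title.
print(caret_notation("Fix\a bug\r here"))
```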

The old code used the pretty.c::pretty_print_commit() mechanism;
pretty.c::format_subject() uses pretty.c::is_blank_line() to trim
whitespace at the right end of each line while trying to notice where
the first paragraph break is, so any whitespace at the end of each
line of the first paragraph is removed, and each end of line gets
replaced by a SP, but it did not do anything special to control
characters in the middle of the lines.  So while the old code happened
to cleanse CR at the end of the lines, it wasn't doing enough.
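
A rough Python model of that old behaviour, as described above, may
help.  It is a sketch and not a translation of pretty.c: the name
format_subject_model is invented, and the set of characters treated
as trailing whitespace (SP, TAB, CR, LF) is an assumption.

```python
def format_subject_model(message, sep=" "):
    """Join the lines of the first paragraph with sep, trimming each line's tail."""
    parts = []
    for line in message.split("\n"):
        trimmed = line.rstrip(" \t\r\n")  # assumed whitespace set
        if not trimmed:
            break  # a blank (or whitespace-only) line ends the first paragraph
        parts.append(trimmed)
    return sep.join(parts)


# Trailing CRs vanish, because they count as whitespace at the end of a line.
print(repr(format_subject_model("Subject first line\r\n\r\nBody first line\r\n")))
# A bell in the middle of a line survives untouched.
print(repr(format_subject_model("Fix\a bug\r\nmore\r\n\r\nbody\r\n")))
```

The second call shows why this is "not enough": only control
characters that happen to sit at the right end of a line are removed.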

I think fixing _that_ is (and should be) outside the scope of this
series, of course.

>  2:  c68bc2b3788 ! 2:  aab1f45ba97 ref-filter: teach the API to correctly handle CRLF
>      @@ -1,26 +1,49 @@
>       Author: Philippe Blain <levraiphilippeblain@xxxxxxxxx>
>       
>      -    ref-filter: teach the API to correctly handle CRLF
>      +    ref-filter: fix the API to correctly handle CRLF

API is not changed (i.e. the callers do not have to do anything
special); only the implementation.

	ref-filter: handle CR at the end of the lines more gracefully

perhaps?

>           The ref-filter API does not correctly handle commit or tag messages that
>           use CRLF as the line terminator. Such messages can be created with the
>           `--verbatim` option of `git commit` and `git tag`, or by using `git
>           commit-tree` directly.
>       
>      +    This impacts the output `git branch`, `git tag` and `git for-each-ref`
>      +    when used with a `--format` argument containing the atoms
>      +    `%(contents:subject)` or `%(contents:body)`, as well as the output of
>      +    `git branch --verbose`, which uses `%(contents:subject)` internally.

In other words...

	When a commit or a tag object uses CRLF line endings, the
	ref-filter machinery does not identify the end of the first
	paragraph as intended by the writer, because it only looks
	for two consecutive LFs and CR-LF-CR-LF does not look like a
	blank line that separates paragraphs to it.  "git branch",
	"git tag" and "git for-each-ref" all rely on the message
	being split correctly into "%(contents:subject)" and
	"%(contents:body)" placeholders, and end up showing
	everything as the subject.

Now, based on what I hinted at far above, there can be two valid
solutions here.

 * recognize CRLF as a valid line ending, but still retain ^M in the
   message.  The replacement for "%(contents:subject)" would still
   end with "^M", and we add LF to it, which makes the resulting
   output end with CRLF and all is well.  This will keep "\a" and
   "\r" in the middle of the line in the output.

 * strip CR and any control character other than LF from everywhere.
   This will cleanse "\a" and "\r" in the middle of, or anywhere on,
   the line, so that "%(contents:subject)", "%(contents:body)" and
   "%(contents)" all are "clean".

I am not offhand sure which one is better (I haven't read the patch
to see which one you chose to implement).
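
The two alternatives can be sketched in Python as follows.  This is
only an illustration of the difference in output, not a proposal for
the C implementation; the names split_keep_cr, strip_control and
split_clean are hypothetical, and reading "any control character
other than LF" as "every byte below 0x20 except LF, plus DEL" (which
also drops TAB) is an assumption.

```python
import re


def split_keep_cr(message):
    """First option: accept CRLF as a line ending but keep every byte."""
    # A paragraph break is an LF followed by an empty or CR-only line.
    m = re.search(r"\n\r?\n", message)
    if m is None:
        return message, ""
    return message[:m.start()], message[m.end():]


def strip_control(message):
    """Drop control characters other than LF (assumed: < 0x20 and DEL)."""
    return "".join(
        ch for ch in message
        if ch == "\n" or (ord(ch) >= 0x20 and ord(ch) != 0x7F)
    )


def split_clean(message):
    """Second option: cleanse first, then split on a plain blank line."""
    subject, _sep, body = strip_control(message).partition("\n\n")
    return subject, body


msg = "Fix\a bug\r\n\r\nBody\r\n"
print(split_keep_cr(msg))  # subject still carries "\a" and a trailing "\r"
print(split_clean(msg))    # both parts are free of control characters
```

With the first option the subject still ends with CR, so appending LF
to it yields a CRLF-terminated line; with the second, nothing but LF
survives anywhere in the output.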

>      +    The function find_subpos in ref-filter.c looks for two consecutive '\n'
>      +    to find the end of the subject line, a sequence which is absent in
>      +    messages using CRLF. This results in the whole message being parsed as
>      +    the subject line (`%(contents:subject)`), and the body of the message
>      +    (`%(contents:body)`)  being empty.



>      +    Moreover, in copy_subject, '\n' is replaced by space, but '\r' is
>      +    untouched, resulting in the escape sequence '^M' being output verbatim
>      +    in most terminal emulators:
>      ...
>      +    This bug is a regression for `git branch --verbose`, which
>      +    bisects down to 949af0684c (branch: use ref-filter printing APIs,
>      +    2017-01-10).
>      +
>      +    Fix this bug in ref-filter by hardening the logic in `copy_subject` and
>      +    `find_subpos` to correctly parse messages containing CRFL.

The above few lines may need revising (based on what I said to the
cover); --- even if they don't, CRFL here needs to become CRLF ;-)

Thanks for working on this.


