"Philippe Blain via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> The function find_subpos in ref-filter.c looks for two consecutive '\n' to
> find the end of the subject line, a sequence which is absent in messages
> using CRLF. This results in the whole message being parsed as the subject
> line (%(contents:subject)), and the body of the message (%(contents:body))
> being empty.

To be honest, I suspect that it is not a bug in the parser that
parsed out %(contents:subject), but a user error that left the log
message in CRLF endings ;-).

So "correctly handle CRLF" is probably a tad unfair to those who
wrote the current ref-filter code; a description that is more fair
to them is probably along the lines of "handle malformed log
messages more gracefully", I would think.

> Moreover, in copy_subject, '\n' is replaced by space, but '\r' is untouched,
> resulting in the escape sequence '^M' being output verbatim in most terminal
> emulators:
>
> $ git branch --verbose
> * crlf 2113b0e Subject first line^M ^M Body first line^M Body second line
>
> This bug is a regression for git branch --verbose, which bisects down to
> 949af06 (branch: use ref-filter printing APIs, 2017-01-10).

I am not sure where you want to go with this. Whether it is shown
in the ^X notation (and some terminals even reverse color to
highlight them), or it is shown literally (i.e. causing the next
byte to overwrite the same line starting from the left-edge), you
would be annoyed either way, no? I suspect that the latter would
annoy you even more.

Isn't what "most terminal emulators" do, i.e. to show it in the ^X
notation instead of emitting it literally, a good thing? IOW,
"resulting in ..." is not correctly telling us what you think is
wrong---you don't have to blame terminals.

It is not limited to CR, and is not limited to control characters at
the end of the lines, no?
If you had "\a" (or "\r") in the middle of the title, either the
current or the old code would ring a bell (or cause the next
character to appear at the end of the same line) or when piped to
"less" you'd see "^G" (or "^M") in the middle of the line.

The old code used pretty.c::pretty_print_commit() mechanism;
pretty.c::format_subject() uses pretty.c::is_blank_line() to trim
whitespaces at the right end while trying to notice where the first
paragraph break is, so any whitespace at the end of first paragraph
break is removed, and each end of line got replaced by a SP, but it
did not do anything special to control characters in the middle of
the lines.

So while the old code happened to cleanse CR at the end of the
lines, it wasn't doing enough. I think fixing _that_ is (and should
be) outside the scope of this series, of course.

> 2: c68bc2b3788 ! 2: aab1f45ba97 ref-filter: teach the API to correctly handle CRLF
> @@ -1,26 +1,49 @@
> Author: Philippe Blain <levraiphilippeblain@xxxxxxxxx>
>
> - ref-filter: teach the API to correctly handle CRLF
> + ref-filter: fix the API to correctly handle CRLF

API is not changed (i.e. the callers do not have to do anything
special); only the implementation.

    ref-filter: handle CR at the end of the lines more gracefully

perhaps?

> The ref-filter API does not correctly handle commit or tag messages that
> use CRLF as the line terminator. Such messages can be created with the
> `--verbatim` option of `git commit` and `git tag`, or by using `git
> commit-tree` directly.
>
> + This impacts the output `git branch`, `git tag` and `git for-each-ref`
> + when used with a `--format` argument containing the atoms
> + `%(contents:subject)` or `%(contents:body)`, as well as the output of
> + `git branch --verbose`, which uses `%(contents:subject)` internally.

In other words...

    When a commit or a tag object uses CRLF line endings, the
    ref-filter machinery does not identify the end of the first
    paragraph as intended by the writer, because it only looks for
    two consecutive LFs and CR-LF-CR-LF does not look like a blank
    line that separates paragraphs to it. "git branch", "git tag"
    and "git for-each-ref" all rely on the messages split correctly
    into "%(contents:subject)" and "%(contents:body)" placeholders
    and end up showing everything as the subject.

Now based on what I hinted at in the far-above part, there can be
two valid solutions here.

 * recognize CRLF as a valid line ending, but still retain ^M in the
   message. The replacement for "%(contents:subject)" would still
   end with "^M", and we add LF to it, which makes the resulting
   output end with CRLF and all is well. This will keep "\a" and
   "\r" in the middle of the line in the output.

 * strip CR and any control character other than LF from everywhere.
   This will cleanse "\a" and "\r" in the middle of, or anywhere on,
   the line, so that "%(contents:subject)", "%(contents:body)" and
   "%(contents)" all are "clean".

I am not offhand sure which one is better (I haven't read the patch
to see which one you chose to implement).

> + The function find_subpos in ref-filter.c looks for two consecutive '\n'
> + to find the end of the subject line, a sequence which is absent in
> + messages using CRLF. This results in the whole message being parsed as
> + the subject line (`%(contents:subject)`), and the body of the message
> + (`%(contents:body)`) being empty.
> + Moreover, in copy_subject, '\n' is replaced by space, but '\r' is
> + untouched, resulting in the escape sequence '^M' being output verbatim
> + in most terminal emulators:
> ...
> + This bug is a regression for `git branch --verbose`, which
> + bisects down to 949af0684c (branch: use ref-filter printing APIs,
> + 2017-01-10).
> +
> + Fix this bug in ref-filter by hardening the logic in `copy_subject` and
> + `find_subpos` to correctly parse messages containing CRFL.

The above few lines may need revising (based on what I said above in
response to the cover letter); even if they don't, CRFL here needs
to become CRLF ;-)

Thanks for working on this.
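
P.S. For illustration only, here is a rough C sketch of the two
options listed earlier in this message. It is not taken from the
patch or from ref-filter.c: eol_len(), subject_len() and
strip_controls() are made-up names, and unlike
pretty.c::is_blank_line() the sketch does not treat a line holding
only other whitespace as blank.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the line terminator at s (0 if none). */
static size_t eol_len(const char *s)
{
	if (s[0] == '\n')
		return 1;
	if (s[0] == '\r' && s[1] == '\n')
		return 2;
	return 0;
}

/*
 * Option 1 (sketch): find where the first paragraph ends, accepting
 * both LF and CRLF as line terminators.  Returns the offset of the
 * terminator that is followed by a blank line (or by the end of the
 * message), or the message length if there is no such terminator.
 * Any CR inside the subject is left alone.
 */
static size_t subject_len(const char *msg)
{
	const char *p = msg;

	assert(msg != NULL);
	while (*p) {
		size_t n = eol_len(p);

		if (n && (eol_len(p + n) || !p[n]))
			break;
		p += n ? n : 1;
	}
	return (size_t)(p - msg);
}

/*
 * Option 2 (sketch): drop every ASCII control character except LF,
 * in place.  Bytes >= 0x80 are kept so UTF-8 text survives.  Note
 * that this also drops TAB, which may or may not be desirable.
 */
static void strip_controls(char *s)
{
	char *dst = s;

	assert(s != NULL);
	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;

		if (c == '\n' || (c >= 0x20 && c != 0x7f))
			*dst++ = *s;
	}
	*dst = '\0';
}
```

With these, subject_len("Subject\r\n\r\nBody\r\n") is 7, i.e. the
subject stops before the first CRLF, and strip_controls() turns
"Sub\aject\r\n" into "Subject\n". The first keeps stray control
characters in the output; the second makes all three placeholders
"clean" at the cost of rewriting the message bytes.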