[PATCH v4 00/21] Peace with CRLF

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We have too many topics titled "War on something"; let's try to make
peace for a change.

This is a reroll of

  http://thread.gmane.org/gmane.comp.version-control.git/282614

which is a continuation to $gmane/275735, which is filed in the
archive under another mailing list:

  http://thread.gmane.org/gmane.comp.version-control.msysgit/21773

but with a larger clean-up whose early part was sent earlier, marked
with the "PREVIEW v3" label.

  http://thread.gmane.org/gmane.comp.version-control.git/284001

We read "text" files from filesystem (or from the standard input) in
various places in the code.  Some of them are written by us, but
others are often prepared by an editor driven by human user.  Even
though we write (and expect) lines in these text files to be
terminated with LF (not CRLF), the end-user's editor can be told to
use CRLF termination, and on some platforms that may be the default.

Typically, we use strbuf_getline() to read these "text" files, which
reads a single line up to the first LF into a buffer, discard that
LF at the end, and returns (an incomplete last line gets returned
as-is).  Lines from a file edited on Dossy systems hence may have
a CR at the end that the end-user did not intend to be a part of it.

Because many codepaths (e.g. commit log message) first pass the
contents of such a file through stripspace(), and as a side effect
of discarding the whitespace at the end of each line, lines
terminated with CRLF are normalized to LF terminated lines (but we
do not lose a lone CR in the middle of a non-blank string), when we
process what we read from these files, this is not a problem for
them, but not all of the codepaths are safe.

In these places that expect to read "text", we should be able to
update the logic to read a line up to the first LF, discard that LF,
together with a CR if one appears immediately before that LF,
without breaking anything.  That includes cases where the program
might want to treat a CR in the middle of a line as proper payload
the user intended to give us.  Such a variant of strbuf_getline()
already exists as strbuf_getline_crlf() and used in the "git am"
that was reimplemented in C recently.

The strbuf_getline() function takes a "line-termination" as its
third argument.  Its interface is overly broad in that it allows an
arbitrary byte to be used to mark the end of a line, but in
practice, no codepaths pass any byte other than LF or NUL.  In
theory, the callers that use LF as the line-termination are
expecting "text" input, so we could blindly replace them with
strbuf_getline_crlf() and not many things should break.  To be
conservative, however, this series converts only chosen ones after
auditing that such a conversion is safe.

To make the tree-wide change manageable, the early part of this
series first changes all callsites of strbuf_getline() to call one
of the two new thin wrapper functions, strbuf_getline_lf() and
strbuf_getline_nul(), depending on the value of "line-termination"
used by the caller.  This is done with Patches 1-9, which is
essentially the same as "PREVIEW v3" series posted earlier, with
small changes:

 - The update to mktree used "lf_lines" in the preview, but this
   version flips it to the same "nul_term_line" to match other two,
   even though I think the latter is a terrible variable name with
   bad taste.

 - The update to checkout-index kept "line_termination" in the
   preview, but this version no longer uses it and instead uses
   nul_term_line.

 - The definition of nul_term_line for the option parser uses
   OPT_BOOL(), as the possible values in it are true or false, not
   '\0' or '\n'.

 - The topic was rebased on v2.7.0; the bulk of builtin/mailinfo.c
   moved to mailinfo.c between the old base of the topic and v2.7.0,
   so the mechanical renaming done in Patch 3 was redone.

After Patch 9, all the original callers of strbuf_getline() that
used LF line ending call strbuf_getline_lf().  No caller calls
strbuf_getline() at that point, so strbuf_getline_crlf() takes over
the short and nice name strbuf_getline().

The callers of strbuf_getline_lf() are inspected by the remainder of
the series and updated to call strbuf_getline() instead when it
makes sense to do so.  They all had to be updated relative to the
previous round (v2), but the callsites that are updated haven't
changed.

This series converts only the callers of strbuf_getline() that would
misbehave when fed a file with CRLF-terminated lines and use the
data with an unwanted CR appended at the end.  With the update the
code should work as intended with such a file, without breaking the
expected behaviour when working on a file with LF-terminated lines.

Notes on the remaining strbuf_getline_lf() callers:

 * Those that expect to only read from our own output do not have to
   accommodate CRLF-terminated lines, but they can be updated to do
   so safely if we do not rely on the ability to express a payload
   that has CR at the end.  For example, the insn sheet "rebase -i"
   uses typically ends each line with a commit log summary that we
   ourselves do not read or use, so even if the end-user on a
   platform with LF lines deliberately does insert CR at the end of
   the line and strbuf_getline() removed that CR from the payload,
   no unexpected behaviour should happen.

   It may be a good GSoC microproject to convert them to use
   strbuf_getline().

 * Those that call strbuf_trim() immediately on the result before
   doing anything, or otherwise have logic that tolerates
   whitespaces at the end of the line, can continue using
   strbuf_getline_lf() and will not misbehave on CRLF-terminated
   lines.

   This series does not touch them; converting them to use
   strbuf_getline() would a good way to document them as dealing
   with "text".  While doing so, they can also lose their custom
   logic that happens to make CRLF-terminated lines work.

   It is a good GSoC microproject to convert them to use
   strbuf_getline().

 * Those that want to only ever see LF-terminated lines (in other
   words, "ABC\r\n" must be treated as a line with 4 bytes payload
   on it, A B C and CR), would be broken if we replace it with
   strbuf_getline().  This series does not touch them, and no
   follow-up series should, either.


Junio C Hamano (21):
  strbuf: miniscule style fix
  strbuf: make strbuf_getline_crlf() global
  strbuf: introduce strbuf_getline_{lf,nul}()
  mktree: there are only two possible line terminations
  check-attr: there are only two possible line terminations
  check-ignore: there are only two possible line terminations
  update-index: there are only two possible line terminations
  checkout-index: there are only two possible line terminations
  strbuf: give strbuf_getline() to the "most text friendly" variant
  hash-object: read --stdin-paths with strbuf_getline()
  revision: read --stdin with strbuf_getline()
  rev-parse: read parseopt spec with strbuf_getline()
  ident.c: read /etc/mailname with strbuf_getline()
  remote.c: read $GIT_DIR/remotes/* with strbuf_getline()
  clone/sha1_file: read info/alternates with strbuf_getline()
  transport-helper: read helper response with strbuf_getline()
  cat-file: read batch stream with strbuf_getline()
  column: read lines with strbuf_getline()
  send-pack: read list of refs with strbuf_getline()
  grep: read -f file with strbuf_getline()
  test-sha1-array: read command stream with strbuf_getline()

 bisect.c                   |  8 ++++----
 builtin/am.c               | 37 +++++++++++--------------------------
 builtin/cat-file.c         |  2 +-
 builtin/check-attr.c       |  7 ++++---
 builtin/check-ignore.c     |  7 ++++---
 builtin/check-mailmap.c    |  2 +-
 builtin/checkout-index.c   | 16 ++++++++--------
 builtin/clean.c            |  6 +++---
 builtin/clone.c            |  2 +-
 builtin/column.c           |  2 +-
 builtin/commit.c           |  2 +-
 builtin/fetch-pack.c       |  2 +-
 builtin/grep.c             |  2 +-
 builtin/hash-object.c      |  2 +-
 builtin/mktree.c           | 14 ++++++++------
 builtin/notes.c            |  2 +-
 builtin/pull.c             |  2 +-
 builtin/repack.c           |  2 +-
 builtin/rev-parse.c        |  4 ++--
 builtin/send-pack.c        |  2 +-
 builtin/update-index.c     | 27 ++++++++++++++++-----------
 compat/terminal.c          |  2 +-
 credential-cache--daemon.c |  4 ++--
 credential-store.c         |  2 +-
 credential.c               |  2 +-
 daemon.c                   |  2 +-
 fast-import.c              |  4 ++--
 ident.c                    |  2 +-
 mailinfo.c                 |  8 ++++----
 remote-curl.c              |  6 +++---
 remote-testsvn.c           |  4 ++--
 remote.c                   |  4 ++--
 revision.c                 |  9 ++-------
 sequencer.c                |  2 +-
 sha1_file.c                |  2 +-
 shell.c                    |  2 +-
 strbuf.c                   | 28 +++++++++++++++++++++++++---
 strbuf.h                   | 27 ++++++++++++++++++++++-----
 test-sha1-array.c          |  2 +-
 transport-helper.c         |  5 +++--
 walker.c                   |  2 +-
 wt-status.c                |  4 ++--
 42 files changed, 151 insertions(+), 122 deletions(-)

-- 
2.7.0-250-ge1b5ba3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]