Re: [PATCH/RFC 0/3] Per-repository end-of-line normalization

Robert Buck <buck.robert.j@xxxxxxxxx> · Sat, 8 May 2010 07:36:43 -0400

[...]

>> character.
>
> Erm, this seems to be a counterexample to your point.  It says very
> clearly that the files can use either LF or CRLF line endings, and
> will be parsed correctly either way, or your parser is broken.  So
> pretty much any CRLF conversion rule (or none at all) will work with
> such files.

Perhaps I was not clear, or you did not understand my point.

Read "...by translating... to #xA": XSLT output written to a file
therefore MUST use LF, by definition, to be in canonical form. This is
an example of a TEXT file that MUST, by the definition of the spec, use
LF on all platforms. Looking at the "auto" code that exists in Git, it
does not appear to support this very obvious standard, whereby this
"file-type" should always be checked out of source control with LF
regardless of how it came in. I believe (?) this is equivalent to the
Git "input" setting, but on a per-file-type basis. And Git apparently
does not have the notion of file-types (e.g. *.xml maps to text), does
it?

The point I am really trying to make clear is that there are multiple
dimensions to this problem, and failing to state them explicitly will
result in a botched attempt. We need to carefully distinguish
file-types from the other switches that control whether or not to
perform automatic conversions. The two dimensions are eol-style and
file-type.

THE SWITCHES

So for the switches, here is what would be meaningful to me, short and sweet:

core.autocrlf  :: true false
core.eolstyle  :: local share lf crlf

If autocrlf is false, then what comes out is exactly what goes in.

EOL-STYLE

The eolstyle property only applies to text files (discussed later):

- "local" means normalize "text" files to LF when read in, and convert
to the platform's preferred line ending when materializing workspaces.
- "share" means accept anything, but normalize to LF when writing files
to a workspace (XML, XSLT, some scripting languages ...)
- "lf" means always accept anything, convert it to LF, and output LF
- "crlf" means accept anything and convert to CRLF on output
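
A minimal sketch of how the two switches above might combine, in
Python for brevity. This is not Git code: check_in, check_out and
platform_eol are hypothetical names, "accept anything" is read here as
"store the bytes unchanged", and only CRLF pairs (not lone CRs) are
treated as line endings.

```python
import os

EOL_STYLES = ("local", "share", "lf", "crlf")


def to_lf(data: bytes) -> bytes:
    # Collapse CRLF pairs to LF. Lone CR bytes are left alone (assumption).
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    # Normalize to LF first so existing CRLF pairs are not doubled.
    return to_lf(data).replace(b"\n", b"\r\n")


def check_in(data: bytes, autocrlf: bool, eolstyle: str) -> bytes:
    """Bytes stored in the repository for a text file (hypothetical)."""
    if eolstyle not in EOL_STYLES:
        raise ValueError("unknown eolstyle: " + eolstyle)
    if not autocrlf:
        return data  # what comes out is exactly what goes in
    if eolstyle in ("local", "lf"):
        return to_lf(data)
    return data  # "share" and "crlf": accept anything as-is (assumption)


def check_out(data: bytes, autocrlf: bool, eolstyle: str,
              platform_eol: bytes = os.linesep.encode()) -> bytes:
    """Bytes written to the workspace for a text file (hypothetical)."""
    if eolstyle not in EOL_STYLES:
        raise ValueError("unknown eolstyle: " + eolstyle)
    if not autocrlf:
        return data
    if eolstyle == "local":
        return to_crlf(data) if platform_eol == b"\r\n" else to_lf(data)
    if eolstyle == "crlf":
        return to_crlf(data)
    return to_lf(data)  # "share" and "lf" always materialize LF
```

With these definitions, check_out(blob, True, "share") yields LF on
every platform, which is the behaviour the XSLT example calls for.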

FILE-TYPES

Linus alluded above to file-types, and to being explicit about them.
That's great, I agree. Let me provide examples:

By extension:
    http://www.perforce.com/perforce/doc.current/manuals/cmdref/o.ftypes.html

By pathnames or extensions:
    http://www.perforce.com/perforce/doc.current/manuals/cmdref/typemap.html

Don't beat me up for referencing other systems, please. But as people
move to Git from other systems they will bring some level of
expectation, so understanding those perspectives and expectations
would help you prepare a meaningful answer.
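
To make the file-type dimension concrete, here is a minimal sketch of
a pattern-to-type map in Python. The TYPEMAP entries, the file_type
function and the "auto" fallback are all hypothetical, and the
first-match-wins rule is an arbitrary choice for illustration, not a
description of how any existing system resolves overlaps.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from glob patterns to file-types.
# Patterns are tried in order; the first one that matches wins.
TYPEMAP = [
    ("vendor/*", "binary"),  # by pathname
    ("*.xml", "text"),       # by extension
    ("*.xsl", "text"),
    ("*.png", "binary"),
]


def file_type(path, typemap=TYPEMAP, default="auto"):
    """Return the file-type for a repository-relative path."""
    for pattern, ftype in typemap:
        # fnmatchcase lets '*' match '/' too, so "*.xml" matches at any depth.
        if fnmatchcase(path, pattern):
            return ftype
    return default  # fall back to content-based detection
```

Anything that falls through to "auto" would then be classified by
looking at its content rather than its name.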

AUTO/TEXT-DETECTION

So the above explicit definitions get you most of the way, but what
about "auto"? This question goes to the heart of convert.c: the
gather_stats function, which classifies, among other things, whether
an input is text or binary.

While gather_stats is a good start, it is naively US-centric; it most
assuredly does not address UTF-8 and ISO-8859-1, both of which are
VERY easy to identify, but are not presently handled by this
algorithm. I wrote a simple stat gatherer for the MATLAB kernel years
ago that classified the character-set of arbitrary input text to one
of about a half-dozen common character-sets, so what about adding in a
lightweight checker for at least UTF-8 and ISO-8859-1? I could provide
such a thing back to this community if people wish.

Adding a little more to the gather_stats code to handle a couple more
cases would go a long way, would be easy to do, and does not
necessarily depend upon file-type support. It would simply broaden
what it means to be a text file.
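
A minimal sketch of what such a lightweight checker could look like,
in Python for brevity. It is neither the MATLAB classifier mentioned
above nor the gather_stats logic; the classify name, the set of
control bytes accepted as text, and the rule "no C1 bytes means
ISO-8859-1" are assumptions made for illustration.

```python
# Control bytes tolerated in text: tab, LF, form feed, CR (assumption).
TEXT_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def classify(data: bytes) -> str:
    """Guess 'ascii', 'utf-8', 'iso-8859-1' or 'binary' for a buffer."""
    # Any other C0 control byte, or DEL, is treated as a sign of binary data.
    for b in data:
        if (b < 0x20 and b not in TEXT_CONTROLS) or b == 0x7F:
            return "binary"
    if all(b < 0x80 for b in data):
        return "ascii"
    # High bytes present: strict UTF-8 decoding rejects malformed sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 maps 0x80-0x9F to C1 controls, which rarely occur in text.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "iso-8859-1"
    return "binary"
```

The checks run from the strictest encoding to the loosest, because
every byte sequence is technically decodable as ISO-8859-1 and that
test alone would never reject anything.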