Re: CRLF, LF ... CR ?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Junio and David.

Rule is in fact quite simple.
If it's a text-file and it contains a LF, a CRLF or a CR, then that's a line-break. :)
-So everywhere a LF is checked for, a CR should most likely be checked for.
Usually, when checking for CRLF, one is looking for the LF. If a CR precedes the LF, the CR is discarded.
It's done this way, in case we have a small buffer (say 100 bytes), and we read the file only 100 bytes each time.
If we searched for CR instead, we can't check the next character without a lot of clumsy or slow coding.
-But even today, I don't believe that a line would exceed 2000 characters; although, one could have a 16K buffer, then *if* the last character in that buffer is a CR, read ahead (just a single byte read would be acceptable in that case).

Unfortunately, it seems that it's not only old Mac OS users that have the problem:
<http://stackoverflow.com/questions/10491564/git-and-cr-vs-lf-but-not-crlf>

...That's a Windows user, who seem to be quite lazy - but if he already has a huge repository filled with CR, I do understand why he don't want to change things.
Here, adding scanning for CR later, can still solve his problem, because the files are not modified.

But on Mac OS X, there are also problems. If an application was written long ago (for Mac OS 9 or Carbon), it might still use CR instead of LF, if it's ported.

Anyway, Linus is right about not just popping code in, because someone, somewhere needs it for a single use - even if that is me. ;)
It's not leading to any kind of crashes if there's no scan for CR.
But on the other hand, it saves the community from FAQ about how to get it working, and it's always an advantage not modifying files more than necessary, when dealing with a VCS/SCM.

The safest way is to hunt for the LF, remembering the previous character. The following is just some untested code I wrote as an example on how it could approximately be done - it does not have to use this many lines. This is a buffer-based example:

#define LF 0x0a
#define CR 0x0d

	const char	*buffer;		/* buffer containing entire text file */
	const char	*b;				/* beginning of line */
	const char	*s;				/* source */
	const char	*e;				/* end pointer */
	char		c;
	char		last;
	const char	*l;				/* pointer to last character */

	s = buffer;
	e = &s[length];
	b = s;
	c = 0;
	while(s < e)
	{
		last = c;
		c = *s++;
		if(LF == c)				/* deal with Linux, UNIX and DOS */
		{
			b = l;
			// we have a linefeed -> new line.
			l = s;
			if(CR == last)
			{
				l = s - 1;
			}
			/* line contents are from b to l */
		}
		else if(CR == last)		/* deal with old Mac OS and other weirdos. ;) */
		{
			b = l;
			l = s - 1;
			/* line contents are from b to l */
		}
	}

-As written above, it can be optimized. There's a small bug though; it doesn't scan if the very last character in the file is a CR.


Love
Jens

On Wed, 26 Sep 2012 23:16:38 -0700, Junio C Hamano wrote:
> David Aguilar <davvid@xxxxxxxxx> writes:
> 
>> That said, perhaps the "autocrlf" code is simple enough that it
>> could be easily tweaked to also handle this special case,...
> 
> I wouldn't be surprised if it is quite simple.
> 
> We (actually Linus, IIRC) simply declared from the get-go that it is
> not worth spending any line of code only to worry about pre OSX
> Macintosh when we did the end-of-line stuff, and nobody so far
> showed any need.
> 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]