Re: mingw, windows, crlf/lf, and git

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 12 Feb 2007 09:20:43 -0800 (PST)

On Sun, 11 Feb 2007, Theodore Tso wrote:
> 
> So this is something that I've tried proposing to the Mercurial
> developers, but it's never been implemented in hg.  It'll be
> interesting to see what the git community thinks.  :-)
> 
> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start.  If you add a file type,
> then you have to support mutating the file type, and some way of
> handling merge conflicts (generally, picking one type or another).

I agree that a file-type approch would work, but I personally think it's 
too inflexible (just cr/lf vs lf? There are tons of other interesting 
issues that are valid). I also think it falls down on another (and in some 
ways much more fundamental problem): these things exist EVEN WHEN THE FILE 
ITSELF DOES NOT EXIST!

In other words, a policy about cr/lf is *not* a policy about actual 
content. It's something much more: it's a policy about representation in 
general, which includes *potential* content. It should obviously take 
effect on "git add" even with content that didn't exist before, and to 
work well, it should do so without the user having to think about it.

Equally importantly, this happens with content that was added by people 
who simply DO NOT CARE. In other words, I think a "file type" thing 
fundamentally cannot work, because under UNIX, it would be stupid and 
pointless, so any project that is maintained under UNIX might _add_ the 
file types, but since they won't matter, they'll inevitably be wrong (ie 
people forgot to mark a binary thing binary, or a text thing as text).

So: file types or attributes are broken. They cannot work well.

But enough on the negative rambling, I do have a positive and constructive 
suggestion, because I actually think I have a great model for it. But I've 
never cared enough (and since the main target would be some windows issue, 
I suspect I never really _will_ care enough) to really worry about it.

Anyway, if somebody really wants to look at this, and wants to create 
something that is actually _usable_, my suggestion is to simply extend on 
the ".gitignore" file approach. The great thing about .gitignore is that

 (a) you can track it like you track any other file

     This makes merges a *lot* easier. You see it as conflicts, you can 
     fix it up, and in general, you can use all the same tools with it as 
     you use with anything else. In contrast, explicit per-file filetypes 
     are _horrible_ for maintenance.

 (b) you can add to it with *patterns*, which is exactly what you want for 
     file types.

     You can do things like

	*.bin: binary
	*: text

     to say "everythgn that matches *.bin is binary, the rest is text", 
     and solves the maintenance issue trivially. Everybody will like it. 
     For the kernel, for example, we'd have a really easy

	Documentation/logo.gif: binary
	*: text

     and that would probably take care of it.

     You can also have a few default file patterns built in, which would 
     take care of it for 99% of all projects without anybody ever having 
     to even think about it - even under DOS.

 (c) it doesn't actually affect database representation, it only changes 
     behaviour for programs, which is also  exactly what you want (if you 
     have per-file "file types", you end up having serious problems at 
     merge time: when I say "affect database representation", I don't mean 
     that I think git cannot change its database, I literally mean at a 
     "higher" level: represening per-file attributes is a DISASTER from a 
     merge situation)

     So not only is it backwards-compatible with traditional git usage, 
     it's much more fundamentally simple: it doesn't add any new core data 
     structures or rules. All the core stays exactly as it is, and it just 
     affects higher-level behaviour. And that's important: one reason git 
     has been so stable is that the really core data structures are really 
     really stable and simple.

     Even when we did *really* core changes like the whole packfile thing, 
     the fundamental data structures didn't change at all *conceptually*.

 (d) it's actually a lot more flexible than file types.

     Merge stategies, anybody? We can easily have the default merge 
     strategy be the normal three-way merge (which is obviously the right 
     thing for almost anything), but how about something like

	*.doc: binary,merge=doc-merge

     which tells git that it should use a separate "doc-merge" program to 
     merge those kinds of files when it needs to do a nontrivial merge..

 (e) exactly like ".gitignore", you should also be able to have a 
     ".git/info/exclude" file that is your _private_ rules, and 
     per-directory ".gitignore" files that are the _hierarchical_ rules.

     This just makes maintenance much simpler. Not one big file that has 
     everything, and that clashes. Make the top-level one contain all the 
     generic default rules, and then lower down we can have more specific 
     rules for very specific things, exactly like the kernel .gitignore 
     files do. The top-level file should *not* have to know all the 
     details of some architecture- or sub-project specific file behaviour.

     Similarly, having an untracked file (.git/info/exclude) allows people 
     to have rules that make sense for *them*, but that might not make 
     sense for the upstream developers (say, somebody crazy enough to 
     develop Linux under Windows). So people can have their purely local 
     rules without forcing them on others.

Anyway, that would be my suggestion. Call it ".gitattributes" or 
something. Make it a nice ASCII format, exactly like .gitignore, and make 
all the rules exactly the same, except it has a ": <attributelist>" at the 
end for each line.

Start off supporting just "binary" and "text", but keep in mind that 
people may want other things. Individualized merge strategies etc.

Also, keep in mind that a *lot* of git operations will work purely on a 
SHA1 level, and those operations fundamentally *will*not*care* about file 
types. So when you merge a file, for example, the initial merge will be 
done purely on SHA1's, and git would do all the normal "if it didn't 
change in branch 1, take the branch 2 version directly" without ever even 
*looking* at any file rules.

This is important, because this is what makes git efficient for large 
projects, and which would allow git to _remain_ efficient even in the face 
of having to read all those comples .gitattributes files. When we merge 
two repositories with 20,000+ files, we usually really only "merge" a 
couple of the files. 

Same goes for "text" mode. The "text" thing would only affect things like 
"git add" etc that use "git-update-index" to calculate the new SHA1. We'd 
never use it "normally". "git diff" would still be instantaneous, because 
the git index shows the file still matches, and that is all done on a SHA1 
only level. So only when you do a "git add" or when it needs to refresh 
the index because the file changed, and it reads in the file, will it 
actually care about whether it's a text or a binary file.

This is actually *exactly* what you want. Not just for performance, but 
simply because this is also how you can take something like the Linux 
archive, and "just use it" under Windows, even if your editor adds (or 
wants) CR/LF.

Btw, how would I implement this? If I really were energetic enough to 
implement it, I would do:

 (a) Add a flag to "git-ls-files" logic to add "type information" in 
     front.

     Not only do you want this *anyway* for other reasons, but for
     binary/text, the thing you actually care most about is "git add", and 
     it already basically just does "take this file pattern, feed it 
     through git-ls-files, and add those files". So you'd get it basically 
     for free.

     It is also fairly easy to add at this stage, because you can simply 
     look for all the places that work with "info/exclude" and 
     ".gitignore", and you know that "Ahh, I need to teach these exact 
     places to understand about attributes". So you'd add an 
     "add_attributes_from_file()" function etc etc.

     Quite straightforward. In fact, you might be able to use the 
     gitignore parsing *as*is*, and just teach it about more flags that 
     just "ignore": both in "struct dir_entry" and in "struct exclude".

 (b) Teach the git-update-index logic about hashing text blobs.

 (c) Profit!

It really should be fairly straightforward. I'm sure it wouldn't be 
*entirely* trivial, but I'm also fairly sure that somebody reasonably 
competent could do it in a couple of days (with testing) if they were just 
sufficiently motivated to get started.

Anybody?

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html