On Sun, 11 Feb 2007, Theodore Tso wrote:
>
> So this is something that I've tried proposing to the Mercurial
> developers, but it's never been implemented in hg.  It'll be
> interesting to see what the git community thinks.  :-)
>
> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start.  If you add a file type,
> then you have to support mutating the file type, and some way of
> handling merge conflicts (generally, picking one type or another).

I agree that a file-type approach would work, but I personally think
it's too inflexible (just cr/lf vs lf? There are tons of other
interesting issues that are valid).

I also think it falls down on another (and in some ways much more
fundamental) problem: these things exist EVEN WHEN THE FILE ITSELF DOES
NOT EXIST!

In other words, a policy about cr/lf is *not* a policy about actual
content. It's something much more: it's a policy about representation
in general, which includes *potential* content. It should obviously
take effect on "git add" even with content that didn't exist before,
and to work well, it should do so without the user having to think
about it.

Equally importantly, this happens with content that was added by people
who simply DO NOT CARE. In other words, I think a "file type" thing
fundamentally cannot work, because under UNIX, it would be stupid and
pointless, so any project that is maintained under UNIX might _add_ the
file types, but since they won't matter, they'll inevitably be wrong
(ie people forgot to mark a binary thing binary, or a text thing as
text).

So: file types or attributes are broken. They cannot work well.

But enough on the negative rambling, I do have a positive and
constructive suggestion, because I actually think I have a great model
for it. But I've never cared enough (and since the main target would be
some windows issue, I suspect I never really _will_ care enough) to
really worry about it.

Anyway, if somebody really wants to look at this, and wants to create
something that is actually _usable_, my suggestion is to simply extend
on the ".gitignore" file approach.

The great thing about .gitignore is that

 (a) you can track it like you track any other file

     This makes merges a *lot* easier. You see it as conflicts, you can
     fix it up, and in general, you can use all the same tools with it
     as you use with anything else. In contrast, explicit per-file
     filetypes are _horrible_ for maintenance.

 (b) you can add to it with *patterns*, which is exactly what you want
     for file types.

     You can do things like

        *.bin: binary
        *: text

     to say "everything that matches *.bin is binary, the rest is
     text", and that solves the maintenance issue trivially. Everybody
     will like it.

     For the kernel, for example, we'd have a really easy

        Documentation/logo.gif: binary
        *: text

     and that would probably take care of it. You can also have a few
     default file patterns built in, which would take care of it for
     99% of all projects without anybody ever having to even think
     about it - even under DOS.

 (c) it doesn't actually affect database representation, it only
     changes behaviour for programs, which is also exactly what you
     want (if you have per-file "file types", you end up having serious
     problems at merge time: when I say "affect database
     representation", I don't mean that I think git cannot change its
     database, I literally mean at a "higher" level: representing
     per-file attributes is a DISASTER from a merge situation)

     So not only is it backwards-compatible with traditional git usage,
     it's much more fundamentally simple: it doesn't add any new core
     data structures or rules. All the core stays exactly as it is, and
     it just affects higher-level behaviour.

     And that's important: one reason git has been so stable is that
     the really core data structures are really really stable and
     simple.
     Even when we did *really* core changes like the whole packfile
     thing, the fundamental data structures didn't change at all
     *conceptually*.

 (d) it's actually a lot more flexible than file types.

     Merge strategies, anybody? We can easily have the default merge
     strategy be the normal three-way merge (which is obviously the
     right thing for almost anything), but how about something like

        *.doc: binary,merge=doc-merge

     which tells git that it should use a separate "doc-merge" program
     to merge those kinds of files when it needs to do a nontrivial
     merge..

 (e) exactly like ".gitignore", you should also be able to have a
     ".git/info/exclude" file that is your _private_ rules, and
     per-directory ".gitignore" files that are the _hierarchical_
     rules.

     This just makes maintenance much simpler. Not one big file that
     has everything, and that clashes. Make the top-level one contain
     all the generic default rules, and then lower down we can have
     more specific rules for very specific things, exactly like the
     kernel .gitignore files do. The top-level file should *not* have
     to know all the details of some architecture- or sub-project
     specific file behaviour.

     Similarly, having an untracked file (.git/info/exclude) allows
     people to have rules that make sense for *them*, but that might
     not make sense for the upstream developers (say, somebody crazy
     enough to develop Linux under Windows). So people can have their
     purely local rules without forcing them on others.

Anyway, that would be my suggestion. Call it ".gitattributes" or
something. Make it a nice ASCII format, exactly like .gitignore, and
make all the rules exactly the same, except it has a
": <attributelist>" at the end for each line. Start off supporting just
"binary" and "text", but keep in mind that people may want other
things. Individualized merge strategies etc.

Also, keep in mind that a *lot* of git operations will work purely on a
SHA1 level, and those operations fundamentally *will*not*care* about
file types.
So when you merge a file, for example, the initial merge will be done
purely on SHA1's, and git would do all the normal "if it didn't change
in branch 1, take the branch 2 version directly" without ever even
*looking* at any file rules.

This is important, because this is what makes git efficient for large
projects, and which would allow git to _remain_ efficient even in the
face of having to read all those complex .gitattributes files. When we
merge two repositories with 20,000+ files, we usually really only
"merge" a couple of the files.

Same goes for "text" mode. The "text" thing would only affect things
like "git add" etc that use "git-update-index" to calculate the new
SHA1. We'd never use it "normally". "git diff" would still be
instantaneous, because the git index shows the file still matches, and
that is all done on a SHA1 only level. So only when you do a "git add"
or when it needs to refresh the index because the file changed, and it
reads in the file, will it actually care about whether it's a text or a
binary file.

This is actually *exactly* what you want. Not just for performance, but
simply because this is also how you can take something like the Linux
archive, and "just use it" under Windows, even if your editor adds (or
wants) CR/LF.

Btw, how would I implement this? If I really were energetic enough to
implement it, I would do:

 (a) Add a flag to "git-ls-files" logic to add "type information" in
     front.

     Not only do you want this *anyway* for other reasons, but for
     binary/text, the thing you actually care most about is "git add",
     and it already basically just does "take this file pattern, feed
     it through git-ls-files, and add those files". So you'd get it
     basically for free.

     It is also fairly easy to add at this stage, because you can
     simply look for all the places that work with "info/exclude" and
     ".gitignore", and you know that "Ahh, I need to teach these exact
     places to understand about attributes".
     So you'd add an "add_attributes_from_file()" function etc etc.
     Quite straightforward. In fact, you might be able to use the
     gitignore parsing *as*is*, and just teach it about more flags than
     just "ignore": both in "struct dir_entry" and in "struct exclude".

 (b) Teach the git-update-index logic about hashing text blobs.

 (c) Profit!

It really should be fairly straightforward. I'm sure it wouldn't be
*entirely* trivial, but I'm also fairly sure that somebody reasonably
competent could do it in a couple of days (with testing) if they were
just sufficiently motivated to get started.

Anybody?

                Linus
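
A minimal sketch of the pattern-rule idea from the message above, in Python rather than git's C, purely for illustration. Everything named here (parse_rules, attributes_for, blob_sha1) is hypothetical and not part of git. It assumes that the first matching rule wins (which the ordering of the "*.bin: binary" / "*: text" examples suggests), that a pattern without a slash is matched against the file's basename (loosely following .gitignore), and that "text" means converting CR/LF to LF before hashing, which the message does not spell out. The "blob <size>\0" header is how git hashes blob objects.

```python
import fnmatch
import hashlib
import posixpath


def parse_rules(text):
    """Parse lines of the form '<pattern>: <attr>[,<attr>...]'."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an attribute list.
        if not line or line.startswith("#") or ":" not in line:
            continue
        pattern, _, attrs = line.rpartition(":")
        attr_list = [a.strip() for a in attrs.split(",") if a.strip()]
        rules.append((pattern.strip(), attr_list))
    return rules


def attributes_for(path, rules):
    """Return the attributes of the first rule matching path (assumption)."""
    for pattern, attrs in rules:
        # Assumption: slash-less patterns match the basename only.
        target = path if "/" in pattern else posixpath.basename(path)
        if fnmatch.fnmatchcase(target, pattern):
            return attrs
    return []


def blob_sha1(data, attrs):
    """Hash file content as a git blob, normalizing line endings for text."""
    if "text" in attrs:
        # Assumed meaning of "text": store LF even if the file has CR/LF.
        data = data.replace(b"\r\n", b"\n")
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


rules = parse_rules("""
*.bin: binary
*.doc: binary,merge=doc-merge
*: text
""")

# The rules are consulted only when content is actually read and hashed
# (the "git add" case); comparisons of existing SHA1s never need them.
attrs = attributes_for("kernel/sched.c", rules)
print(attrs, blob_sha1(b"int x;\r\n", attrs))
```

With rules like these, a text file checked out with CR/LF hashes to the same blob as its LF form, while anything marked binary is hashed byte for byte.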