On Tue, 24 Jul 2007, Nix wrote: > > On 23 Jul 2007, Linus Torvalds spake thusly: > > So practically speaking, you want to track the *minimal* possible state, > > not the maximal one. > > I think it depends on your use case. For source code and indeed anything > with heavy merges, this is true Yes, very obviously. Git is targeted towards source code and working in a distributed manner across a very wide variety of users and setups, while something that would be more targeted towards a special scenario and much stricter usage would find that the "minimum" set is much bigger, and might well include ACL's and usr information. > but I'm increasingly using git as a sort of `merged historical tar' to > store images of entire random filesystem trees across time, and gaining > the benefit of the packer's lovely space-efficiency as well (doing this > with svn would be a lost cause, twice the space usage before you even > think about the repository). And in that case, preserving everything you > can makes sense. On the other hand, almost all the space-efficiency comes from things that delta well, and change quickly. That includes the file data itself (and very much the tree contents), but it doesn't necessarily include things like permissions and user information - mainly because that doesn't actually delta at all (not because it can't, but because it hardly ever changes, and when it does change, it often changes all over the map). To make an example of your "tar" situation: if you want to be space- efficient in a tar-like setting, you should *not* make user information be something that is per-file at all! Why? Because in 99% of all tar-files, there is a single user name. So even your usage *may* actually be much better off using git as a "data backend", and using something totally different for "user/group" information. Yes, you'd have to make a "shim layer" on top of git to hide the fact that the user information is handled separately, but that shouldn't be that hard per se. > (Perhaps what I should be doing is tarring the directory tree up and > storing the *tarball* in git. I'll try that and see what it does to pack > sizes. These are version-controlled backups of my mother's magnum opus > in progress so you can understand that I don't want to destroy them > accidentally: I'd never hear the end of it! ;) ) You don't want to do this. There's a few reasons, but the two big ones are: - the git delta logic is strictly a "single delta base" thing. Yes, git would be able to find the delta's between two tar-files (as long as you don't compress them), and express one tar-file in terms of the other, and it would probably save a fair amount of disk. But it would not be able to do _nearly_ as well as it can if you store individual files, and let git just find the best delta per-file (and not just "one delta base for the whole tar-ball") - git is very much optimized for "many small files". Yes, you can check in large files, and it works fine, but quite frankly, all the design and heavy optimizations have been about having trees with tens of thousands of files, but the files individually reasonably small. A lot of the speed advantages of git come from efficiently pruning away whole sub-directory structures, for example, and not even touching the data at all! So if you track just one file that changes in every version, all the things that make git fly are basically disabled, and you won't take full advantage of what git does. > Yes indeed: that's why I proposed doing this using a couple of new hooks > driving entirely optional permissions-preservation stuff. Most use cases > really won't want to track this, so this sort of stuff shouldn't impose > upon the git core or upon anyone who doesn't want it. (However, the > ability to have alternative file merging strategies *may* be useful > elsewhere, perhaps.) The ".gitattributes" file really could be used for some of that. Using it to track ownership and full permissions would not be impossible, and it could have interesting semantics (especially as .gitattibutes is path pattern based - so you could literally do a "user" attribute, and say that everything in a particular subdirectory is owned by a particular user). That wouldn't be UNIX-like semantics, of course, but it can be very useful for certain things. Taking an example of something totally independent of git, look at how "udev" handles permissions, for example. In situations like that, static user information is useless, and it actually ends up setting up modes and ownership based on name-based patterns rather than having each file have a permission/user (because individual files appear and disappear, the name-based patterns are the things that matter). So if you *just* want to track a regular filesystem layout, that's not the right thing, but "udev" does show an example of a totally different way of describing ownership and permissions, and one which wouldn't actually be at all foreign to git. Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html