On Fri, Jul 15, 2011 at 12:46 AM, Jeff King <peff@xxxxxxxx> wrote: > > So you don't see a difference between storing the information directly > in the commit object, where it affects the sha1 of the commit, and > calculating and storing it somewhere else? Sure, I see the difference. And I think it's uglier to have two different places for required information. > That is what seems ungit to > me. You aren't adding new information to the DAG (note I said "DAG" and > not commit) that is not already there, but you are changing the ids of > commits in the DAG. Umm. It's redundant, but so what? We have tons of redundant information in there already. Those commits are very explicitly using a 40-byte ASCII representation of the 20-byte SHA1 names. The very original deeper object structure is also redundant: we repeat the object size in the object itself, even though it's part of the implicit object format itself. We also very purposefully repeat the type of the object there, even though the type is basically always redundant (in fact, the core git functions require you to give the type of the object as part of the lookup, and will error out if the SHA1 points to the wrong type). That was one of my original design decisions, exactly because I wanted the redundancy for verification. Redundancy isn't a problem. It's a source of sanity checking. I'm not seeing why you are harping on it. I think it's much worse to have the same information in two different places where it can cause inconsistencies that are hard to see and may not be repeatable. If git ever finds the wrong merge base (because, say, the generation numbers are wrong), I want it to be a *repeatable* thing. I want to be able to repeat on the git mailing list "hey, guys, look at what happens when I try to merge commits ABC and XYZ". If you go "yeah, it works for me", then that is bad. What I tried very hard to do in the git data structures is to make them (a) immutable (so the DAG could never have two-way links, for example) and (b) "simple". Right now, we do *have* a "generation number". It's just that it's very easy to corrupt even by mistake. It's called "committer date". We could improve on it. > Are packfiles unclean, or a random hack? How about pack indices? No. Neither of them are unclean or random. The original git design was very much about thinking of the object space as a "filesystem". Now, the original object layout actually used the native OS filesystem, and I naively thought that would be ok. Using aspecialized filesystem instead doesn't really change anything. It's not fundamentally different from the difference between running git on ext3 or btrfs or nfs or whatever. In fact, I think we've had more filesystem-related bugs wrt NFS than we've had with pack-files. The pack indices are actually kind of ugly - and I would have preferred having them in the same file instead of having the worry of consistency across two different files. They *are* the kind of thing that could cause local inconsistency, but they are fairly simple, and they have some serious protection in them (ie they aren't just SHA1'd in themselves, they contain a SHA1 of the pack-file they index in them to make sure that any inconsistency is findable). Again, that's "redundancy". But I consider the packfile/index to be just a filesystem. It really fundamentally *is* that. Partly for that reason, I do think that if the generation count was embedded in the pack-file, that would not be an "ugly" decision. The pack-files have definitely become "core git data structures", and are more than just a local filesystem representation of the objects: they're obviously also the data transport method, even if the rules there are slightly different (no index, thank god, and incomplete "thin" packs). That said, I don't think a generation count necessarily "fits" in the pack-file. They are designed to be incremental, so it's not very natural there. But I do think it would be conceptually prettier to have the "depth of commit" be part of the "filesystem" data than to have it as a separate ad-hoc cache. > Those things rely on the idea that the git DAG is a data model that we > present to the user, but that we're allowed to do things behind the > scenes to make things faster. .. and that is relevant to this discussion exactly *how*? It's not. It's totally irrelevant. I certainly would never walk away from the DAG model. It's a fundamental git decision, and it's the correct one. But it all boils down to one simple issue: we should have added generation counts back in 2005. It's likely the *one* data format decision that I regret. Using commit dates was wrong. Everybody knew it was wrong, but we ended up going with it just to keep the format constant. If I had realized how small the patch was to add generation counters, and that it wouldn't have broken backwards compatibility (ie fsck doesn't start complaining). I would have done it originally, instead of all the crazy hacks we did for commit date verification. And that is what this discussion fundamentally boils down to for me. If we should have fixed it in the original specification, we damn well should fix it today. It's been "ignorable" because it's just not been important enough. But if git now adds a fundamental cache for them, then that information is clearly no longer "not important enough". Linus -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html