On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
>
> Jeff King <peff@xxxxxxxx> wrote on Fri, Apr 14, 2023 at 15:30:
> >
> > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> >
> > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > together instead of just {data}?
> >
> > You'd have to ask Linus for the original reasoning. ;)

I originally thought of the git object store as "tagged pointers".

That actually caused confusion initially when I tried to explain this
to SCM people, because "tag" means something very different in an SCM
environment than it means in computer architecture.

And the implication of a tagged pointer is that you have two parts of
it - the "tag" and the "address". Both are relevant at all points.

This isn't quite as obvious in everyday modern git usage, because a lot
of uses end up _only_ using the "address" (aka SHA1), but it's very
much part of the object store design. Internally, the object layout
never uses just the SHA1, it's all "type:SHA1", even if sometimes the
types are implied (ie the tree object doesn't spell out "blob", but
it's still explicit in the mode bits).

This is very very obvious in "git cat-file", which was one of the
original scripts in the first commit (but even there the tag/type has
changed meaning over time: the very first version didn't use it as
input at all, then it started verifying it, and then later it got the
more subtle context of "peel the tags until you find this type").

You can also see this in the original README (again, go look at that
first git commit): the README talks about the "tag of their type".

Of course, in practice git then walked away from having to specify the
type all the time. It started even in that original release, in that
the HEAD file never contained the type - because it was implicit (a
HEAD is always a commit).

So we ended up having a lot of situations like that where the actual
tag part was implicit from context, and these days people basically
never refer to the "full" object name with tag, but only the SHA1
address.

So now we have situations where the type really has to be looked up
dynamically, because it's not explicitly encoded anywhere. While HEAD
is supposed to always be a commit, other refs can be pretty much
anything, and can point to a tag object, a commit, a tree or a blob.
So then you actually have to look up the type based on the address.

End result: these days people don't even think of git objects as
"tagged pointers". Even internally in git, lots of code just passes
the "object name" along without any tag/type, just the raw SHA1 / OID.

So that original "everything is a tagged pointer" idea is much less
true than it used to be, and now, instead of having tagged pointers,
you mostly end up with just "bare pointers" and look up the type
dynamically from there.

And that "look up the type in the object" is possible because even
originally, I did *not* want any kind of "object type aliasing".

So even when looking up the object with the full "tag:pointer", the
encoding of the object itself then also contains that object type, so
that you can cross-check that you used the right tag.

That said, you *can* see some of the effects of this "tagged pointers"
design in how the internals do things like

    struct commit *commit = lookup_commit(repo, &oid);

which conceptually very much is about tagged pointers. And the fact
that two objects cannot alias is actually somewhat encoded in that: a
"struct commit" contains a "struct object" as a member. But so does
"struct blob" - and the two "struct object" cases are never the same
"object".

So there's never any worry about "could blob.object be the same object
as commit.object"?

That is actually inherent in the code, in how "lookup_commit()"
actually does lookup_object() and then does
object_as_type(OBJ_COMMIT) on the result.

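As a toy illustration of that lookup_object() followed by object_as_type() pattern, here is a minimal, self-contained C sketch. It is not git's code: every toy_ name, the fixed-size table and the use of plain strings as object names are assumptions made up for the example. Only the shape of the check follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy types; git's real enum has more members. */
enum toy_type { TOY_NONE = 0, TOY_COMMIT, TOY_BLOB };

/* The embedded "struct object": address (name) plus tag (type). */
struct toy_object {
    int in_use;
    enum toy_type type;
    char name[41];              /* room for a 40-char hex name + NUL */
};

/* Each typed struct embeds the generic object as its first member. */
struct toy_commit { struct toy_object object; };
struct toy_blob   { struct toy_object object; };

/* One slot per object name, so one name maps to exactly one object. */
union toy_any {
    struct toy_object object;
    struct toy_commit commit;
    struct toy_blob blob;
};

#define TOY_TABLE_SIZE 16
static union toy_any toy_table[TOY_TABLE_SIZE];

/* Untyped lookup by name only (the "bare pointer" case). */
static struct toy_object *toy_lookup_object(const char *name)
{
    size_t i;

    for (i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_object *obj = &toy_table[i].object;
        if (obj->in_use && !strcmp(obj->name, name))
            return obj;
    }
    return NULL;
}

/* Claim a free slot for a name that has not been seen before. */
static struct toy_object *toy_create_object(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_object *obj = &toy_table[i].object;
        if (!obj->in_use) {
            obj->in_use = 1;
            obj->type = TOY_NONE;
            strncpy(obj->name, name, sizeof(obj->name) - 1);
            obj->name[sizeof(obj->name) - 1] = '\0';
            return obj;
        }
    }
    return NULL;                /* table full */
}

/* The tag check: an untyped object takes the requested type, a typed
 * object is returned only if its tag matches the request. */
static void *toy_object_as_type(struct toy_object *obj, enum toy_type type)
{
    if (!obj)
        return NULL;
    if (obj->type == TOY_NONE)
        obj->type = type;
    return obj->type == type ? obj : NULL;
}

/* Typed lookups: find or create by name, then verify the tag. */
struct toy_commit *toy_lookup_commit(const char *name)
{
    struct toy_object *obj = toy_lookup_object(name);

    if (!obj)
        obj = toy_create_object(name);
    return toy_object_as_type(obj, TOY_COMMIT);
}

struct toy_blob *toy_lookup_blob(const char *name)
{
    struct toy_object *obj = toy_lookup_object(name);

    if (!obj)
        obj = toy_create_object(name);
    return toy_object_as_type(obj, TOY_BLOB);
}
```

In this sketch, once a name has been looked up as a commit, asking for the same name as a blob returns NULL rather than a second view of the same slot, so a toy_blob's embedded object can never be the same object as a toy_commit's.
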
> Oh, you are right, this could be to prevent conflicts between Git objects
> with identical content but different types. However, I always associate
> Git with the file system, where metadata such as file type and size is
> stored in the inode, while the file data is stored in separate chunks.

See above: yes, git design was *also* influenced heavily by
filesystems, but that was mostly in the sense of "this is how to
encode these things without undue pain".

The object database being immutable was partly a security and safety
measure, but it was also very much partly a "rewriting files is going
to be a major pain from a filesystem consistency standpoint - don't do
it".

But even more than a filesystem design, it's a "computer architecture"
design. Think of the git object store as a very abstract computer
architecture that has tagged pointers, stable storage, and no
aliasing - and where the tag is actually verified at each lookup.

The "no aliasing" means that no two distinct pointers can point to the
same data. So a tagged pointer of type "commit" cannot point to the
same object as a tagged pointer of type "blob". They are distinct
pointers, even if (maybe) the commit object encoding ends up then
being identical to a blob object.

And as mentioned, that "verified at each lookup" has mostly gone away,
and "each lookup" has become more of a "can be verified by fsck", but
it's probably still a good thing to think that way.

You still have "lookup_object_by_type()" internally in git that takes
the full tagged pointer, but almost nobody uses it any more. The
closest you get is those "lookup_commit()" things (which are fairly
common, still).

Linus