Linus Torvalds wrote:
> On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
> >
> > Jeff King <peff@xxxxxxxx> wrote on Fri, Apr 14, 2023 at 15:30:
> > >
> > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > > >
> > > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > > together instead of just {data}?
> > >
> > > You'd have to ask Linus for the original reasoning. ;)
>
> I originally thought of the git object store as "tagged pointers".
>
> That actually caused confusion initially when I tried to explain this
> to SCM people, because "tag" means something very different in an SCM
> environment than it means in computer architecture.
>
> And the implication of a tagged pointer is that you have two parts of
> it - the "tag" and the "address". Both are relevant at all points.
>
> This isn't quite as obvious in everyday modern git usage, because a lot
> of uses end up _only_ using the "address" (aka SHA1), but it's very
> much part of the object store design. Internally, the object layout
> never uses just the SHA1, it's all "type:SHA1", even if sometimes the
> types are implied (ie the tree object doesn't spell out "blob", but
> it's still explicit in the mode bits).
>
> This is very very obvious in "git cat-file", which was one of the
> original scripts in the first commit (but even there the tag/type has
> changed meaning over time: the very first version didn't use it as
> input at all, then it started verifying it, and then later it got the
> more subtle context of "peel the tags until you find this type").
>
> You can also see this in the original README (again, go look at that
> first git commit): the README talks about the "tag of their type".
>
> Of course, in practice git then walked away from having to specify the
> type all the time. It started even in that original release, in that
> the HEAD file never contained the type - because it was implicit (a
> HEAD is always a commit).
>
> So we ended up having a lot of situations like that where the actual
> tag part was implicit from context, and these days people basically
> never refer to the "full" object name with tag, but only the SHA1
> address.
>
> So now we have situations where the type really has to be looked up
> dynamically, because it's not explicitly encoded anywhere. While HEAD
> is supposed to always be a commit, other refs can be pretty much
> anything, and can point to a tag object, a commit, a tree or a blob.
> So then you actually have to look up the type based on the address.
>
> End result: these days people don't even think of git objects as
> "tagged pointers". Even internally in git, lots of code just passes
> the "object name" along without any tag/type, just the raw SHA1 / OID.
>
> So that originally "everything is a tagged pointer" is much less true
> than it used to be, and now, instead of having tagged pointers, you
> mostly end up with just "bare pointers" and look up the type
> dynamically from there.
>
> And that "look up the type in the object" is possible because even
> originally, I did *not* want any kind of "object type aliasing".
>
> So even when looking up the object with the full "tag:pointer", the
> encoding of the object itself then also contains that object type, so
> that you can cross-check that you used the right tag.
>
> That said, you *can* see some of the effects of this "tagged pointers"
> in how the internals do things like
>
>     struct commit *commit = lookup_commit(repo, &oid);
>
> which conceptually very much is about tagged pointers. And the fact
> that two objects cannot alias is actually somewhat encoded in that: a
> "struct commit" contains a "struct object" as a member. But so does
> "struct blob" - and the two "struct object" cases are never the same
> "object".
>
> So there's never any worry about "could blob.object be the same object
> as commit.object"?
>
> That is actually inherent in the code, in how "lookup_commit()"
> actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
> on the result.

This explains rather well why the object type is used in the
calculation, and it makes sense. But I don't see anything about the
object size. Isn't that unnecessary?

-- 
Felipe Contreras