Re: [Question] Can git cat-file have a type filtering option?

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Sun, 16 Apr 2023 06:06:02 -0600

Linus Torvalds wrote:
> On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
> >
> > Jeff King <peff@xxxxxxxx> wrote on Fri, Apr 14, 2023 at 15:30:
> > >
> > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > > >
> > > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > >  together instead of just {data}?
> > >
> > > You'd have to ask Linus for the original reasoning. ;)
> 
> I originally thought of the git object store as "tagged pointers".
> 
> That actually caused confusion initially when I tried to explain this
> to SCM people, because "tag" means something very different in an SCM
> environment than it means in computer architecture.
> 
> And the implication of a tagged pointer is that you have two parts of
> it - the "tag" and the "address". Both are relevant at all points.
> 
> This isn't quite as obvious in everyday modern git usage, because a lot
> of uses end up _only_ using the "address" (aka SHA1), but it's very
> much part of the object store design. Internally, the object layout
> never uses just the SHA1, it's all "type:SHA1", even if sometimes the
> types are implied (ie the tree object doesn't spell out "blob", but
> it's still explicit in the mode bits).
> 
> This is very very obvious in "git cat-file", which was one of the
> original scripts in the first commit (but even there the tag/type has
> changed meaning over time: the very first version didn't use it as
> input at all, then it started verifying it, and then later it got the
> more subtle context of "peel the tags until you find this type").
> 
> You can also see this in the original README (again, go look at that
> first git commit): the README talks about the "tag of their type".
> 
> Of course, in practice git then walked away from having to specify the
> type all the time. It started even in that original release, in that
> the HEAD file never contained the type - because it was implicit (a
> HEAD is always a commit).
> 
> So we ended up having a lot of situations like that where the actual
> tag part was implicit from context, and these days people basically
> never refer to the "full" object name with tag, but only the SHA1
> address.
> 
> So now we have situations where the type really has to be looked up
> dynamically, because it's not explicitly encoded anywhere. While HEAD
> is supposed to always be a commit, other refs can be pretty much
> anything, and can point to a tag object, a commit, a tree or a blob.
> So then you actually have to look up the type based on the address.
> 
> End result: these days people don't even think of git objects as
> "tagged pointers".  Even internally in git, lots of code just passes
> the "object name" along without any tag/type, just the raw SHA1 / OID.
> 
> So that original "everything is a tagged pointer" idea is much less true
> than it used to be, and now, instead of having tagged pointers, you
> mostly end up with just "bare pointers" and look up the type
> dynamically from there.
> 
> And that "look up the type in the object" is possible because even
> originally, I did *not* want any kind of "object type aliasing".
> 
> So even when looking up the object with the full "tag:pointer", the
> encoding of the object itself then also contains that object type, so
> that you can cross-check that you used the right tag.
> 
> That said, you *can* see some of the effects of this "tagged pointers"
> in how the internals do things like
> 
>     struct commit *commit = lookup_commit(repo, &oid);
> 
> which conceptually very much is about tagged pointers. And the fact
> that two objects cannot alias is actually somewhat encoded in that: a
> "struct commit" contains a "struct object" as a member. But so does
> "struct blob" - and the two "struct object" cases are never the same
> "object".
> 
> So there's never any worry about "could blob.object be the same object
> as commit.object"?
> 
> That is actually inherent in the code, in how "lookup_commit()"
> actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
> on the result.

This explains rather well why the object type is used in the calculation, and
it makes sense.

But I don't see anything about the object size. Isn't that unnecessary?
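
For reference, here is a minimal sketch of how the object ID is derived from
{type, size, data} for SHA-1 repositories. The function name git_object_id is
made up for illustration and is not part of git. The header is the type, a
space, the decimal size, and a NUL byte, and it is hashed together with the
data:

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 object ID git would assign to `data` as `obj_type`.

    Hypothetical helper for illustration only; it hashes the loose-object
    header "<type> <size>\\0" followed by the raw object contents.
    """
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Same (empty) payload, different type: the IDs differ, so a blob
    # can never alias a tree.
    print(git_object_id("blob", b""))
    print(git_object_id("tree", b""))
```

The blob case should match what "git hash-object" prints for the same bytes.
The size sits in the header between the type and the data, which is the part
I am asking about.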

-- 
Felipe Contreras