Options like hash-object --literally and cat-file --allow-unknown-type

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 18 May 2021 16:08:18 +0200

[Breaking the In-Reply-To chain per recent discussions on list, would
have been sent out with [1] if not. See
https://lore.kernel.org/git/xmqqo8eem5hr.fsf@gitster.g/ for the
replied-to message in context].

Skip to "Get to the point" for the "tl;dr".

On Fri, Apr 16 2021, Junio C Hamano wrote:

> Han-Wen Nienhuys <hanwen@xxxxxxxxxx> writes:
>
>>> I would think that a better approach here would be to start with some
>>> (per-se unrelated) series to teach update-ref some mode like
>>> hash-object's --literally, i.e. "YOLO this ref update".
>>
>> I disagree.  I think this would be a job better suited to a
>> test-helper. Then we don't put tools into users' hands that
>> potentially corrupt the repository. I don't understand why hash-object
>> --literally is not a test helper either.
>
> As the person who invented the "--literally" option, I'd have to
> agree with this assessment.  It does make life a little bit easier
> for those who hack on Git codebase and reimplementations of it, but
> little practical value for those who use Git every day [*].
>
> If it were a tool to _dump_ the contents of a possibly corrupt
> object that the "--literally" option would have produced to make it
> easier to see by humans, I might be pursuaded to say that such a
> feature may be better in end-user accessible subcommands.  But the
> reason why we invented "--literally" was specifically to create
> corrupt objects in test repository to see end-user accessible tools
> behave, and in hindsight, such a use case shouldn't have been a good
> justification to add the option.
>
> I wasn't following the discussion of this particular patch closely,
> so I do not know what is being discussed is a tool on the diagnosis
> side (for which I may say it is OK to give to end-usres) or on the
> currupting side (for which I would regret to add to end-user tools),
> but hopefully the above would help guiding the decision between the
> two.
>
> Thanks.
>
>
> [Footnote]
>
> * ... for any purpose other than creating a corrupt repository,
>   asking somebody who is or claims to be a Git expert to figure out
>   what is wrong in his or her repository, and either waste expert's
>   time or have fun by watching the expert fail and gets embarrased,
>   that is ;-)

I totally get where you're coming from. We're talking about handing
users a loaded shotgun without the safety pin with --literally.

Having had some time to think about it though (and encountering various
code around it) since this E-Mail I think that on balance we're better
off with it than without it, and should extend it to update-ref et al.

I'm very much cognitively biased on the topic since 100% of my
interaction with these --literally objects is in hacking on git.git
itself, which if anything should serve as an argument not to have this
sort of thing in public tooling.

But I think one thing that makes git great is that we don't have opaque
magic at any level. Sure, we've got complexity, but you can always
devolve thing down to "simple" manipulation of the object store in cases
such as writing refs / loose objects.

So if for nothing else having a tool like hash-object take your raw
input is good for the educational value, even if it does (and should) do
sanity checking by default.

I actually think that like "mktag" it should do even more sanity
checking, I just think there should always be a way to turn that sort of
thing off. At the very lowest level our plumbing shouldn't be caring
about object validation at all. That also neatly allows us to
e.g. support creating an OBJ_NEW with a git version that doesn't
understand such a thing. We're unlikely to add new object types, but I
don't see why it shouldn't be supported.

In any case, to get to the point what prompted me to write this E-Mail
is that I'm annoyed that "cat-file --allow-unknown-type" exists at all,
i.e. I don't see the need for something like a "--literally" (originally
it was "cat-file --literally", see [2]) if we're talking about *reading*
objects, not writing them (hash-object without -w not withstanding).

In re-rolling my fsck fixes for what it does on unknown types[3] I've
ended up with sillyness like replacing the current bad error message
cat-file emits:

    $ git cat-file -t 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: invalid object type

With:

    $ ~/g/git/git cat-file -t 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: object 7a52aafca353d2fd9f3faca575e9fe4fd48619d4 is of unknown type 'bogus', refusing to emit it without --allow-unknown-type

I.e. now the guts of the object-file.c code don't have a way to ferry
that information up. Now it does via a "struct object_info", but (for
compatibility with documented behavior) it still refuses to "really"
print the type without --allow-unknown-type.

I think it should just print it unconditionally, perhaps with a
--no-allow-unknown-type "strict" mode.

We also currently don't support:

    $ git cat-file -p 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: invalid object type

But I've added support for that, we don't need to know about the type
itself to dump the raw content after the header, which is more useful,
and means you can roundtrip hash-object --literally and cat-file.

But most annoying is:

    $ echo 7a52aafca353d2fd9f3faca575e9fe4fd48619d4 | git cat-file --batch
    fatal: invalid object type

Now, looking at the implementation that's really a bug. We document:

    objecttype: The type of the object (the same as cat-file -t reports).

So you might expect it to accept an --allow-unknown-type, but it
doesn't. I could add support for that, but I think it would be stupid.

Why should you need to restart a "cat-file --batch" just because you
encounter a bad object? Just .. print it, we can do that safely. I
really don't see the point of having --allow-unknown-type at all. Ditto
for the --batch-check mode.

I mean, having read all the code I think I know why it's there. I think
It's because there was no way to ferry the information up from
object-file.c before my yet-to-be-submitted patches, so the solution was
to pass down a flag saying "please don't die()".

But is it something that anyone thinks is a good idea in the abstract? I
don't see why I shouldn't just document something like:

    Older versions of "cat-file" used to require an --allow-unknown-type
    flag to emit information about objects of unknown types. This is no
    longer required or the default. If you'd like to die on anything
    except known types (e.g. to die instead of bothering with parsing a
    "bad" type that possibly has spaces in it in the --batch output)
    supply --no-allow-unknown-type.

What do you think?

1.

    > In-reply-to: <xmqqo8eem5hr.fsf@gitster.g>

    > References:
    > <pull.847.v5.git.git.1615580397.gitgitgadget@xxxxxxxxx>
    > <pull.847.v6.git.git.1618255552.gitgitgadget@xxxxxxxxx>
    > <2dc73bf2ec96acae12a17c719083d11401775bc3.1618255553.git.gitgitgadget@xxxxxxxxx>
    > <87im4qejpk.fsf@xxxxxxxxxxxxxxxxxxx>
    > <CAFQ2z_OSbeciLQnoognG+Hh5S1tZTHO6WUviC9h5YMer766k6g@xxxxxxxxxxxxxx>
    > <xmqqo8eem5hr.fsf@gitster.g>

2. https://lore.kernel.org/git/CAPig+cTcAq_p3QXqcG+o1saWZyvDHCW=_JWYn6s7B1L4X5X1cQ@xxxxxxxxxxxxxx/

3. https://lore.kernel.org/git/87im493yos.fsf@xxxxxxxxxxxxxxxxxxx/