Re: RFC: error codes on exit

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 19 May 2021 18:55:16 -0700

Hi,

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@xxxxxxxxx> writes:

>> In order to do this, I would like to annotate "exit" events with a
>> classification of the error.
>
> We already have BUG() for e. and die() for everything else, and
> "everything else" may be overly broad for your purpose.
>
> I am sympathetic to the cause and I agree that introducing a
> finer-grained classification might be a solution.  I however am not
> sure how we can enforce developers to apply such a manually assigned
> "error code" cosistently.

I think two things you're hinting at are "what about maintainability?"
and "what is the migration path?"

I suspect that a number of error paths will remain unclassified for a
long time, possibly indefinitely.  The way I've seen this treated in
other tools is that it's okay for something to show up as an INTERNAL
error if it doesn't happen frequently: sure, that can cause us a bit
of unnecessary worry when it starts occuring more often, but at that
point we're in a good place to replace it with something more
appropriate.

That means that we would still want to keep die() or some equivalent.
That in turn might suggest that the API I suggested is overly verbose;
it might make sense to have a different die()-style helper for each
type of error, matching what we do with die() and BUG().

Side note: you might wonder why keeping die() would even be a
question.  For example, there are all the outstanding patches that
still use die(); changing such a fundamental API would seem to be a
nonstarter.  Fortunately, though, the tools in
contrib/coccinelle/README allow changing an API in three steps:

 1. Introduce the new API.  Keep the old API around for backward
    compatibility.

 2. Add a "pending" coccinelle semantic patch to automatically
    update callers to the new API.  Update existing callers using
    'make coccicheck-pending'.

 3. Remove the old API and mark the semantic patch as no longer
    pending.  Patches using the old API can be fixed using 'make
    coccicheck'

So we can make this decision based on whether the resulting API is one
we like more; in this example, I suspect that keeping die() is
preferable _even though_ it would be possible to remove by staying in
step 2 for a while without too much fuss.

> Just to throw in a totally different alternative to see if it works
> better, I wonder if you can teach die() to report to the trace2
> stream where in the code it was called from and which vintage of Git
> it is running.
>
> The stat collection side that cares about certain class of failures
> can have function that maps "die() at <filename>:<lineno>@<version>"
> to "what kind of die() it is".
>
> E.g.  blame.c:50@v2.32.0-rc0-184-gbbde7e6616" may be BUG(), while
> blame.c:2740@v2.32.0-rc0-184-gbbde7e6616 may be an user-error.

For ad hoc queries, this is a rather nice tool.  Traces already record
filename, version, and line number, though I believe in the die() case
it currently just points to the implementation of die(). ;-)

However, for analysis in aggregate (for example, to define an SLO[1])
that would require us to maintain a database that maps
<filename>:<lineno> to error code.  That database would be essentially
the same as patches to record the error codes, so what it would really
amount to is having these deployments using a permanent fork of Git.
It would also get rid of the chance to discuss and improve common
error paths on-list.

If we expect the error codes to not be useful to anyone else, then
that is the right choice to make (or rather, we'd have to use other
heuristics, such as having the traces record a collection of offsets
in the binary and a build-id so we can key off of stack trace
signatures).  Part of the reason I started this thread is to get a
sense of whether these can be useful to others.

Thanks,
Jonathan

[1] https://sre.google/sre-book/service-level-objectives/