Re: RFC: error codes on exit

Jeff King <peff@xxxxxxxx> · Thu, 20 May 2021 09:28:30 -0400

On Wed, May 19, 2021 at 04:34:24PM -0700, Jonathan Nieder wrote:

> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
> 
>  a. the user asked to do something that isn't sensible, and we kindly
>     rebuked the user
> 
>  b. we contacted a server, and the server was not happy with our
>     request
> 
>  c. the local Git repository is corrupt
> 
>  d. we ran out of resources (e.g., disk space)
> 
>  e. we encountered an internal error in handling the user's
>     legitimate request

I've run into this problem, too. If you run a website that runs Git
commands on behalf of users and try to get metrics on failing exit
codes, it's hard to tell the difference between "the repo is broken",
"Git has a bug", "the user (or other caller) asked for something
stupid", and "some transient error occurred".

But I'm not sure that even Git can always tell the difference between
those things. Some real-world examples I've run into:

  - "rev-list $oid" can't find object $oid. Is the repo corrupt? Or is
    the caller unreasonable to ask for that object? Or was there a race
    or other transient error which made the object invisible?

  - upload-pack is writing out a packfile, but gets EPIPE. Did the
    network drop out? Or is a Git bug causing one side to break
    protocol?

Some rough categorization may help, but a lot of those cases need the
specific error propagated back to the caller. For instance, the rev-list
example could be FAILED_PRECONDITION in your terminology. But really, we
want to tell the caller "the object you asked for doesn't exist". And
then it can decide if that was user error (somebody hitting a URL for an
object that we have no reason to think exists), or a sign of problems
elsewhere in the system (if we just got $oid from Git, we expect it to
be there).
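
To make that concrete, here is a rough sketch in Python of the
caller-side decision. All of the names are made up for illustration;
nothing here is an existing Git interface. It assumes the caller has
somehow learned that the specific failure was "object missing", and it
then classifies the failure using the one thing only the caller knows:
where $oid came from.

```python
from enum import Enum


class OidSource(Enum):
    """Where the caller got the object id from (hypothetical labels)."""
    USER_URL = "user_url"      # taken from a request URL; may be nonsense
    GIT_OUTPUT = "git_output"  # just read from Git itself; should exist


class Verdict(Enum):
    USER_ERROR = "user_error"
    SYSTEM_PROBLEM = "system_problem"


def classify_missing_object(source: OidSource) -> Verdict:
    """Classify an 'object does not exist' failure by the oid's origin."""
    if source is OidSource.GIT_OUTPUT:
        # Git handed us this oid a moment ago, so its absence points to
        # corruption, a race, or some other problem on our side.
        return Verdict.SYSTEM_PROBLEM
    # Nothing suggested the object should exist; treat it as a bad request.
    return Verdict.USER_ERROR
```

The point is that the same failure from Git maps to two different
verdicts, so a single exit-code category cannot make this decision on
the caller's behalf.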

So it seems like the most useful thing is specific error codes for
specific cases. But it gets very daunting to think about annotating and
communicating each such case (we don't even pass that level of detail
around inside the program in a machine-readable way; scraping stderr is
the best way to figure this stuff out now).
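
For reference, stderr scraping on the caller side looks roughly like
the Python sketch below. The pattern table, the labels, and the helper
names are assumptions for illustration only. The message wording
differs between commands and Git versions, which is exactly why this
approach is fragile.

```python
import os
import re
import subprocess

# Illustrative pattern table, not a complete or authoritative list.
# Each entry maps a stderr pattern to a caller-defined label.
STDERR_PATTERNS = [
    (re.compile(r"^fatal: bad object ", re.MULTILINE), "missing_object"),
]


def classify_stderr(stderr: str) -> str:
    """Return the label of the first matching pattern, else 'unknown'."""
    for pattern, label in STDERR_PATTERNS:
        if pattern.search(stderr):
            return label
    return "unknown"


def run_rev_list(oid: str):
    """Run 'git rev-list' on oid and return (exit code, label)."""
    # Force the C locale so that messages are not translated.
    env = {**os.environ, "LC_ALL": "C"}
    proc = subprocess.run(
        ["git", "rev-list", "--max-count=1", oid],
        capture_output=True,
        text=True,
        env=env,
    )
    label = "ok" if proc.returncode == 0 else classify_stderr(proc.stderr)
    return proc.returncode, label
```

Anything that does not match a known pattern ends up as "unknown",
and in practice that bucket tends to be large.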

I dunno. Maybe a rougher categorization would help your case, but not
mine. But I'm a bit skeptical that we'll have enough coverage of various
conditions to be useful, and that it won't turn into a headache trying
to categorize everything.

-Peff