Re: [RFC PATCH 19/21] usage API: use C99 macros for {usage,usagef,die,error,warning,die}*()

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 29 Dec 2021 00:42:48 +0100

On Tue, Dec 28 2021, Jeff Hostetler wrote:

> On 12/27/21 6:01 PM, Ævar Arnfjörð Bjarmason wrote:
>> On Mon, Dec 27 2021, Jeff Hostetler wrote:
>> 
>>> On 11/15/21 5:18 PM, Ævar Arnfjörð Bjarmason wrote:
>>> [...]
>>>> It might be a good change to remove the "fmt" key from the "error"
>>>> events as a follow-up change. As these few examples from running the
>>>> test suite show it's sometimes redundant (same as the "msg"), rather
>>>> useless (just a "%s"), or something we could now mostly aggregate by
>>>> file/line instead of the normalized printf format:
>>>>         1 file":"builtin/gc.c","line":1391,"msg":"'bogus' is not a
>>>> valid task","fmt":"'%s' is not a valid task"}
>>>>         1 file":"builtin/for-each-ref.c","line":89,"msg":"format: %(then) atom used more than once","fmt":"%s"}
>>>>         1 file":"builtin/fast-import.c","line":411,"msg":"Garbage after mark: N :202 :302x","fmt":"Garbage after mark: %s"}
>>>> "Mostly" here assumes that it would be OK if the aggregation changed
>>>> between git versions, which may be what all users of trace2 want. The
>>>> change that introduced the "fmt" code was ee4512ed481 (trace2: create
>>>> new combined trace facility, 2019-02-22), and the documentation change
>>>> was e544221d97a (trace2: Documentation/technical/api-trace2.txt,
>>>> 2019-02-22).
>>>> Both are rather vague on what problem "fmt" solved exactly, aside
>>>> from
>>>> the obvious one of it being impossible to do meaningful aggregations
>>>> due to the "file" and "line" being the same everywhere, which isn't
>>>> the case now.
>>>> In any case, let's leave "fmt" be for now, the above summary was
>>>> given
>>>> in case it's interesting to remove it in the future, e.g. to save
>>>> space in trace2 payloads.
>>>
>>> I added the "fmt" field so that we could do aggregations
>>> of error messages across multiple users without regard
>>> to what branch or filename or percentage or whatever was
>>> formatted into the actual "msg" written to stderr.
>>>
>>> The actual file:line wasn't useful (primarily because it
>>> was probably something in usage.c), but even if we fix that
>>> it might not be useful if we have users running 10 different
>>> versions of Git (because some people don't upgrade immediately).
>>>
>>> So I'd rather not kill it right now.
>> Thanks. I'm not trying to kill it, but just poking at what it was
>> for
>> exactly.
>> Depending on the answer to that perhaps we didn't need it anymore,
>> but
>> the explanation you provide (mostly) makes sense.
>> The "mostly" being because I'm assuming that you only need to deal
>> with
>> LC_ALL=C users?
>> I.e. the documented promise that you can group things by "fmt"
>> doesn't
>> hold if you're processing even streams from users who are using a
>> translated git, because we'll get the translated format string, not the
>> original.
>
> I just did a query on the data we've collected over the last
> few weeks and there are only English error messages in the
> database, so yes LC_ALL=C seems to be the norm.

Ah, that explains it.

>> For that we'd need to change the API from/to to:
>>      - error(_("some format %s"), ...)
>>      + error(N_("some format %s"), ...)
>
> So no, I don't think it is worth the complexity to change
> this.  Besides, wouldn't you need to more machinery under
> the hood -- to emit the untranslated string to trace2 and
> the translated string to stderr?  (As in, move the translation
> down a layer??)

Yes, hence N_(), it marks the string for translation, but doesn't
translate it. So the underlying function would call _() for what we emit
to stderr, but not for the "fmt" field.

> My "fmt" field is not worth that effort.

Probably not to you because you're deploying git into a monolingual
environment, but for anyone who is they'd need to maintain manual
groupings of messages by scraping our po/*.po files.

> And besides, my goal was only to get the "top 10 or 20 errors"
> across a large set of users.  I guess I'm thinking of it as a
> sample rather than an exhaustive list, so it is OK if we don't
> capture the translated strings.

I suppose with enough users it wouldn't matter either way, but you'd get
quite a bit of fragmentation. You'd also have errors that don't differ
between translations (e.g. those "%s" cases) amplified in count, as they
won't fragment due to i18n.

> Something else to consider is the GDPR.  The "fmt" string is
> generic (e.g. "path '%s' exists on disk, but not in the index")
> but doesn't leak an PII or otherwise sensitive data.  Whereas
> the corresponding "msg" field does include the pathname in this
> example.  So if someone is post-processing the data and aggregating,
> they may want to relay only the "fmt" field and not the "msg" field
> to their database.

Indeed.

> (Granted, there are lots of PII and GDPR problematic fields in
> the data stream that a post-processor would need to be aware of,
> but all of that is outside of the scope of the Trace2 logging.
> I only mention it here because the "fmt" field may be useful for
> reasons not previously discussed.)

I do think it would be a good thing to work on, i.e. to make the logging
less verbose, which as a side-effect would make it GDPR compliant.