On Tue, Dec 28 2021, Jeff Hostetler wrote: > On 12/27/21 6:01 PM, Ævar Arnfjörð Bjarmason wrote: >> On Mon, Dec 27 2021, Jeff Hostetler wrote: >> >>> On 11/15/21 5:18 PM, Ævar Arnfjörð Bjarmason wrote: >>> [...] >>>> It might be a good change to remove the "fmt" key from the "error" >>>> events as a follow-up change. As these few examples from running the >>>> test suite show it's sometimes redundant (same as the "msg"), rather >>>> useless (just a "%s"), or something we could now mostly aggregate by >>>> file/line instead of the normalized printf format: >>>> 1 file":"builtin/gc.c","line":1391,"msg":"'bogus' is not a >>>> valid task","fmt":"'%s' is not a valid task"} >>>> 1 file":"builtin/for-each-ref.c","line":89,"msg":"format: %(then) atom used more than once","fmt":"%s"} >>>> 1 file":"builtin/fast-import.c","line":411,"msg":"Garbage after mark: N :202 :302x","fmt":"Garbage after mark: %s"} >>>> "Mostly" here assumes that it would be OK if the aggregation changed >>>> between git versions, which may be what all users of trace2 want. The >>>> change that introduced the "fmt" code was ee4512ed481 (trace2: create >>>> new combined trace facility, 2019-02-22), and the documentation change >>>> was e544221d97a (trace2: Documentation/technical/api-trace2.txt, >>>> 2019-02-22). >>>> Both are rather vague on what problem "fmt" solved exactly, aside >>>> from >>>> the obvious one of it being impossible to do meaningful aggregations >>>> due to the "file" and "line" being the same everywhere, which isn't >>>> the case now. >>>> In any case, let's leave "fmt" be for now, the above summary was >>>> given >>>> in case it's interesting to remove it in the future, e.g. to save >>>> space in trace2 payloads. >>> >>> I added the "fmt" field so that we could do aggregations >>> of error messages across multiple users without regard >>> to what branch or filename or percentage or whatever was >>> formatted into the actual "msg" written to stderr. >>> >>> The actual file:line wasn't useful (primarily because it >>> was probably something in usage.c), but even if we fix that >>> it might not be useful if we have users running 10 different >>> versions of Git (because some people don't upgrade immediately). >>> >>> So I'd rather not kill it right now. >> Thanks. I'm not trying to kill it, but just poking at what it was >> for >> exactly. >> Depending on the answer to that perhaps we didn't need it anymore, >> but >> the explanation you provide (mostly) makes sense. >> The "mostly" being because I'm assuming that you only need to deal >> with >> LC_ALL=C users? >> I.e. the documented promise that you can group things by "fmt" >> doesn't >> hold if you're processing even streams from users who are using a >> translated git, because we'll get the translated format string, not the >> original. > > I just did a query on the data we've collected over the last > few weeks and there are only English error messages in the > database, so yes LC_ALL=C seems to be the norm. Ah, that explains it. >> For that we'd need to change the API from/to to: >> - error(_("some format %s"), ...) >> + error(N_("some format %s"), ...) > > So no, I don't think it is worth the complexity to change > this. Besides, wouldn't you need to more machinery under > the hood -- to emit the untranslated string to trace2 and > the translated string to stderr? (As in, move the translation > down a layer??) Yes, hence N_(), it marks the string for translation, but doesn't translate it. So the underlying function would call _() for what we emit to stderr, but not for the "fmt" field. > My "fmt" field is not worth that effort. Probably not to you because you're deploying git into a monolingual environment, but for anyone who is they'd need to maintain manual groupings of messages by scraping our po/*.po files. > And besides, my goal was only to get the "top 10 or 20 errors" > across a large set of users. I guess I'm thinking of it as a > sample rather than an exhaustive list, so it is OK if we don't > capture the translated strings. I suppose with enough users it wouldn't matter either way, but you'd get quite a bit of fragmentation. You'd also have errors that don't differ between translations (e.g. those "%s" cases) amplified in count, as they won't fragment due to i18n. > Something else to consider is the GDPR. The "fmt" string is > generic (e.g. "path '%s' exists on disk, but not in the index") > but doesn't leak an PII or otherwise sensitive data. Whereas > the corresponding "msg" field does include the pathname in this > example. So if someone is post-processing the data and aggregating, > they may want to relay only the "fmt" field and not the "msg" field > to their database. Indeed. > (Granted, there are lots of PII and GDPR problematic fields in > the data stream that a post-processor would need to be aware of, > but all of that is outside of the scope of the Trace2 logging. > I only mention it here because the "fmt" field may be useful for > reasons not previously discussed.) I do think it would be a good thing to work on, i.e. to make the logging less verbose, which as a side-effect would make it GDPR compliant.