Oddities in the .mailmap parser & future syntax extensions

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 10 Sep 2021 18:48:26 +0200

[Changed subject]

On Fri, Sep 10 2021, Gwyneth Morgan wrote:

> On 2021-09-10 14:02:36+0100, Fangyi Zhou wrote:
>> Similar to a35b13fce0 (Update .mailmap, 2018-11-09).
>> 
>> This patch makes the output of `git shortlog -nse v2.10.0..master`
>> duplicate-free by taking/guessing the current and preferred
>> addresses for authors that appear with more than one address.
>
> The line for Jessica Clarke should probably just be
>
> Jessica Clarke <jrtc27@xxxxxxxxxx>
>
> That works the same and doesn't put a reference to an old name.

It does work exactly the same!

More specifically, this is an unintentional bug/misfeature/looseness in
the .mailmap parser. An entry like:

    Foo <foo@xxxxxxxxxxx> Bar

Is exactly equivalent to:

    Foo <foo@xxxxxxxxxxx>

I.e. we simply ignore the " Bar" part. The reason for this is that we're
internally treating nonsense input as if the line simply ended there.
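
A rough sketch of that behavior, as a simplified Python model rather
than the actual C code in mailmap.c. The function names and the
example.com address are made up for illustration, and the model only
covers the "find a <...> pair, then find another" logic:

```python
def parse_name_and_email(text, allow_empty_email):
    """Return (name, email, rest), or None if there is no <...> pair."""
    lt = text.find("<")
    if lt < 0:
        return None
    gt = text.find(">", lt)
    if gt < 0:
        return None
    if not allow_empty_email and gt == lt + 1:
        return None
    name = text[:lt].strip() or None   # empty name counts as "no name"
    email = text[lt + 1:gt] or None
    return name, email, text[gt + 1:]


def parse_line(line):
    """Return (name1, email1, name2, email2), or None for skipped lines."""
    if line.startswith("#"):
        return None
    first = parse_name_and_email(line, allow_empty_email=False)
    if first is None:
        return None
    name1, email1, rest = first
    name2 = email2 = None
    second = parse_name_and_email(rest, allow_empty_email=True)
    if second is not None:
        name2, email2, _ignored = second
    # If no second <...> pair was found, whatever is left in "rest"
    # (e.g. " Bar") is silently dropped, as if the line had ended.
    return name1, email1, name2, email2


with_trailing = parse_line("Foo <foo@example.com> Bar")
without_trailing = parse_line("Foo <foo@example.com>")
print(with_trailing == without_trailing)
```

In this model the trailing " Bar" never reaches the caller, because
the second lookup finds no "<" and gives up.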

Even having documented and tested some of this recently in 05b5ff219c2
(mailmap doc + tests: add better examples & test them, 2021-01-12) I
found this a bit surprising. I probably found out at the time, but
forgot and had to go source spelunking again.

I'd expect:

    Foo <foo@xxxxxxxxxxx> Bar

To be an alias/shorthand for:

    Foo <foo@xxxxxxxxxxx> Bar <foo@xxxxxxxxxxx>

Which is something that might be applicable / useful in some
cases.

E.g. a name might change over time from "Foo", to "Bar", to "Zar", but
just because we're at "Bar" and want to map "Foo" to "Bar", that might
not mean that we'd like to map any future name at the same address
(i.e. the future "Zar") to the same "Foo".
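
To make the shorthand I'd expect concrete, here is a hypothetical
pre-processing step in Python. Nothing like expand_shorthand() exists
in git; the name, the regex and the example.com address are
assumptions for illustration only:

```python
import re

# "Name <email> OtherName" with no second <...> pair and no comment.
_SHORTHAND = re.compile(
    r"^\s*([^<>]*?)\s*<([^<>]+)>\s*([^<>#\s][^<>]*?)\s*$"
)


def expand_shorthand(line):
    """Rewrite 'Foo <addr> Bar' as 'Foo <addr> Bar <addr>'.

    Lines that do not have that exact shape are returned unchanged.
    """
    m = _SHORTHAND.match(line)
    if m is None:
        return line
    new_name, email, old_name = m.groups()
    return f"{new_name} <{email}> {old_name} <{email}>".strip()


print(expand_shorthand("Foo <foo@example.com> Bar"))
```

With that reading, only commits recorded as "Bar" at that address
would be rewritten to "Foo"; a later "Zar" at the same address would
be left alone.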

In practice I suspect that's more commonly what people do want to do.
Maybe we should warn about it; I did mean to hook some pedantic mode of
the parser up to git-fsck at some point.

More annoying is that this:

    New <foo@xxxxxxxxxxx> <bar@xxxxxxxxxxx>
    <foo@xxxxxxxxxxx> <zar@xxxxxxxxxxx>

Doesn't mean the same as:

    New <foo@xxxxxxxxxxx> <bar@xxxxxxxxxxx>
    New <foo@xxxxxxxxxxx> <zar@xxxxxxxxxxx>

I.e. I'd expect the name to map to the empty string, *unless* we saw an
earlier address, i.e. just as we do for the first bar -> foo line (we
map it to a name of "New", we don't map it to an empty name).

So that's some #leftoverbits, perhaps someone somewhere relies on that,
but it seems like an obvious shorthand to have. I can't imagine it being
useful to map to empty names, and much of e.g. git.git's mailmap is
repeated entries with the same name over and over again.
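
A small Python sketch of the difference between the two files above,
and of the shorthand. This is a toy model keyed only on the old
address, not git's real lookup; build_map(), resolve() and
inherit_names() are hypothetical names, as are the example.com
addresses:

```python
def build_map(entries):
    """entries: iterable of (new_name or None, new_email, old_email)."""
    mapping = {}
    for new_name, new_email, old_email in entries:
        mapping[old_email.lower()] = (new_name, new_email)
    return mapping


def resolve(mapping, name, email):
    """Map a commit's (name, email); missing parts keep the commit's value."""
    entry = mapping.get(email.lower())
    if entry is None:
        return name, email
    new_name, new_email = entry
    return new_name or name, new_email or email


def inherit_names(entries):
    """Hypothetical shorthand: a nameless entry reuses the name given
    on an earlier entry with the same new address."""
    seen = {}
    out = []
    for new_name, new_email, old_email in entries:
        key = new_email.lower()
        if new_name is None:
            new_name = seen.get(key)
        else:
            seen[key] = new_name
        out.append((new_name, new_email, old_email))
    return out


short_form = [
    ("New", "foo@example.com", "bar@example.com"),
    (None, "foo@example.com", "zar@example.com"),
]

# Today: only the address is rewritten, the commit's own name survives.
print(resolve(build_map(short_form), "Old", "zar@example.com"))
# With the shorthand: the name from the earlier line is carried over.
print(resolve(build_map(inherit_names(short_form)), "Old", "zar@example.com"))
```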

I suppose we could also extend it to new syntax such as:

    New <foo@xxxxxxxxxxx> <bar@xxxxxxxxxxx> <zar@xxxxxxxxxxx>

Doing that would be strictly backwards compatible, since right now we
entirely ignore the 3rd E-Mail address. That does mean we also
accidentally support things like:

    New <foo@xxxxxxxxxxx> <bar@xxxxxxxxxxx> # A comment, because we ignore everything after the 2nd address

But don't tell anyone I told you that :) It is something that might
technically have inadvertently closed the door to future syntax
extensions, but we could probably do them anyway, or at worst have some
heuristic.

Another useful thing might be to support:

    New <> Old <>

As an explicit mapping of the name "Old" wherever we see it to "New", or:

    New <> Old <>

To change just the name "Old" to "New" everywhere, without considering
the E-Mail address. Both of those are probably too crazy to be useful,
especially since if we supported that we'd logically also support:

    New <> <>

To assign all the commits to the name "New", but retain the address.