Re: Dealing with corporate email recycling

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sun, 13 Mar 2022 16:51:05 +0100

On Sat, Mar 12 2022, Sean Allred wrote:

> We are currently replaying a 15-year SVN history into Git -- with
> contributions from thousands of developers -- and are faced with the
> challenge of corporate email recycling, departures, re-hires, and name
> changes causing identity issues.
>
> * Background
>
> As you know (also to validate my own knowledge/assumptions), a Git
> commit stores identity as a name and an email.  The only means to
> validate this information is via signing; commits are otherwise taken
> at face-value.  This seems pretty core to Git's decentralized design.
> So to identify who is responsible for a commit, you have only the two
> name+email pairs.
>
> The problem in a nutshell: names and emails change over time.  The
> simple cases can be handled by gitmailmap, but there are more
> challenging cases:
>
>   - A commit author might have had some email <one@xxxxxxxx>, but then
>     was able to 'upgrade' to <two@xxxxxxxx> after a departure.
>
>   - It's even possible that this departure might 'boomerang' and
>     return to their old job, albeit now with a different email (since
>     they forfeited <two@xxxxxxxx> upon departure).
> [...skip a bunch of details...]
> You're at the end now; thanks for reading :-)

Aside from technical solutions and twists on mailmap, you haven't
*really* described what practical problem you're facing here.

> Somewhere down the line, Ada has left, <foo@xxxxxxxxxxx> transferred to
> Roy, and he wrote the following commit:

I.e. this, sure, that can happen, but what's the negative effect of that
in practice?

I've been involved in similar migrations in the past, and the primary
way to deal with it was to mostly ignore it, especially in a corporate
setting.

I.e. sure, you'll have some edge cases here and there, but the value of
knowing exactly who authored something tends to be proportional to how
recent the commit is.

If someone wrote something 10 years ago they're probably not even
working there anymore, or if they are, they'll have long since forgotten
what they'd need to know to answer any specific questions.

The only people who tend to look at it are developers using "git blame"
or something, and usually humans are smart enough to spot that even if
it's foo@xxxxxxxxxxx they were expecting Roy, not Ada, or the other way
around.

Side note: To the extent that I've had to deal with this (in a corporate
setting) I found myself wanting git to have the exact opposite,
i.e. some feature where we'd just hide the author for any work
that's >5 years old or whatever.

Not for any privacy reason, but just because some UIs wouldn't really
communicate (in a way that people actually noticed) that the relevant
work was ancient, and someone who'd long since moved on would get
occasional interruptions due to ancient code they wrote but weren't
equipped to currently maintain.

Or similarly, to have "git blame" attribute anything >N years old to the
team currently maintaining that thing, not to the person.
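
To make that wish concrete, here is a minimal sketch of the rule in
Python. Nothing like this exists in git; the function name, the 5-year
default and the idea of passing in a "team" string are all assumptions
made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def blame_label(author, commit_time, team, now=None, max_age_years=5):
    """Pick who to show for a blamed line (hypothetical helper).

    Commits older than max_age_years are attributed to the maintaining
    team instead of the original author.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Approximate a year as 365 days; good enough for a UI hint.
    cutoff = now - timedelta(days=365 * max_age_years)
    return team if commit_time < cutoff else author
```

A UI layered on top of "git blame" could call something like this per
line, using the commit's author date as commit_time.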

But I digress.

Having said that I think if you do need such a back-annotated history
you should look into "git notes" and/or "git replace". I.e. you could
have some lookup system maintain a mapping from OIDs to current IDs.

I've implemented a system like that in the past (in a MySQL table, but
whatever). I'd think this use-case of perfectly annotated old history is
probably obscure enough that that's the primary thing we should steer
people towards...
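
For illustration, a lookup system of that kind could be sketched roughly
as below. This is not the system I mentioned above: it uses Python's
sqlite3 module as a stand-in for the MySQL table so that it's
self-contained, and the table name, column names and helper functions
are all hypothetical.

```python
import sqlite3


def open_identity_db(path=":memory:"):
    """Open (or create) the hypothetical OID -> current identity table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS commit_identity ("
        " oid TEXT PRIMARY KEY,"
        " current_name TEXT NOT NULL,"
        " current_email TEXT NOT NULL)"
    )
    return db


def annotate(db, oid, name, email):
    """Record who a commit should be attributed to today."""
    # INSERT OR REPLACE so a later correction overwrites the old row.
    db.execute(
        "INSERT OR REPLACE INTO commit_identity VALUES (?, ?, ?)",
        (oid, name, email),
    )
    db.commit()


def resolve(db, oid, recorded_name, recorded_email):
    """Return the annotated identity, or what the commit itself says."""
    row = db.execute(
        "SELECT current_name, current_email"
        " FROM commit_identity WHERE oid = ?",
        (oid,),
    ).fetchone()
    return row if row is not None else (recorded_name, recorded_email)
```

The point of keying on the OID is that history is never rewritten: the
commits keep whatever name+email they were created with, and only the
side table changes when someone leaves, returns or is renamed.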

>   1. As far as I know, the mailmap format is pretty well-established.
>      I don't know how additions/extensions to the format will be
>      interpreted by other tools.

It's perfectly OK to change parts of that format in
backwards-"incompatible" ways, i.e. there's enough leeway in the
existing format definition and in-the-wild readers to have new readers
pick up new information that old readers will ignore.

I.e. we simply ignore things we can't map now, so one way to do it is to
start with something that produces an invalid (but harmless) mapping to
current readers; another is to borrow a trick from "/etc/sudoers" and
(ab)use the comment syntax.
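
A minimal sketch of the comment-syntax variant, in Python, assuming a
made-up "# valid-until:" directive that applies to the next mapping
line. Git does not understand any such directive; the only real behavior
relied on here is that existing mailmap readers skip lines starting with
"#", so the extra information is invisible to them.

```python
from datetime import date

# Hypothetical directive, not part of git's mailmap format.
PREFIX = "# valid-until:"


def parse_extended_mailmap(text):
    """Return a list of (entry_line, valid_until_or_None) pairs."""
    entries = []
    pending = None  # date from a directive waiting for its entry
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(PREFIX):
            # New readers pick this up; old readers see a comment.
            pending = date.fromisoformat(line[len(PREFIX):].strip())
            continue
        if line.startswith("#"):
            continue  # ordinary comment, ignored by everyone
        entries.append((line, pending))
        pending = None
    return entries
```

A new reader could then compare a commit's author date against the
valid-until date to decide whether a given mapping applies, while old
readers keep treating each entry as unconditional.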