On Sat, Mar 12 2022, Sean Allred wrote:

> We are currently replaying a 15-year SVN history into Git -- with
> contributions from thousands of developers -- and are faced with the
> challenge of corporate email recycling, departures, re-hires, and name
> changes causing identity issues.
>
> * Background
>
> As you know (also to validate my own knowledge/assumptions), a Git
> commit stores identity as a name and an email. The only means to
> validate this information is via signing; commits are otherwise taken
> at face-value. This seems pretty core to Git's decentralized design.
> So to identify who is responsible for a commit, you have only the two
> name+email pairs.
>
> The problem in a nutshell: names and emails change over time. The
> simple cases can be handled by gitmailmap, but there are more
> challenging cases:
>
> - A commit author might have had some email <one@xxxxxxxx>, but then
>   was able to 'upgrade' to <two@xxxxxxxx> after a departure.
>
> - It's even possible that this departure might 'boomerang' and
>   return to their old job, albeit now with a different email (since
>   they forfeited <two@xxxxxxxx> upon departure).
> [...skip a bunch of details...]
> You're at the end now; thanks for reading :-)

Aside from technical solutions and twists on mailmap, you haven't
*really* described what practical problem you're facing here.

> Somewhere down the line, Ada has left, <foo@xxxxxxxxxxx> transferred to
> Roy, and he wrote the following commit:

I.e. this, sure, that can happen, but what's the negative effect of that
in practice?

I've been involved in similar migrations in the past, and the primary
way to deal with it was to mostly ignore it, especially in a corporate
setting.

I.e. sure, you'll have some edge cases here and there, but the value of
knowing who exactly authored something tends to be proportional to how
recent the commit is.
If someone wrote something 10 years ago they're probably not even
working there anymore, or if they are, they will long since have
forgotten what they need to know to answer any specific questions etc.

The only people who tend to look at it are developers using "git blame"
or something, and usually humans are smart enough to spot that even if
it's foo@xxxxxxxxxxx they were expecting Roy, not Ada, or the other way
around.

Side note: To the extent that I've had to deal with this (in a corporate
setting) I found myself wanting git to have the exact opposite,
i.e. some feature where we'd just hide the author for any work that's
>5 years old or whatever.

Not for any privacy reason, but just because some UIs wouldn't really
communicate (in a way that people actually noticed) that the relevant
work was ancient, and someone who'd long since moved on would get
occasional interruptions due to ancient code they wrote but weren't
equipped to currently maintain.

Or similarly, to have anything >N years old "git blame" to the team
currently maintaining that thing, not to the person. But I digress.

Having said that I think if you do need such a back-annotated history
you should look into "git notes" and/or "git replace". I.e. you could
have some lookup system maintain a mapping from OIDs to current
IDs. I've implemented a system like that in the past (in a MySQL table,
but whatever).

I'd think this use-case of perfectly annotated old history is probably
obscure enough that that's the primary thing we should steer people
towards...

> 1. As far as I know, the mailmap format is pretty well-established.
>    I don't know how additions/extensions to the format will be
>    interpreted by other tools.

It's perfectly OK to change parts of that format in
backwards-"incompatible" ways, i.e. there's enough leeway in the
existing format definition and in-the-wild readers to have new readers
pick up new information that old readers will ignore.

I.e.
we simply ignore things we can't map now, so one way to do it is to
start with something that produces an invalid (but harmless) mapping to
current readers; another is to borrow a trick from "/etc/sudoers" and
(ab)use the comment syntax.
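
To make the comment-syntax idea a bit more concrete, here's a rough
sketch of what a "new reader" could look like. Everything about the
directive is made up for illustration: the "#@until" spelling, its field
order, and the helper names parse_dated_mailmap() and resolve() are my
assumptions, not anything Git implements or has agreed on, and the
example.com addresses are placeholders. The only real property it
relies on is that existing mailmap readers skip lines starting with "#".

```python
import re
from datetime import date

# HYPOTHETICAL directive (not part of Git's mailmap format):
#   #@until YYYY-MM-DD Proper Name <proper@email> <commit@email>
# Existing readers see a line starting with "#" and skip it as a comment.
DIRECTIVE = re.compile(
    r"^#@until\s+(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s+<([^>]+)>\s*$"
)


def parse_dated_mailmap(text):
    """Return {commit_email: [(until_date, proper_name, proper_email), ...]}."""
    rules = {}
    for line in text.splitlines():
        m = DIRECTIVE.match(line.strip())
        if not m:
            # Ordinary mailmap entries and real comments are not our business.
            continue
        until, name, proper_email, commit_email = m.groups()
        rules.setdefault(commit_email.lower(), []).append(
            (date.fromisoformat(until), name, proper_email)
        )
    for entries in rules.values():
        entries.sort()  # earliest cutoff first
    return rules


def resolve(rules, commit_email, commit_date):
    """Return the first identity whose cutoff is on or after commit_date."""
    for until, name, proper_email in rules.get(commit_email.lower(), []):
        if commit_date <= until:
            return name, proper_email
    return None  # caller falls back to the plain mailmap / raw identity


EXAMPLE = """\
# an ordinary comment, ignored by old and new readers alike
#@until 2015-06-30 Ada <ada@example.com> <foo@example.com>
#@until 9999-12-31 Roy <roy@example.com> <foo@example.com>
"""

rules = parse_dated_mailmap(EXAMPLE)
print(resolve(rules, "foo@example.com", date(2014, 1, 1)))
print(resolve(rules, "foo@example.com", date(2020, 1, 1)))
```

With the example data the first lookup resolves to Ada and the second
to Roy, while an old reader fed the same file sees nothing but comments
and maps nothing. Keying on the commit date is just one possible
choice; a real design would have to decide between author and committer
dates, and what to do about clock skew in imported history.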