Hi all, We are currently replaying a 15-year SVN history into Git -- with contributions from thousands of developers -- and are faced with the challenge of corporate email recycling, departures, re-hires, and name changes causing identity issues. * Background As you know (also to validate my own knowledge/assumptions), a Git commit stores identity as a name and an email. The only means to validate this information is via signing; commits are otherwise taken at face-value. This seems pretty core to Git's decentralized design. So to identify who is responsible for a commit, you have only the two name+email pairs. The problem in a nutshell: names and emails change over time. The simple cases can be handled by gitmailmap, but there are more challenging cases: - A commit author might have had some email <one@xxxxxxxx>, but then was able to 'upgrade' to <two@xxxxxxxx> after a departure. - It's even possible that this departure might 'boomerang' and return to their old job, albeit now with a different email (since they forfeited <two@xxxxxxxx> upon departure). In effect, the email address (or even email+name pair) used by a given developer is not enough to identify that developer. This issue is exacerbated by the features of some Git forges (e.g. GitHub, GitLab, etc.) that will map an email address to a user account. In the degenerate case, this would cause the forge to attribute the commit incorrectly -- causing communication issues as developers use the misattribution to reach out to the wrong people. As a baseline, we know the following statements are true: 1. A person's preferred name can change at any time. 2. A person's preferred email can change at any time. 3. Neither of these pieces of information are necessarily identifying in a given codebase. * Current Options Setting aside the dubious practice of email recycling, how should we look at resolving this confusion in a sensible way? I see three general options that are possible today, each with their drawbacks: 1. Do nothing. Leave it to the developer to determine the correct contact information without assistance. This doesn't really resolve the confusion, but it is technically an option. 2. Use gitmailmap(5) functionality to resolve historical emails to primary emails. Sadly this doesn't actually solve the email recycling problem. Since one email could be used by multiple developers, there's no way (that I can see) to use a single mailmap file to resolve one of these emails to a single person. 3. Use and require commit signing -- using some separate system to keep track of who used what public key when (valid-before/-after). This is an attractive option and very much fits the 'identity' portion of this problem, but evidently, it's not yet supported by git-fast-import. This becomes a non-starter for at least all our SVN history we're importing over. Even a potential concept of providing personal private keys to an impersonal import process also seems less than appealing, but I'm open to the possibility that I'm being too paranoid here. At this point, I'd like to ask the community what approaches have already proven successful. I have a design sketch below, but I would not want to propose introducing a new standard if a standard already exists. I tend to think this is not the first time this problem has come up. * Bad Proposal: Using Mailmap Over-Time In the absence of other options, this leaves us to consider another: if the mailmap file is tracked in history, we can know who had which email when. Taking advantage of this fact right now is a bit roundabout, but workable. Using the `mailmap.blob` config and pointing to the mailmap version as of a given commit yields behavior that *looks* promising: we can resolve an email in a commit to the person who had that email when the commit was made. Mentally extending this concept to do this automatically in git-log and friends, however, shows that this wholesale removes the value of mailmap: the ability to change your name as it appears in history *after* that history is created. More details at [1]. There are of course legitimate reasons for a developer to change their name and desire that name to be used throughout the history. It would seem then that current mailmap functionality is insufficient to create an external system that solves the problem. * Proposal: UUIDs To get what we want (i.e., the ability to run `git show HEAD~1`, know that Ada wrote it, and report her current contact information), we need some way of tracking identity over time. A naive solution could be to extend the mailmap format as recognized by Git: $ git cat-file blob HEAD~1:.mailmap A. U. Thor <foo@xxxxxxxxxxx> [uuid A] <ada@xxxxxxxxxxx> $ git cat-file blob HEAD:.mailmap A. U. Thor <ada@xxxxxxxxxxx> [uuid A] Roy G. Biv <foo@xxxxxxxxxxx> [uuid B] <roy@xxxxxxxxxxx> Now, when I run `git show HEAD~1`, Git would determine the UUID of the email on the commit using the mailmap version in that tree: $ git -c mailmap.blob=HEAD~1:.mailmap check-mailmap --uuid "<foo@xxxxxxxxxxx>" A Then, we can use that UUID to resolve to the current contact information: $ git check-mailmap --uuid=A A. U. Thor <ada@xxxxxxxxxxx> Mailmap-sensitive commands can use this logic internally -- possibly guarded under some new config setting. ** Criticisms Main criticisms of this proposal that I can think of off-hand: 1. As far as I know, the mailmap format is pretty well-established. I don't know how additions/extensions to the format will be interpreted by other tools. This could be mitigated with a config option or even a magic comment in the file itself noting its format version. 2. This also assumes that any given email can belong to at most one person at a time. This is true for us, but may not be generally true. I don't know if this is a new assumption for mailmap. I don't know of any mitigation here. 3. The current mailmap format of 'one line per pair' potentially opens up the format to ambiguities when resolving an email address -- ambiguities that become more apparent with the introduction of a UUID. For example: A. U. Thor <foo@xxxxxxxxxxx> [uuid A] <ada@xxxxxxxxxxx> Ada Thor <ada@xxxxxxxxxxx> [uuid A] <foo@xxxxxxxxxxx> When evaluating `git check-mailmap --uuid=A`, which line is used? Perhaps this is not a new problem, but it is a disconcerting one. This might be resolved by allowing many aliases on a single line and restricting UUIDs to be unique in the file. Since the introduction of UUIDs would be a change to the format that by definition would need to be opted-into (to provide the UUIDs), this would not be a breaking change in and of itself. I am additionally unsure of how the following current behavior (from gitmailmap(5)) should play into this proposal -- mostly because I don't understand the use-case to carve out a specific author/committer name for the mapping: > [...] and: > > Proper Name <proper@xxxxxxxx> Commit Name <commit@xxxxxxxx> > > which allows mailmap to replace both the name and the email of a > commit matching both the specified commit name and email > address. * Proposal: Valid-Before/Valid-After Another potential idea is to record a transition point in the mailmap file: A. U. Thor <foo@xxxxxxxxxxx> <ada@xxxxxxxxxxx> valid-before=<timestamp> A. U. Thor <ada@xxxxxxxxxxx> This draws on the valid-before/-after pattern used in commit signing. While I've signed a few commits in my day, I'm admittedly not versed on that infrastructure/implementation. I'd be curious if this idea would have a champion to argue for it. I personally prefer UUIDs. ** Criticisms There are a few things I can identify: 1. While it's potentially good enough to solve the immediate problem, this doesn't actually establish a tangible identity. 2. Parsing this as a human being could become difficult if there are a few transition points. * Summary All of this is under the assumption that there is no viable approach out there, which I'm not convinced of. This proposal is only a suggestion of how it *could* be standardized with development if no standard practice currently exists. So, two questions remain: 1. Is there a standard approach out there already that was not discussed above? (Are there statements I made about those approaches that are not true?) 2. Assuming there is no suitable standard, is the above proposal worth investing development time, or are there fundamental flaws? Thanks, Sean Allred --- [1]: See below. I removed this from the main body to try and control its length. * Failed Plan: using mailmap over-time (I'm going to say things in the course of this example that are 'wrong' and/or 'working as designed', so bear with me!) Consider the following. At some point in the past, <foo@xxxxxxxxxxx> belonged to Ada. She wrote the following commit: $ git cat-file commit HEAD~1 ... author A. U. Thor <foo@xxxxxxxxxxx> ... committer A. U. Thor <foo@xxxxxxxxxxx> ... $ git cat-file blob HEAD~1:.mailmap A. U. Thor <foo@xxxxxxxxxxx> <ada@xxxxxxxxxxx> Somewhere down the line, Ada has left, <foo@xxxxxxxxxxx> transferred to Roy, and he wrote the following commit: $ git cat-file commit HEAD ... author Roy G. Biv <foo@xxxxxxxxxxx> ... committer Roy G. Biv <foo@xxxxxxxxxxx> ... $ git cat-file blob HEAD:.mailmap A. U. Thor <ada@xxxxxxxxxxx> Roy G. Biv <foo@xxxxxxxxxxx> <roy@xxxxxxxxxxx> If we check the mailmap to resolve <foo@xxxxxxxxxxx> from HEAD~1, we get the wrong answer: $ git show HEAD~1 ... Author: Roy G. Biv <foo@xxxxxxxxxxx> ... It's only when we provide the version of the mailmap file that was active at the time do we get the right answer: $ git -c mailmap.blob=HEAD~1:.mailmap show HEAD~1 ... Author: A. U. Thor <foo@xxxxxxxxxxx> ... So, if we can instruct git-show and friends to check the mailmap version at the time of the commit, we get what appears to be the desired behavior. Great, right? Not so fast. This loses sight of one of the main purposes of gitmailmap. This idea failed. You're at the end now; thanks for reading :-) -- Sean Allred