On Sun, Jan 03 2021, brian m. carlson wrote:

I think it makes sense to split up 1-4/5 here from 5/5 in this series
since they're really unrelated changes, although due to the changes in
1-4 they'll conflict.

> Many people, through the course of their lives, will change either a
> name or an email address. For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms. Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner. In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

The cover letter noted "As mentioned in the original thread, I think a
hash rather than an encoding is the right choice here.". Reading the v1 I
think you're referring to
https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@xxxxxxxxxxxxxxxxxxxxxxxxx/

In v1 I pointed out you needed to read some combination of the cover
letter & the patch to see what this was intended for (see [1]). I think
for v3 the commit itself should summarize the trade-offs & design
choices.

> For projects which wish to support this, introduce hashed forms into the
> mailmap. These forms, which start with "@sha256:" followed by a SHA-256
> hash of the entry, can be used in place of the form used in the commit
> field. This form is intentionally designed to be unlikely to conflict
> with legitimate use cases. For example, this is not a valid email
> address according to RFC 5322. In the unlikely event that a user has
> put such a form into the actual commit as their name, we will accept it.

We'll emit the commit author information as-is in that case under "git
show", or run the mapping and map it via mailmap? Anyway, it seems
there's a test for this. Probably better to just point to it.

> While the form of the data is designed to accept multiple hash
> algorithms, we intentionally do not support SHA-1. There is little
> reason to support such a weak algorithm in new use cases and no
> backwards compatibility to consider. Moreover, SHA-256 is faster than
> the SHA1DC implementation we use, so this not only improves performance,
> but simplifies the current implementation somewhat as well.

I agree with most of this aside from the "weak algorithm" part. That
seems like an irrelevant aside for this specific use of a hashing
algorithm, no? We could even use MD5 here, so SHA256-only is just setting
us up for not needing to deal with SHA1 forever in this one place in some
SHA256 future repo.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data. However, this is an improvement over the status quo.
>
> The performance of this patch with no hashed entries is very similar to
> the performance without this patch. Considering a git log command to
> look up author and committer information on 981,680 commits in the Linux
> kernel history, either with an unhashed mailmap or a mailmap with all
> old values hashed:
>
>                                   Shortest  Longest  Average  Change
> Git 2.30                            7.876    8.297    8.143
> This patch, unhashed                7.923    8.484    8.237   + 1.15%
> This patch, hashed                 14.510   14.783   14.672   +80.17%
> This patch, hashed, unoptimized    15.425   16.318   15.901   +95.27%
>
> Thus, the average performance after this patch is within normal
> variation of the pre-patch performance. It's unlikely that users will
> notice the difference in practice, even on much larger
> repositories, unless they're using the new feature.

Am I reading this right that if there's a single hashed entry in .mailmap
anything using %aE or %aN is around 2x as slow?

Your v1 mentioned that a project might "insert entries for many
contributors in order to make discovery of "interesting" entries
significantly less convenient." which is gone in the v2 patch. As noted
in [1] I don't see how it helps the obscurity much, but if that's still
the intended use we'd expect to get more slowdowns in the wild, since
users would convert their whole mailmap to this form whenever they want a
single entry to use it.

Anyway, as you might have guessed I'm still not a fan of this direction.
But most of it is because I honestly don't get why this specific approach
is required to achieve the stated aims. There are a few of them, so
here's an attempt to break them down:

1. User changed their name and doesn't want themselves or others to see
   their old name

For the case where Joe Developer is now known as Jane Doe in most cases
you don't need to put the old name at all into the .mailmap. E.g. for
git.git this patch to our .mailmap produces the same output for `log
--all --pretty="%h %an%ae%aN%aE"`:

     brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx>
    -brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx> <sandals@xxxxxxxxxxxxxxxxxxxxxxx>
    -brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx> <bk2204@xxxxxxxxxx>
    +<sandals@xxxxxxxxxxxxxxxxxxxx> <sandals@xxxxxxxxxxxxxxxxxxxxxxx>
    +<sandals@xxxxxxxxxxxxxxxxxxxx> <bk2204@xxxxxxxxxx>

So the new->name/email mapping (as opposed to new->email) is really only
needed for some really obscure cases where two people shared an E-Mail or
something.

So we're talking about hiding the old E-Mail, presumably because it was
joe@ instead of jane@, so in that case we could just support URI
encoding:

    Jane Doe <jane@xxxxxxxxxxx>
    <jane@xxxxxxxxxxx> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>

Made via:

    $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@xxxxxxxxxxxxx], "^@."'
    %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D

Which also has the nice attribute that people can make it obvious what
part they want to hide, since this is really a feature to enable social
politeness & consideration:

    Jane Doe <jane@xxxxxxxxxxx>
    # I don't want to be known by my old name, thanks
    <jane@xxxxxxxxxxx> <%6A%6F%65@xxxxxxxxxxxxx>

2. Hiding from your enemies

For the other use-case of "abusive family or partner" I had the comment
in v1 of "but not so much that you'd still take the risk of submitting a
patch to .mailmap?".

Now that's obviously phrased in an off-the-cuff manner, but I'm serious.
I think it is important that the non-security of this feature obviously
looks like some trivial encoding, because that's what it is.

People get lulled into a false sense of security with these things all
the time (e.g. thinking their "Authorization" HTTP header is safe to post
on a public pastebin). So we should as much as possible make this look
like the non-security it is.

3. Enabling people not to treat .mailmap as binary or a multi-encoding
   file.

I mentioned this in my [1]. Your implementation doesn't do this, but
e.g. it would be very nice for a project that switched from latin-1 to
utf-8 to be able to do, in some cases:

    # Made with: perl -MURI::Escape=uri_escape -wE 'say uri_escape "@ARGV", "^a-z@. "' $(echo Ævar Arnfjörð Bjarmason | iconv -f utf-8 -t iso-8859-1)
    # Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> %C6var %41rnfj%F6r%F0 %42jarmason <avarab@xxxxxxxxx>

Or some combination thereof, so e.g. projects previously using
Big5/latin1 who migrated to UTF-8 don't need to have non-valid UTF-8 in
.mailmap.

4. Spam

You mentioned this in your [2] (but not as a use-case in the v2 re-rolled
commit message):

    And we know that spammers and recruiters (which, in this case, are
    also spammers) do indeed scrape repositories via the repository web
    interfaces.

Surely these people are most interested in the current E-Mail addresses,
which if they're scraping the common web interfaces (e.g. GitHub, GitLab)
are easily accessible there.

It doesn't seem very plausible that someone would care enough to scrape
.mailmap for old addresses but not just update their scraper to clone &
run "git log" for the purposes of e.g. their recruitment E-Mails.

5. Interaction with other systems

Something I mentioned in the last 3 paragraphs of my [1]. I think you're
only considering the cases where git itself does the mailmap translation,
but we have 3rd party systems that make use of the format in good ways
(also doing the Joe->Jane mapping). Making it a hassle for those systems
makes it more likely that Jane doesn't get the mapping she wants.

1. https://lore.kernel.org/git/87eejswql6.fsf@xxxxxxxxxxxxxxxxxxx/
2. https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@xxxxxxxxxxxxxxxxxxxxxxxxx/
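
To make the comparison between the encoded and the hashed form concrete,
here is a minimal Python sketch. It is not taken from the patch or from
git. It uses Python in place of the Perl one-liner above, and the helper
names (percent_encode, hashed_form, find_hashed) are hypothetical. The
exact bytes the patch hashes are also an assumption: the commit message
only says "@sha256:" followed by a SHA-256 hash of the entry, so a
lowercase hex digest of the UTF-8 bytes is assumed here.

```python
import hashlib
from typing import Iterable, Optional
from urllib.parse import unquote

# The percent-encoded old address exactly as it appears in this mail.
ENCODED = "%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D"


def percent_encode(entry: str) -> str:
    """Escape every byte except '@' and '.', like uri_escape(..., "^@.")."""
    return "".join(
        ch if ch in "@." else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in entry
    )


def hashed_form(entry: str) -> str:
    """ASSUMPTION: '@sha256:' + hex SHA-256 of the UTF-8 bytes of the entry."""
    return "@sha256:" + hashlib.sha256(entry.encode("utf-8")).hexdigest()


def find_hashed(hashed: str, candidates: Iterable[str]) -> Optional[str]:
    """Recover a hashed entry by hashing every candidate and comparing.

    The candidates would be the names and addresses already present in
    the commit objects, which is the lookup the commit message mentions.
    """
    for candidate in candidates:
        if hashed_form(candidate) == hashed:
            return candidate
    return None


# Decoding needs no candidate list at all: the form is plainly an encoding.
decoded = unquote(ENCODED)
round_trip = percent_encode(decoded)

# The hashed form needs a candidate list, but that list is the history.
recovered = find_hashed(hashed_form(decoded), ["jane@example.com", decoded])
```

Either form is recoverable by anyone with a clone. The difference is
that the encoded form looks as reversible as it is, while the hashed
form costs one SHA-256 per candidate during the lookup.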