On Wed, Sep 14 2022, Florine W. Dekker wrote:

> On 14/09/2022 09:40, René Scharfe wrote:
>> On 13.09.22 at 23:53, Florine W. Dekker wrote:
>>> Now, John can now add the following line to their mailmap config:
>>> `John Doe <john.doe@xxxxxxxxxxx> <\*.doe@xxxxxxxxxxx>`, which does
>>> not reveal their old name.
>> That would falsely attribute the work of possible future developers
>> ann.doe@xxxxxxxxxxx and bob.doe@xxxxxxxxxxx to John as well.

First, I'm very happy to see that someone has picked up the thread on
this again.

> Good point. I assumed such false positives would be unlikely because I
> was considering very-small-scale projects, but I agree that using
> wildcards is not at all feasible for larger projects.

Yes, please. Making the mapping fuzzy in any way really goes against the
core design of the mailmap mechanism; it should be unambiguous, *also*
for commits going forward.

>> Supporting hashed entries would allow for a more targeted obfuscation.
>> That was discussed a while ago:
>> https://lore.kernel.org/git/20210103211849.2691287-1-sandals@xxxxxxxxxxxxxxxxxxxx/
>
> That was an interesting read. I agree with Ævar in that thread in that
> I think URL encoding is sufficient. I think it meets Brian's use case
> of never having to see the old name again, and my use case of
> obfuscating it from accidental discovery by friendly
> collaborators.

The question that was left open in my mind after that previous
discussion was whether people who wanted the "deadname" feature would
find this acceptable. I don't think we got any explicit ACK/NACK on that
(but I may be misrecalling, and didn't go back & re-read the whole
thing).

I'm happy that there's at least one ACK to it here in the form of your
reply, and hopefully that represents what a wider audience would prefer.

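
To make the URL-encoding idea a bit more concrete, here is a minimal
Python sketch of what "over-encoding" a value could look like. This only
illustrates the proposal under discussion, not anything git's mailmap
parser is known to do; the helper names `over_encode` and `decode` and
the example address are made up for illustration.

```python
from urllib.parse import unquote


def over_encode(text: str) -> str:
    """Percent-encode every UTF-8 byte, including plain ASCII letters."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


def decode(text: str) -> str:
    """Reverse the encoding, as a hypothetical mailmap reader would."""
    return unquote(text)


old_email = "jane.doe@example.com"  # hypothetical address
encoded = over_encode(old_email)
print(encoded)          # unreadable at a glance, but not secret
print(decode(encoded))  # trivially recovers the original
```

The point of the sketch is that the encoding hides the value from a
casual reader of the file and nothing more: anyone can decode it.
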
> While a hash certainly gives a stronger sense of
> security, I think it's a false sense of security, because, as you note
> below, recovering old email addresses from the tree is not much more
> trivial than reversing the encoding. And either way, a sha256 hash can
> easily be inverted in a few days(?) using a dictionary attack with
> email addresses from data breaches.

It's going to be "milliseconds", not "days". Brute-forcing a SHA-256 to
find an unknown E-Mail address might take longer, but by definition for
a .mailmap entry you already have both sides. So "brute-forcing" is
just a matter of hashing authors & E-Mails in our history, and seeing if
they correspond to .mailmap entries.

> As someone who has changed her name, I would be content with using a
> simple URL encoding.

I'd be happy to have that as a feature, in particular because (as I
pointed out in the previous discussion) it has a large use-case outside
of this .mailmap topic, namely wanting to map e.g. mis-encoded author
names in past commits to the right encoding (which I've personally had
some use-cases for).

There might be other "bonus" use-cases I've missed. E.g. is ">" or "<"
allowed in obscure E-Mail addresses (maybe within quotes)? Our current
parser would barf on it, but being able to URI-encode it would work
around that.

I don't know offhand to what extent there's an overlap with various
RFC-pedantic E-Mail addresses one could come up with, and what we'd
accept in commit objects with "fsck".

In any case, I think that an implementation of this & patch to
gitmailmap(5) should explain this sort of feature in those terms. If
some people then find it useful to encode things in the ASCII-space for
some reason (e.g. the social "deadname" reason) that would also be
useful. But in terms of the docs I don't think it should be documented
in that way. Git just needs to provide the feature, we don't need to
dictate how & why someone might use it.

>> [...]
>> $ git log --format='%ae %aE' |
>> awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
>> grep -F l.s.r@xxxxxx
>> rene.scharfe@xxxxxxxxxxxxxx l.s.r@xxxxxx
>>
>> The same can be done with names (%an/%aN).
>
> You're absolutely right. With "advanced tools" I was referring to
> anything more advanced than a plain `git log` ;-)

The thing that still makes me a bit nervous on this topic is that we
need to make it really clear that we're *not* providing some promise of
obscuring these values going forward, but just providing a feature that
some people might rely on as a combined social mechanism, and with the
assumption that the defaults of the "git log" view are unlikely to
change.

I.e. I think a "deadname" use-case of this would probably:

 * Have some comment at the top of .mailmap about why some values are
   over-encoded (or perhaps it would be obvious to everyone working on
   that repo why someone was encoding the "plain ASCII" A-Za-z0-9
   space).

 * Use the default "git log" view, where we happen to map these (given
   the right options, config etc.)

But should not:

 * Assume that other tools such as "fsck", "check-mailmap" or even
   "log" won't have future features that make de-obscuring these values
   easier, or something that's part of a normal workflow.

   E.g. I've wanted a "fsck for mailmap" for a while, i.e. to scan the
   file, parse our history, and see which entries are redundant or even
   potentially missing (based on e.g. names matching, but having
   different E-Mail addresses).

   It would be hard not to de-obscure URI encoded values for some
   features like that, e.g. if "log" adds the ability to say "this name
   X was mapped from Y".

 * In general pretend that the mailmap is anything but a *public* and
   easily readable mapping. It's inherent in the feature that the
   consumer of it will know that X used to be Y.

The last thing we want is to create some feature that effectively ends
up being some self-doxxing (or self-"de-deadnaming"?) mechanism, because
we've left a gap between user expectations and what we can realistically
provide.
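
As a footnote to the "milliseconds" point above about hashed entries,
here is a rough Python sketch of why a hash would not hide anything from
someone who has the history. It assumes a hypothetical scheme where the
.mailmap stores the plain SHA-256 hex digest of the old address; the
format proposed in the earlier patch series may well differ, and the
function names and addresses are invented for illustration.

```python
import hashlib
from typing import Dict, Iterable, Set


def sha256_hex(value: str) -> str:
    """Hex SHA-256 of a string, encoded as UTF-8."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover(hashed_entries: Set[str], history_emails: Iterable[str]) -> Dict[str, str]:
    """Map each hashed entry back to the history address that produced it."""
    found = {}
    for email in history_emails:
        digest = sha256_hex(email)
        if digest in hashed_entries:
            found[digest] = email
    return found


# Hypothetical data: one hashed mailmap entry, and the author addresses
# one could list from the history (e.g. with git log --format=%ae).
hashed = {sha256_hex("old.name@example.com")}
history = ["new.name@example.com", "old.name@example.com"]
print(recover(hashed, history))
```

That is one hash per distinct address in the history, so no dictionary
of leaked addresses is needed.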