Re: Wildcards in mailmap to hide transgender people's deadnames

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 19 Sep 2022 13:20:13 +0200

On Wed, Sep 14 2022, Florine W. Dekker wrote:

> On 14/09/2022 09:40, René Scharfe wrote:
>> Am 13.09.22 um 23:53 schrieb Florine W. Dekker:
>>> Now, John can add the following line to their mailmap config:
>>> `John Doe <john.doe@xxxxxxxxxxx> <*.doe@xxxxxxxxxxx>`, which does
>>> not reveal their old name.
>> That would falsely attribute the work of possible future developers
>> ann.doe@xxxxxxxxxxx and bob.doe@xxxxxxxxxxx to John as well.

First, I'm very happy to see that someone has picked up the thread on
this again.

> Good point. I assumed such false positives would be unlikely because I
> was considering very-small-scale projects, but I agree that using 
> wildcards is not at all feasible for larger projects.

Yes, please. Making the mapping fuzzy in any way really goes against
the core design of the mailmap mechanism: it should be unambiguous,
*also* for commits going forward.
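
To make René's false-positive concern concrete, here is a rough sketch.
It assumes the proposed wildcard would behave like a shell-style glob,
which nothing in Git's mailmap does today, and it uses example.com in
place of the redacted domain. The pattern and the helper name are
illustrative only:

```python
from fnmatch import fnmatchcase

# Hypothetical wildcard entry from the proposal; not valid mailmap syntax today.
PATTERN = "*.doe@example.com"


def wildcard_matches(email: str) -> bool:
    """Return True if the glob-style pattern would claim this address."""
    return fnmatchcase(email.lower(), PATTERN)


# Made-up author addresses: one intended target, two future colleagues,
# and one unrelated person.
authors = [
    "john.doe@example.com",
    "ann.doe@example.com",
    "bob.doe@example.com",
    "ann.smith@example.com",
]

# Every address the pattern would silently remap to "John Doe".
claimed = [a for a in authors if wildcard_matches(a)]
print(claimed)
```

Anyone whose address happens to fit the pattern gets remapped, which is
exactly the ambiguity an exact-match mapping avoids.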

>> Supporting hashed entries would allow for a more targeted obfuscation.
>> That was discussed a while ago:
>> https://lore.kernel.org/git/20210103211849.2691287-1-sandals@xxxxxxxxxxxxxxxxxxxx/
>
> That was an interesting read. I agree with Ævar in that thread in that
> I think URL encoding is sufficient. I think it meets Brian's use case
> of never having to see the old name again, and my use case of
> obfuscating it from accidental discovery by friendly
> collaborators.

The question that was left open in my mind after that previous
discussion was whether people who wanted the "deadname" feature would
find this acceptable. I don't think we got any explicit ACK/NACK on that
(but I may be misrecalling, and didn't go back & re-read the whole
thing).

I'm happy that there's at least one ACK to it here in the form of your
reply, and hopefully that represents what a wider audience would prefer.

> While a hash certainly gives a stronger sense of
> security, I think it's a false sense of security, because, as you note
> below, recovering old email addresses from the tree is not much more
> trivial than reversing the encoding. And either way, a sha256 hash can
> easily be inverted in a few days(?) using a dictionary attack with
> email addresses from data breaches.

It's going to be "milliseconds", not "days". Brute-forcing a SHA-256 to
find an unknown E-Mail address might take longer, but by definition for
a .mailmap entry you already have both sides.

So "brute-forcing" is just a matter of hashing authors & E-Mails in our
history, and seeing if they correspond to .mailmap entries.
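
As a rough sketch of why that is cheap: assume a hypothetical scheme
where the old address is stored in .mailmap as the hex SHA-256 of its
UTF-8 bytes (the earlier thread may have used a different format; this
is only for illustration). The function names and addresses are made
up:

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Hex SHA-256 of a string's UTF-8 bytes (assumed hashing scheme)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def unmask(hashed_entries, history_emails):
    """Map each hashed mailmap entry back to the address found in history.

    history_emails is whatever `git log --format=%ae` would list; one pass
    of hashing builds a lookup table, so no real brute force is needed.
    """
    lookup = {sha256_hex(email): email for email in history_emails}
    return {h: lookup[h] for h in hashed_entries if h in lookup}


# Made-up data: the "secret" old address is already present in history.
history = ["old.name@example.com", "someone.else@example.com"]
mailmap_hashes = [sha256_hex("old.name@example.com")]

recovered = unmask(mailmap_hashes, history)
print(recovered)
```

The cost is one hash per distinct author address in the repository.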

> As someone who has changed her name, I would be content with using a
> simple URL encoding.

I'd be happy to have that as a feature, in particular because (as I
pointed out in the previous discussion) it has a large use-case outside
of this .mailmap topic, namely wanting to map e.g. mis-encoded author
names in past commits to the right encoding (which I've personally had
some use-cases for).

There might be other "bonus" use-cases I've missed. E.g. if ">" or "<"
is allowed in obscure E-Mail addresses (maybe within quotes?), our
current parser would barf on it, but being able to URI-encode it would
work around that. I don't know offhand to what extent there's an overlap
with various RFC-pedantic E-Mail addresses one could come up with, and
what we'd accept in commit objects with "fsck".
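
For illustration, a minimal sketch of what such an encoding could look
like, assuming plain percent-encoding as in URIs. This is not something
the mailmap parser understands today, and `encode_all` and `decode` are
hypothetical helper names:

```python
from urllib.parse import unquote_to_bytes


def encode_all(raw: bytes) -> str:
    """Percent-encode every byte, including plain ASCII letters."""
    return "".join(f"%{b:02X}" for b in raw)


def decode(field: str) -> bytes:
    """Decode a percent-encoded field back to the raw bytes to match on."""
    return unquote_to_bytes(field)


# Over-encoding an address so it is not readable at a glance.
obscured = encode_all(b"john.doe@example.com")
print(obscured)

# A character the current parser treats as a delimiter.
print(encode_all(b"<"))

# A mis-encoded (Latin-1) author name that is not valid UTF-8 can still be
# written in an ASCII-only file and matched byte-for-byte.
print(decode("%C6var"))
```

Working on bytes rather than text is a deliberate choice in this sketch:
it lets an entry match author fields in old commits whatever encoding
they were written in.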

In any case, I think that an implementation of this & patch to
gitmailmap(5) should explain this sort of feature in those terms. If
some people then find it useful to encode things in the ASCII-space for
some reason (e.g. the social "deadname" reason) that would also be
useful.

But in terms of the docs I don't think it should be documented in that
way. Git just needs to provide the feature; we don't need to dictate how
& why someone might use it.

>> [...]
>>     $ git log --format='%ae %aE' |
>>       awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
>>       grep -F l.s.r@xxxxxx
>>     rene.scharfe@xxxxxxxxxxxxxx l.s.r@xxxxxx
>>
>> The same can be done with names (%an/%aN).
>
> You're absolutely right. With "advanced tools" I was referring to
> anything more advanced than a plain `git log` ;-)

The thing that still makes me a bit nervous on this topic is that we
need to make it really clear that we're *not* providing some promise of
obscuring these values going forward, but just providing a feature that
some people might rely on as a combined social mechanism, and with the
assumption that the defaults of the "git log" view are unlikely to
change.

I.e. I think a "deadname" use-case of this would probably:

* Have some comment at the top of .mailmap about why some values are
  over-encoded (or perhaps it would be obvious to everyone working on
  that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).

* Use the default "git log" view, where we happen to map these (given
  the right options, config etc.)

But should not:

* Assume that other tools such as "fsck", "check-mailmap" or even "log"
  won't gain future features that make de-obscuring these values easier,
  or even part of a normal workflow.

  E.g. I've wanted a "fsck for mailmap" for a while, i.e. to scan the
  file, parse our history, and see which entries are redundant or even
  potentially missing (based on e.g. names matching, but having
  different E-Mail addresses).

  It would be hard not to de-obscure URI encoded values for some
  features like that, e.g. if "log" adds the ability to say "this name X
  was mapped from Y".

* In general pretend that the mailmap is anything but a *public* and
  easily readable mapping. It's inherent in the feature that the
  consumer of it will know that X used to be Y.
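
To give an idea of the "fsck for mailmap" mentioned in the list above,
here is a very rough sketch of its "potentially missing" half. It
assumes the identities have already been extracted from history (e.g.
from "git log" output) and that the mailmap has been reduced to a plain
old-address-to-canonical-address dictionary. Both are simplifications,
and the function name is made up:

```python
from collections import defaultdict


def missing_candidates(identities, mailmap):
    """Report names that still have several addresses after mapping.

    identities: iterable of (author name, author email) pairs from history.
    mailmap: dict of old email -> canonical email (simplified model).
    """
    by_name = defaultdict(set)
    for name, email in identities:
        email = email.lower()
        # Apply the mapping; unmapped addresses stay as they are.
        by_name[name].add(mailmap.get(email, email))
    return {n: sorted(e) for n, e in by_name.items() if len(e) > 1}


# Made-up history: Ann's old address is mapped, John's second one is not.
identities = [
    ("John Doe", "john@a.example"),
    ("John Doe", "jd@b.example"),
    ("Ann Roe", "ann@a.example"),
    ("Ann Roe", "ann@old.example"),
]
mailmap = {"ann@old.example": "ann@a.example"}

report = missing_candidates(identities, mailmap)
print(report)
```

Note that a tool like this has to work on the decoded values, so any
report it prints would naturally show the de-obscured addresses.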

The last thing we want is to create some feature that effectively ends
up being some self-doxxing (or self-"de-deadnaming"?) mechanism, because
we've left a gap between user expectations and what we can realistically
provide.