Re: [PATCH 1/1] mailmap: support hashed entries in mailmaps

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 14 Dec 2020 12:54:13 +0100

On Sun, Dec 13 2020, brian m. carlson wrote:

> Many people, through the course of their lives, will change either a
> name or an email address.  For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms.  Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may have left a company which engaged in a deeply
> unethical act with which the person does not want to be associated, or
> they may have changed their name to disassociate themselves from an
> abusive family or partner.  In such a case, using the former name or
> address in any way may be undesirable and the person may wish to replace
> it as completely as possible.
>
> [...]
>
> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.

The commit message & cover letter are subtly different in a way that I
didn't even notice at first glance. E.g. I assume based on the cover
letter that one part of this is a proposed solution to the whole
"deadname" problem. It would be nice if v2 were more explicit and
attempted to explicitly summarize the use-cases in the commit message.

But for now I'll attempt to read between the lines from having read
both.

I don't understand why either the problem of "I don't want to see my old
name again" or "I want to hide from other abusive people" (as an aside:
but not so much that you wouldn't take the risk of submitting a patch to
.mailmap?) requires a hashing solution, as opposed to just some encoding
in the .mailmap file such as base64.

You can still trivially get the same information in the end: on git.git,
running "git log" with --pretty=format:"%aN %aE %an %ae" takes under a
second. A part of your commit message seems to address this:

> However, a project for which this feature is valuable may
> simply insert entries for many contributors in order to make discovery
> of "interesting" entries significantly less convenient.

But I don't get how that's helped at all by a sha256 hash. Since you can
trivially re-expand these again using log/check-mailmap the hashing
offers no extra protection beyond a trivial layer of obscurity in those
cases. You'd get the same safety in numbers by having everything in a
large un-hashed .mailmap file, would you not?
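
To make the re-expansion point concrete, here is a minimal sketch in C.
It assumes the candidate addresses have already been collected from
history (e.g. from log output), and it uses 64-bit FNV-1a purely as a
stand-in for SHA-256 so the example stays self-contained. toy_hash() and
reverse_hashed_entry() are hypothetical names, not anything in mailmap.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for SHA-256 (assumption for brevity): 64-bit FNV-1a.
 * Only the lookup logic below matters for the argument.
 */
static uint64_t toy_hash(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a offset basis */

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 0x100000001b3ULL; /* FNV-1a prime */
	}
	return h;
}

/*
 * Given the hash stored in a (hypothetical) hashed mailmap entry and the
 * addresses seen in history, return the address that produced the hash,
 * or NULL if none of the candidates match.
 */
static const char *reverse_hashed_entry(uint64_t target,
					const char *const *candidates,
					size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (toy_hash(candidates[i]) == target)
			return candidates[i];
	return NULL;
}
```

The cost is one hash per distinct address in history, and that stays
true with a real SHA-256 in place of the stand-in.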

I think the underlying use-case is legitimate, but I read it as
primarily a social signaling feature by a trivial addition of
obscurity. Someone called X would like not to be called Y anymore, or
not be found in a search engine or "git grep" when searching for "Y".

So I'd think purely from the perspective of the feature's appearance to
users matching its underlying security we'd be better served with
support for encoding of some sort. E.g. URL encoding, Base64, or even
just string_reverse() (ROT13 is out as not working for non-ASCII names).

The encoding versions of this have the added bonus of expanding the
use-case beyond what you're suggesting. If you're trying to map e.g. a
non-UTF-8 E-Mail address (in your project due to some encoding error)
you'd be able to put it into .mailmap without making the project
maintainers deal with invalid non-UTF-8 encoding in the file (the
existing support is sufficient to map names in most such cases).
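
As a rough illustration of the encoding idea, here is a sketch of a
percent-decoder (URL encoding) in C that could turn an encoded entry
back into raw bytes, including non-UTF-8 ones. Nothing like this exists
in mailmap.c today; percent_decode() and hex_value() are hypothetical
helpers, and how an encoded entry would be marked in the file is left
open.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of a single hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = tolower(c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Decode %XX escapes in place. Returns the decoded length, or -1 if an
 * escape is truncated or not hex (the buffer may be partially rewritten
 * in that case, so callers should treat it as invalid).
 */
static long percent_decode(char *s)
{
	const char *in = s;
	char *out = s;

	while (*in) {
		if (*in == '%') {
			int hi = hex_value((unsigned char)in[1]);
			/* Only look at in[2] once in[1] is known to be valid. */
			int lo = hi < 0 ? -1 : hex_value((unsigned char)in[2]);

			if (hi < 0 || lo < 0)
				return -1;
			*out++ = (char)(hi * 16 + lo);
			in += 3;
		} else {
			*out++ = *in++;
		}
	}
	*out = '\0';
	return (long)(out - s);
}
```

An entry such as "J%F6rg" would decode to the Latin-1 bytes for that
name while the .mailmap file itself stays plain ASCII.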

Another reason I'd prefer some encoding solution is because .mailmap
isn't just used by git itself. Since the format got added it's become
how a lot of downstream systems do this mapping. E.g. I worked once on a
change management system that mapped lots of user actions across
different systems, and piggy-backed on .mailmap files in git to resolve
E-Mail addresses even in cases where the originating data wasn't within
git.

Now because of the trivialness of the format it's easy to e.g. import it
into a DB table and do a JOIN against it (or the same after converting
it from some trivial encoding). Use-cases like that would become a full
history walk for each project to extract the real E-Mails (or a
re-implementation of the SHA256 trick in some sub-SELECT in the database).

Those are all solvable problems that are rather trivial in the end. I
just wonder if we're not making things needlessly hard to achieve the
stated aims. And to be fair, most of those aims I inferred (and might
have incorrectly inferred), since as noted above the patch itself
doesn't discuss the tradeoffs of potential alternate solutions.

> Signed-off-by: brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx>
> ---
>  mailmap.c          | 39 +++++++++++++++++++++++++++++++++++++--
>  t/t4203-mailmap.sh | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 72 insertions(+), 2 deletions(-)
>
> [...]
>
>  int map_user(struct string_list *map,
>  	     const char **email, size_t *emaillen,
>  	     const char **name, size_t *namelen)
> @@ -324,7 +359,7 @@ int map_user(struct string_list *map,
>  		 (int)*namelen, debug_str(*name),
>  		 (int)*emaillen, debug_str(*email));
>  
> -	item = lookup_prefix(map, *email, *emaillen);
> +	item = lookup_one(map, *email, *emaillen);
>  	if (item != NULL) {
>  		me = (struct mailmap_entry *)item->util;
>  		if (me->namemap.nr) {
> @@ -334,7 +369,7 @@ int map_user(struct string_list *map,
>  			 * simple entry.
>  			 */
>  			struct string_list_item *subitem;
> -			subitem = lookup_prefix(&me->namemap, *name, *namelen);
> +			subitem = lookup_one(&me->namemap, *name, *namelen);
>  			if (subitem)
>  				item = subitem;
>  		}

If you turn on DEBUG_MAILMAP=1 at the top of the file and run e.g. an
unbounded --pretty=format:%aE you can see we'll call map_user() in a
loop for each commit shown. What I'm suggesting above can be read as
"can't we have some solution that achieves the same aims, but which we
can handle purely in add_mapping()?". Both for our case, and for
external parsers/re-implementations.
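
A toy sketch of what "purely in add_mapping()" could look like, using
byte reversal as the encoding only because it is the shortest to write.
struct toy_map, toy_add_mapping() and toy_lookup() are hypothetical
stand-ins, not the real string_list-based code: the point is just that
decoding happens once per .mailmap line and the per-commit lookup stays
a plain comparison.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MAP_MAX 16

struct toy_map {
	char *keys[TOY_MAP_MAX];
	size_t nr;
};

/* Reverse the bytes of s in place; applying it twice restores s. */
static void reverse_bytes(char *s)
{
	size_t n = strlen(s);
	size_t i;

	for (i = 0; i < n / 2; i++) {
		char tmp = s[i];
		s[i] = s[n - 1 - i];
		s[n - 1 - i] = tmp;
	}
}

/* Parse-time step: decode the entry once and store the plain key. */
static int toy_add_mapping(struct toy_map *m, const char *encoded)
{
	char *key;

	if (m->nr == TOY_MAP_MAX)
		return -1;
	key = malloc(strlen(encoded) + 1);
	if (!key)
		return -1;
	strcpy(key, encoded);
	reverse_bytes(key);
	m->keys[m->nr++] = key;
	return 0;
}

/* Per-commit step: compare against decoded keys, no hashing needed. */
static int toy_lookup(const struct toy_map *m, const char *email)
{
	size_t i;

	for (i = 0; i < m->nr; i++)
		if (!strcmp(m->keys[i], email))
			return 1;
	return 0;
}
```

An external parser would only need the same one-time decode before
loading the entries into whatever table it already uses.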

In any case it would be interesting if v2 amended
t/perf/p4205-log-pretty-formats.sh to test e.g. the impact of linux.git
with all-sha256 entries to see what the cost in the tight loop could be.