Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sun, 10 Jan 2021 20:24:34 +0100

On Wed, Jan 06 2021, brian m. carlson wrote:

> On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Sun, Jan 03 2021, brian m. carlson wrote:
>> 
>> I think it makes sense to split up 1-4/5 here from 5/5 in this series
>> since they're really unrelated changes, although due to the changes in
>> 1-4 they'll conflict.
>
> Okay, I'll drop them.

Not replying to most of this E-Mail because I think there's nothing left
to add / you clarified things for me in those cases / we respectfully
disagree / any outstanding points we can pick up in your re-roll /
whatever :)

>> So we're talking about hiding the old E-Mail, presumably because it was
>> joe@ instead of jane@, so in that case we could just support URI
>> encoding:
>> 
>>     Jane Doe <jane@xxxxxxxxxxx>
>>     <jane@xxxxxxxxxxx> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>
>> 
>> Made via:
>> 
>>     $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@xxxxxxxxxxxxx], "^@."'
>>     %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
>> 
>> Which also has the nice attribute that people can make it obvious what
>> part they want to hide, since this is really a feature to enable social
>> politeness & consideration:
>> 
>>     Jane Doe <jane@xxxxxxxxxxx>
>>     # I don't want to be known by my old name, thanks
>>     <jane@xxxxxxxxxxx> <%6A%6F%65@xxxxxxxxxxxxx>
>
> I don't think this feature is going to get used if we just encode names
> or email addresses.  In the United States, when someone transitions,
> they get a court order to change their name.  I don't think a lot of
> corporate environments are going to want to just encode an old name or
> email address in a trivially invertible way given that.  This is
> typically a topic handled with some sensitivity in most companies.
>
> I will tell you that I would not just use an encoded version if I were
> changing my name for any of the reasons I've mentioned.  That wouldn't
> cut it for me, and I wouldn't use such a feature.  The feature I'm
> implementing is a feature I've talked with trans folks about, and that's
> why I'm implementing this as it is.  The response I got was essentially,
> "It's not everything I want, but it's an improvement."
>
> If the decision is that we want to go with encoding instead of hashing,
> then I'll drop this patch.  I'm not going to put my name or sign-off on
> that because I don't think it meets the need I'm addressing here.
>
> The entire problem, of course, is that we bake a human's personal name
> and email address immutably into a Merkle tree.  We know full well that
> people do change their names and email addresses all the time (e.g.,
> marriage, job changes), and yet we have this design.  In retrospect, we
> should have done something different, but hindsight is 20/20 and I'm
> just trying to do the best we can with what we've got.

Doesn't the difference in some sense boil down to either an implicit
promise or an implicit assumption that the hashed version is forever
going to be protected by some security-through-obscurity/inconvenience
when it comes to git.git & its default tooling?

And would those users be as comfortable with the difference between
encoded vs. hashed if e.g. "git check-mailmap" learned to read the
.mailmap and search-replace all the hashed versions with their
materialized values, or if popular tools like Emacs learned to open a
Git .mailmap in a "needs translation" mode similar to *.gpg and *.gz?
How about if popular web views of Git served up that materialized
"check-mailmap" output by default?
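
To make that concrete, here is a rough sketch (Python, purely
illustrative, not a proposal for how git itself would do it) of such a
"materializing" pass. The "@sha256:" marker and the choice to hash the
exact address bytes are assumptions about the on-disk format, and
materialize() is a made-up name:

```python
import hashlib
import re
from urllib.parse import unquote

# Assumed (hypothetical) marker for a hashed address inside <...>.
HASH_PREFIX = "@sha256:"


def materialize(mailmap_lines, known_emails):
    """Return mailmap lines with hashed/encoded addresses made plain.

    known_emails: addresses harvested from history, e.g. the unique
    output of `git log --format=%ae%n%ce`.
    """
    # Assumption: the hash is SHA-256 over the exact address bytes.
    by_digest = {
        hashlib.sha256(email.encode("utf-8")).hexdigest(): email
        for email in known_emails
    }

    def reveal(match):
        addr = match.group(1)
        if addr.startswith(HASH_PREFIX):
            digest = addr[len(HASH_PREFIX):].lower()
            plain = by_digest.get(digest)
            # Digests that match nothing in history are left untouched.
            return f"<{plain}>" if plain is not None else match.group(0)
        # Percent-encoding needs no lookup table at all.
        return f"<{unquote(addr)}>"

    return [
        line if line.lstrip().startswith("#")
        else re.sub(r"<([^<>]*)>", reveal, line)
        for line in mailmap_lines
    ]
```

The only practical difference between the two cases is the lookup
table, and that table is built from addresses anyone with a clone
already has.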

I don't think it's implausible that we'll get any of those as follow-up
patches. I might even submit some at some point, not out of spite, but
because I have an out-of-tree program that understands a Git .mailmap
today, and I don't want to maintain out-of-tree code for it to
search-replace the hashed versions.

Likewise, it's very likely that popular editors or web viewers will
gain support for this, simply because it's tedious to manually hash,
copy/paste & validate values.

While looking at some of the fsck code recently (I have some
yet-unsubmitted patches), I thought of trying to combine it with
mailmap. I.e. it seems like a natural feature for fsck to gain: warning
you about unused mailmap entries, just like it can warn about
unreachable/dangling objects. After all, these are really just sort-of
pointers into our Merkle tree. Spewing out all the mappings seems like
an obvious addition to that, e.g. by emitting an
"optimized/non-redundant" (plain or hashed) mailmap to re-commit.

That's the main reason I'm uncomfortable with this approach, because it
seems to me to implicitly rely on things that are tedious now, but which
the march of history all but inevitably should make trivial if we were
to integrate it. Unless we're *also* promising to forever intentionally
(and artificially) keep it inconvenient.
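
As a rough illustration of how little the fsck-style "unused entry"
check mentioned above would need, here is a sketch in Python. It is not
fsck code; unused_entries() is a made-up name, the "@sha256:" marker
and hashing the exact address bytes are assumptions about the format,
and it only looks at the last <...> on each line (the address that is
matched against commits), ignoring name-based matching:

```python
import hashlib
import re


def unused_entries(mailmap_lines, history_emails):
    """Return mailmap lines whose commit address matches nothing in history."""
    # Plain addresses are compared case-insensitively.
    plain = {email.lower() for email in history_emails}
    # Assumption: hashed entries are SHA-256 over the exact address bytes.
    hashed = {
        hashlib.sha256(email.encode("utf-8")).hexdigest()
        for email in history_emails
    }

    unused = []
    for line in mailmap_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are not entries
        addrs = re.findall(r"<([^<>]*)>", stripped)
        if not addrs:
            continue
        key = addrs[-1]  # the address looked up in commits
        if key.startswith("@sha256:"):
            used = key[len("@sha256:"):].lower() in hashed
        else:
            used = key.lower() in plain
        if not used:
            unused.append(line)
    return unused
```

Doing this check for hashed entries means hashing every address in
history anyway, at which point the full mapping is already in hand.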

Take e.g. the example in the v1 thread of how long it takes to clone &
extract this info from chromium.git.

It seems like a fair assumption that we'll have some future version of
git where you can ask a remote server about that sort of thing in
milliseconds.

Not because of this hashed .mailmap thing in particular, but as an
emergent effect of the server being happy to serve up things it knows
about the DAG from having walked & cached it in general. E.g. info from
the commit-graph, what hash is contained in what ref, or how one value
(such as a .mailmap entry) maps to another, etc.