Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Sun, 11 Nov 2012 19:48:14 +0100

On Sun, Nov 11, 2012 at 7:14 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:
>
>> > If there is a standard filter, then what is the advantage in doing it as
>> > a pipe? Why not just teach fast-import the same trick (and possibly make
>> > it optional)? That would be simpler, more efficient, and it would make
>> > it easier for remote helpers to turn it on (they use a command-line
>> > switch rather than setting up an extra process).
>>
>> Right, but instead of a command-line switch it probably should be
>> enabled on the stream:
>>
>>   feature clean-authors
>>
>> Or something.
>
> Yeah, I was thinking it would need a feature switch to the remote helper
> to turn on the command-line, but I forgot that fast-import can take
> feature lines directly.
>
>> > We can clean up and normalize
>> > things like whitespace (and we probably should if we do not do so
>> > already). But beyond that, we have no context about the name; only the
>> > exporter has that.
>>
>> There is no context.
>
> There may not be a lot, but there is some:
>
>> These are exactly the same questions every exporter must answer. And
>> there's no answer, because the field is not a git author, it's a
>> mercurial user, or a bazaar committer, or who knows what.
>
> The exporter knows that the field is a mercurial user (or whatever).
> Fast-import does not even know that, and cannot apply any rules or
> heuristics about the format of a mercurial user string, what is common
> in the mercurial world, etc. It may not be a lot of context in some
> cases (I do not know anything about mercurial's formats, so I can't say
> what knowledge is available). But at least the exporter has a chance at
> domain-specific interpretation of the string. Fast-import has no chance,
> because it does not know the domain.
>
> I've snipped the rest of your argument, which is basically that
> mercurial does not have any context at all, and knowing that it is a
> mercurial author is useless.  I am not sure that is true; even knowing
> that it is a free-form field versus something structured (e.g., we know
> CVS authors are usernames on the server) is useful.

It is useful in the sense that we know we cannot do anything sensible
about it. All we can do is try.

> But I would agree there are probably multiple systems that are like
> mercurial in that the author field is usually something like "name
> <email>", but may be arbitrary text (I assume bzr is the same way, but
> you would know better than me).  So it may make sense to have some stock
> algorithm to try to convert arbitrary almost-name-and-email text into
> name and email to reduce duplication between exporters, but:

Yes, bazaar seems to be the same way.

% bzr log
------------------------------------------------------------
revno: 1
committer: Foo Bar<foo.bar@xxxxxxxxxxx> <none@none
branch nick: bzr
timestamp: Sun 2012-11-11 19:41:10 +0100
message:
  one

>   1. It must be turned on explicitly by the exporter, since we do not
>      want to munge more structured input from clueful exporters.

Agreed.

>   2. The exporter should only turn it on after replacing its own munging
>      (e.g., it shouldn't be adding junk like <none@none>; fast-import
>      would need to receive as pristine an input as possible).

Agreed.

>   3. Exporters should not use it if they have any broken-down
>      representation at all. Even knowing that the first half is a human
>      name and the second half is something else would give it a better
>      shot at cleaning than fast-import would get.

I'm not sure what you mean by this. If they have name and email, then
sure, it's easy.

And for the record, I have encountered this problem with monotone as
well. There are quite a lot of strategies for converting names to git
authors.
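
To make that concrete, here is a rough sketch of one such strategy in
Python. It is not what fast-import or any existing exporter does; the
function name clean_author and the "unknown" fallback are made up for
illustration. The only hard constraint it assumes is the one from the
fast-import format: neither the name nor the email may contain '<', '>'
or a newline.

```python
import re

def clean_author(raw, fallback_email="unknown"):
    """Best-effort split of a free-form author string into (name, email).

    Hypothetical helper: the heuristics below are one possible choice.
    """
    # Collapse runs of whitespace (this also removes newlines).
    raw = " ".join(raw.split())

    # Take the first <...> group as the email, and the text before it as the name.
    m = re.search(r"<([^<>]*)>", raw)
    if m:
        name, email = raw[:m.start()], m.group(1)
    elif "@" in raw and " " not in raw:
        # A bare address with no display name.
        name, email = "", raw
    else:
        # No recognizable email at all: keep the text as the name.
        name, email = raw, fallback_email

    # Strip the characters the stream format cannot carry in these fields.
    name = re.sub(r"[<>]", "", name).strip()
    email = re.sub(r"[<>\s]", "", email)

    if not name:
        # Fall back to the local part of the address, if there is one.
        name = email.split("@")[0] if email else "Unknown"
    return name, email or fallback_email
```

With this, "Foo Bar<foo.bar@example.com>" comes out as the name
"Foo Bar" and the email "foo.bar@example.com", and a plain "Foo Bar"
gets the fallback email. Whether taking the first <...> group is the
right call for a string with several of them is exactly the kind of
decision that has no obviously correct answer.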

Cheers.

-- 
Felipe Contreras