Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

Jeff King <peff@xxxxxxxx> · Sun, 11 Nov 2012 13:14:06 -0500

On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:

> > If there is a standard filter, then what is the advantage in doing it as
> > a pipe? Why not just teach fast-import the same trick (and possibly make
> > it optional)? That would be simpler, more efficient, and it would make
> > it easier for remote helpers to turn it on (they use a command-line
> > switch rather than setting up an extra process).
> 
> Right, but instead of a command-line switch it probably should be
> enabled on the stream:
> 
>   feature clean-authors
> 
> Or something.

Yeah, I was thinking it would need a feature switch to the remote helper
to turn on the command-line option, but I forgot that fast-import can take
feature lines directly.
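
To make that concrete, here is a rough sketch of an exporter putting
feature lines at the top of its stream. The helper name stream_header is
made up for illustration, and "clean-authors" is only the name floated
above; no such feature exists in fast-import today.

```python
def stream_header(features):
    """Return the feature lines an exporter would emit before any commands.

    Hypothetical helper for illustration; fast-import aborts on a feature
    name it does not recognize.
    """
    return "".join(f"feature {name}\n" for name in features)


# "clean-authors" is the name proposed in this thread, not an existing feature.
header = stream_header(["clean-authors"])
```

The exporter would write this header to fast-import's stdin ahead of its
first commit, so no extra command-line plumbing is needed.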

> > We can clean up and normalize
> > things like whitespace (and we probably should if we do not do so
> > already). But beyond that, we have no context about the name; only the
> > exporter has that.
> 
> There is no context.

There may not be a lot, but there is some:

> These are exactly the same questions every exporter must answer. And
> there's no answer, because the field is not a git author, it's a
> mercurial user, or a bazaar committer, or who knows what.

The exporter knows that the field is a mercurial user (or whatever).
Fast-import does not even know that, and cannot apply any rules or
heuristics about the format of a mercurial user string, what is common
in the mercurial world, etc. It may not be a lot of context in some
cases (I do not know anything about mercurial's formats, so I can't say
what knowledge is available). But at least the exporter has a chance at
domain-specific interpretation of the string. Fast-import has no chance,
because it does not know the domain.

I've snipped the rest of your argument, which is basically that
mercurial does not have any context at all, and knowing that it is a
mercurial author is useless.  I am not sure that is true; even knowing
that it is a free-form field versus something structured (e.g., we know
CVS authors are usernames on the server) is useful.

But I would agree there are probably multiple systems that are like
mercurial in that the author field is usually something like "name
<email>", but may be arbitrary text (I assume bzr is the same way, but
you would know better than me).  So it may make sense to have some stock
algorithm to try to convert arbitrary almost-name-and-email text into
name and email to reduce duplication between exporters, but:

  1. It must be turned on explicitly by the exporter, since we do not
     want to munge more structured input from clueful exporters.

  2. The exporter should only turn it on after replacing its own munging
     (e.g., it shouldn't be adding junk like <none@none>; fast-import
     would need to receive as pristine an input as possible).

  3. Exporters should not use it if they have any broken-down
     representation at all. Even knowing that the first half is a human
     name and the second half is something else would give it a better
     shot at cleaning than fast-import would get.

     Alternatively, the feature could enable the exporter to pass a more
     structured ident to git.
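
As a very rough sketch of what such a stock algorithm might look like
(the function name split_ident and the exact heuristics are assumptions
for illustration, not anything fast-import does today):

```python
import re

# Hypothetical pattern: optional name text, then "<email", with the closing
# ">" allowed to be missing.
_ANGLE = re.compile(r"^(?P<name>[^<>]*)<(?P<email>[^<>]*)>?")


def split_ident(raw):
    """Best-effort split of free-form author text into (name, email)."""
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(raw.split())
    m = _ANGLE.match(text)
    if m:
        name, email = m.group("name"), m.group("email")
    elif "@" in text and " " not in text:
        # A lone token containing "@" is assumed to be an email address.
        name, email = "", text
    else:
        # Anything else is assumed to be a name with no email.
        name, email = text, ""
    # Drop angle brackets from the name and angle brackets or whitespace
    # from the email, so neither can break the "name <email>" framing.
    name = re.sub(r"[<>]", "", name).strip()
    email = re.sub(r"[<>\s]", "", email)
    return name, email
```

Note that it only ever sees the raw string, which is exactly the
limitation described above: it cannot tell a username from a real name,
so an exporter with better information should not rely on it.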

-Peff