A Large Angry SCM <gitzilla@xxxxxxxxx> writes:

> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
>>> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>
>>>> So, the options are:
>>>>
>>>> a) Leave the name conversion to the export tools, and when they miss
>>>> some weird corner case, like 'Author<email', let the user face the
>>>> consequences, perhaps an hour into the process.
>>>>
>>>> We know there are sources of data that don't have git-formatted author
>>>> names, so we know every tool out there must do this checking.
>>>>
>>>> In addition to that, let the export tool decide what to do when one of
>>>> these bad names appears, which in many cases probably means do nothing,
>>>> so the user would not even see that such a bad name was there, which
>>>> might not be what they want.
>>>>
>>>> b) Do the name conversion in fast-import itself, perhaps optionally,
>>>> so if a tool missed some weird corner case, the user does not have to
>>>> face the consequences.
>>>>
>>>> The tool writers don't have to worry about this, so we would not have
>>>> tools out there doing a half-assed job of it.
>>>>
>>>> And what happens when such bad names appear ends up being consistent:
>>>> a warning, a scaffold mapping of bad names, etc.
>>>>
>>>> One is bad for the users and the tool writers (only disadvantages);
>>>> the other is good for the users and the tool writers (only
>>>> advantages).
>>>
>>> c) Do the name conversion, and whatever other cleanup and manipulations
>>> you're interested in, in a filter between the exporter and git-fast-import.
>>
>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>
> Really?
>
> The fast-import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines, passing through everything
> else.
>
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

More importantly, which do users prefer: quickly producing an
incorrect result, or spending some more time to get it right?

Because the exporting tool has much more intimate knowledge of how
the names are represented in the history of the original SCM,
canonicalization of the names, if done at that point, is likely to
give us more useful results than a canonicalization done at the
beginning of the importer, which lacks SCM-specific details. In that
sense, (a) is preferable to (b).

On the other hand, we would want consistency across the converted
results no matter what SCM the history was originally in. E.g. a name
without an email that came from CVS or SVN should consistently become
"name <noname@noname>" or "name <name>" or whatever, and making each
exporting tool responsible for the canonicalization will lead them to
create their own garbage. In that sense, (b) can be better than (a).

I think (c) implements the worst of both choices. It cannot exploit
knowledge specific to the original SCM the way (a) would, and while it
can enforce consistency the same way (b) would, it would be a separate
program, unlike (b). So...
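
For concreteness, a rough, untested sketch of the pass-through filter
described above might look like the following. Everything in it is
illustrative, not something from any existing tool: it assumes the
default "raw" date format, picks an arbitrary "name <name>" fallback
policy (one of the options mentioned above), replaces stray angle
brackets with "?", and ignores 'tagger' lines in tag commands, which
would need the same treatment.

#!/usr/bin/env python3
# Hypothetical filter sitting between a frontend and git fast-import,
# repairing malformed author/committer idents in commit commands and
# passing everything else through untouched.
import re
import sys

def fix_ident(line):
    """Canonicalize an b'author ...' or b'committer ...' header line."""
    kind, _, rest = line.rstrip(b"\n").partition(b" ")
    # Peel off the trailing "<seconds> <tz>" (the default "raw" date
    # format; other date-format options are not handled in this sketch).
    when = b""
    m = re.search(rb"\s(\d+ [+-]\d{4})$", rest)
    if m:
        when = b" " + m.group(1)
        rest = rest[: m.start()]
    m = re.match(rb"^(.*?)\s*<(.*)>$", rest)
    if m and not re.search(rb"[<>]", m.group(1) + m.group(2)):
        name, email = m.group(1), m.group(2)   # already well-formed
    else:
        # Malformed, e.g. "Author<email" or a bare "name" from CVS/SVN:
        # neutralize stray brackets, then fall back to "name <name>".
        name = re.sub(rb"[<>]", b"?", rest).strip()
        email = name
    head = kind + b" " + name if name else kind
    return head + b" <" + email + b">" + when + b"\n"

def main():
    inp, out = sys.stdin.buffer, sys.stdout.buffer
    in_commit = False
    while True:
        line = inp.readline()
        if not line:
            break
        if line.startswith(b"commit "):
            in_commit = True
        elif re.match(rb"(blob|tag |reset |checkpoint|progress|done)", line):
            in_commit = False            # some other command begins
        if in_commit and line.startswith((b"author ", b"committer ")):
            line = fix_ident(line)
        out.write(line)
        # "data" introduces an opaque payload (commit message or blob
        # contents); copy it verbatim so a message line that happens to
        # start with "author " is never rewritten by mistake.
        if line.startswith(b"data "):
            arg = line[5:].rstrip(b"\n")
            if arg.startswith(b"<<"):               # delimited format
                delim = arg[2:] + b"\n"
                while True:
                    payload = inp.readline()
                    out.write(payload)
                    if not payload or payload == delim:
                        break
            else:                                   # exact byte count
                out.write(inp.read(int(arg)))

if __name__ == "__main__":
    main()

Such a filter would be spliced into the pipeline as, say,
"frontend | ./fix-idents.py | git fast-import" (the script name is
made up). Note that it cannot get away with naive line-by-line
processing: the 'data' payloads have to be copied opaquely, which is
most of the complexity above.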