Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Tue, 13 Nov 2012 19:15:59 +0100

On Tue, Nov 13, 2012 at 11:15 AM, Michael J Gruber
<git@xxxxxxxxxxxxxxxxxxxx> wrote:
> Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
>> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King <peff@xxxxxxxx> wrote:
>>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>>>
>>>>>   3. Exporters should not use it if they have any broken-down
>>>>>      representation at all. Even knowing that the first half is a human
>>>>>      name and the second half is something else would give it a better
>>>>>      shot at cleaning than fast-import would get.
>>>>
>>>> I'm not sure what you mean by this. If they have name and email, then
>>>> sure, it's easy.
>>>
>>> But not as easy as just printing it. What if you have this:
>>>
>>>   name="Peff <angle brackets> King"
>>>   email="<peff@xxxxxxxx>"
>>>
>>> Concatenating them does not produce a valid git author name. Sending the
>>> concatenation through fast-import's cleanup function would lose
>>> information (namely, the location of the boundary between name and
>>> email).
>>
>> Right. Unfortunately I'm not aware of any DSCM that does that.
>>
>>> Similarly, one might have other structured data (e.g., CVS username)
>>> where the structure is a useful hint, but some conversion to name+email
>>> is still necessary.
>>
>> CVS might be the only one that has such structured data. I think in
>> subversion the username has no meaning. A 'felipec' subversion
>> username is as bad as a mercurial 'felipec' username.
>
> In subversion, the username has the clearly defined meaning of being a
> username on the subversion host. If the host is, e.g., a sourceforge
> site then I can easily look up the user profile and convert the username
> into a valid e-mail address (<username>@users.sf.net). That is the
> advantage that the exporter (together with user knowledge) has over the
> importer.
>
> If the initial clone process aborts after every single "unknown" user
> it's no fun, of course. On the other hand, if an incremental clone
> (fetch) lets commits with unknown authors sneak in, it's no fun either
> (because I may want to fetch in crontab and publish that converted beast
> automatically). That is why I proposed neither approach.
>
> Most conveniently, the export side of a remote helper would
>
> - do "obvious" automatic lossless transformations
> - use an author map for other names

This should be done by fast-import. It doesn't make any sense for
every remote helper and fast-exporter out there to have its own way of
mapping authors (or none).

> - For names not covered by the above (or having an empty map entry):
> Stop exporting commits but continue parsing commits and amend the author
> map with any unknown usernames (empty entry), and warn the user.
> (crontab script can notify me based on the return code.)

Stop exporting commits but continue parsing commits? I don't know what
that means.

fast-import should try its best to clean it up and warn the user, sure,
but also store the missing entry in a file, so that it can be filled in
later (if the user so wishes).
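
To make "try its best to clean it up" concrete, here is a rough sketch
in Python. clean_ident() is a hypothetical helper, not what fast-import
does today, and both the '<unknown>' placeholder and the reuse of the
local part of a bare address as the name are my assumptions. The second
return value says whether a guess was needed, which is when the user
should be warned.

```python
import re

# Hypothetical helper for illustration; not fast-import's actual code.
IDENT_RE = re.compile(r'^([^<>]*)<([^<>]*)>')

def clean_ident(raw):
    """Return (ident, guessed): a 'Name <email>' string, and whether
    any part of it had to be guessed."""
    raw = raw.strip()
    m = IDENT_RE.match(raw)
    if m:
        # Already in 'Name <email>' form; only normalize whitespace.
        name, email, guessed = m.group(1), m.group(2), False
    elif '@' in raw and ' ' not in raw:
        # Bare address: reuse the local part as the name (assumption).
        name, email, guessed = raw.split('@', 1)[0], raw, True
    else:
        # Bare username such as 'felipec': no email is known.
        name, email, guessed = raw, 'unknown', True
    # Angle brackets and stray whitespace are not allowed to survive.
    name = re.sub(r'[<>]', '', name).strip()
    email = re.sub(r'[<>\s]', '', email)
    return '%s <%s>' % (name, email), guessed
```

With this, 'felipec' becomes 'felipec <unknown>' and is flagged as a
guess, while an ident that is already well-formed passes through
unflagged.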

> If the cloning involves a "foreign clone" (like the hg clone behind the
> scene) then the runtime of the second pass should be much smaller. In
> principle, one could even store all blobs and trees on the first run and
> skip that step on the second, but that would rely on immutability on the
> foreign side, so I dunno. (And to check the sha1, we have to get the
> blob anyways.)

No. There's no concept of partial clones... Either you clone, or you don't.

What if the remote helper didn't notice that the author was bad?
fast-import could just leave everything up to that point, warn
about what happened, and exit, but the exporter side would die in the
middle of exporting, and it might end up in a bad state, not saving
marks, or who knows what.

It wouldn't work.

The cloning should be full, and the bad authors stored in a scaffold author map.
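
A minimal sketch of such a scaffold author map, in Python. The
'original = Name <email>' file format and all function names here are
hypothetical, and I'm assuming the original name contains no '='. The
import never stops: an unknown author gets a fallback ident, plus an
empty entry in the map for the user to fill in later.

```python
def load_author_map(lines):
    """Parse 'original = Name <email>' lines (hypothetical format).
    An empty right-hand side marks an entry nobody has filled in yet."""
    amap = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        amap[key.strip()] = value.strip()
    return amap

def map_author(raw, amap, fallback):
    """Return the ident to use for raw. Unknown authors are recorded
    in amap with an empty value and passed through fallback()."""
    mapped = amap.get(raw)
    if mapped:
        return mapped
    amap.setdefault(raw, '')  # scaffold entry, to be filled in later
    return fallback(raw)

def dump_author_map(amap):
    """Serialize the map back to lines, unresolved entries included."""
    return [('%s = %s' % (k, v)).rstrip() for k, v in sorted(amap.items())]
```

After a full clone, the dumped file lists every author that needed a
fallback as a line like 'peff =', ready to be edited.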

> As for the format for incomplete entries (foo <some@where>), a technical
> guideline should suffice for those that follow guidelines.

fast-import should do that.

Cheers.

-- 
Felipe Contreras