A Large Angry SCM <gitzilla@xxxxxxxxx> writes:

> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
>>> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>
>>>> So, the options are:
>>>>
>>>> a) Leave the name conversion to the export tools, and when they miss
>>>> some weird corner case, like 'Author<email', let the user face the
>>>> consequences, perhaps an hour into the process.
>>>>
>>>> We know there are sources of data that don't have git-formatted author
>>>> names, so we know every tool out there must do this checking.
>>>>
>>>> In addition to that, let the export tool decide what to do when one of
>>>> these bad names appears, which in many cases probably means do nothing,
>>>> so the user would not even see that such a bad name was there, which
>>>> might not be what they want.
>>>>
>>>> b) Do the name conversion in fast-import itself, perhaps optionally,
>>>> so if a tool missed some weird corner case, the user does not have to
>>>> face the consequences.
>>>>
>>>> The tool writers don't have to worry about this, so we would not have
>>>> tools out there doing a half-assed job of it.
>>>>
>>>> And what happens when such bad names appear ends up being consistent:
>>>> a warning, a scaffold mapping of bad names, etc.
>>>>
>>>> One is bad for the users and the tool writers (only disadvantages);
>>>> the other is good for the users and the tool writers (only
>>>> advantages).
>>>
>>> c) Do the name conversion, and whatever other cleanup and manipulations
>>> you're interested in, in a filter between the exporter and git-fast-import.
>>
>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>
> Really?
>
> The fast-import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines, passing through everything
> else.
>
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

More importantly, which do users prefer: quickly producing an
incorrect result, or spending some more time to get it right?

Because the exporting tool has much more intimate knowledge of how
the names are represented in the history of the original SCM,
canonicalization of the names, if done at that point, is likely to
give us more useful results than a canonicalization done at the
beginning of the importer, which lacks SCM-specific details. In that
sense, (a) is preferable to (b).

On the other hand, we would want consistency across the converted
results no matter what SCM the history was originally in. E.g. a name
without an email that came from CVS or SVN should consistently become
"name <noname@noname>" or "name <name>" or whatever, and making each
exporting tool responsible for the canonicalization will lead them to
create their own garbage. In that sense, (b) can be better than (a).

I think (c) implements the worst of both choices. It cannot exploit
knowledge specific to the original SCM the way (a) would, and while it
can enforce consistency the same way (b) would, it would be a separate
program, unlike (b). So...
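
For concreteness, a rough, untested sketch of the pass-through filter
described above might look like the following. Everything in it is
illustrative, not something from any existing tool: it assumes the
default "raw" date format, picks an arbitrary "name <name>" fallback
policy (one of the options mentioned above), replaces stray angle
brackets with "?", and ignores 'tagger' lines in tag commands, which
would need the same treatment.

#!/usr/bin/env python3
# Hypothetical filter sitting between a frontend and git fast-import,
# repairing malformed author/committer idents in commit commands and
# passing everything else through untouched.
import re
import sys

def fix_ident(line):
    """Canonicalize an b'author ...' or b'committer ...' header line."""
    kind, _, rest = line.rstrip(b"\n").partition(b" ")
    # Peel off the trailing "<seconds> <tz>" (the default "raw" date
    # format; other date-format options are not handled in this sketch).
    when = b""
    m = re.search(rb"\s(\d+ [+-]\d{4})$", rest)
    if m:
        when = b" " + m.group(1)
        rest = rest[: m.start()]
    m = re.match(rb"^(.*?)\s*<(.*)>$", rest)
    if m and not re.search(rb"[<>]", m.group(1) + m.group(2)):
        name, email = m.group(1), m.group(2)   # already well-formed
    else:
        # Malformed, e.g. "Author<email" or a bare "name" from CVS/SVN:
        # neutralize stray brackets, then fall back to "name <name>".
        name = re.sub(rb"[<>]", b"?", rest).strip()
        email = name
    head = kind + b" " + name if name else kind
    return head + b" <" + email + b">" + when + b"\n"

def main():
    inp, out = sys.stdin.buffer, sys.stdout.buffer
    in_commit = False
    while True:
        line = inp.readline()
        if not line:
            break
        if line.startswith(b"commit "):
            in_commit = True
        elif re.match(rb"(blob|tag |reset |checkpoint|progress|done)", line):
            in_commit = False            # some other command begins
        if in_commit and line.startswith((b"author ", b"committer ")):
            line = fix_ident(line)
        out.write(line)
        # "data" introduces an opaque payload (commit message or blob
        # contents); copy it verbatim so a message line that happens to
        # start with "author " is never rewritten by mistake.
        if line.startswith(b"data "):
            arg = line[5:].rstrip(b"\n")
            if arg.startswith(b"<<"):               # delimited format
                delim = arg[2:] + b"\n"
                while True:
                    payload = inp.readline()
                    out.write(payload)
                    if not payload or payload == delim:
                        break
            else:                                   # exact byte count
                out.write(inp.read(int(arg)))

if __name__ == "__main__":
    main()

Such a filter would be spliced into the pipeline as, say,
"frontend | ./fix-idents.py | git fast-import" (the script name is
made up). Note that it cannot get away with naive line-by-line
processing: the 'data' payloads have to be copied opaquely, which is
most of the complexity above.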