On Sun, 15 Jul 2007, Michael Haggerty wrote:
>
> 1. Is it a problem to create blobs that are never referenced?  The
> easiest point to create blobs is when the RCS files are originally
> parsed, but later we discard some CVS revisions, meaning that the
> corresponding blobs would never be needed.  Would this be a problem?

No, don't worry about it. The resulting intermediate pack-file may be
unnecessarily big, but you'd want to do a "git gc" to re-pack everything
afterwards *anyway*, since the pack-files git-fast-import generates are
generally not all that optimal, and that will also prune any
unreferenced blobs.

> 2. It appears that author/committer require an email address.  How
> important is a valid email address here?

Git itself doesn't really care, and many CVS conversions have just
converted the username into "user <user>", but from a QoI standpoint
it's much nicer if you at least were to allow the kind of conversion
that allows a user name to be associated with an email.

Maybe git-fast-import could be taught to do the kind of user name
conversion that we already do for CVS imports.. Shawn?

> a. CVS commits include a username but not an email address.  If an
> email address is really required, then I suppose the person doing the
> conversion would have to supply a lookup table mapping username -> email
> address.

That would be optimal. Note that it's not just user names: it's much
nicer if you can regenerate a readable full name too, so instead of
having something like "torvalds <torvalds>", you could map "torvalds"
into "Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>", which is a lot
more readable.

But as far as git is concerned, this is all about being _pretty_, it
doesn't really have any semantic meaning!

Anyway, git-cvsimport knows about a magic file ("CVSROOT/users") that
can map user names into full names and emails.
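
A minimal sketch of the kind of lookup table discussed above, in Python.
The AUTHORS table, its single entry and the git_ident() helper are
made up for illustration; they are not part of git-cvsimport or
git-fast-import. Unknown users fall back to the "user <user>" form.

```python
# Hypothetical lookup table: CVS username -> (full name, email).
# The entry below is invented for illustration only.
AUTHORS = {
    "jdoe": ("Jane Doe", "jdoe@example.org"),
}


def git_ident(cvs_user):
    """Return a 'Name <email>' string for a CVS username.

    Users missing from the table fall back to 'user <user>', which
    git accepts but which is less readable in the logs.
    """
    name, email = AUTHORS.get(cvs_user, (cvs_user, cvs_user))
    return f"{name} <{email}>"
```

The fallback keeps a conversion going when the table is incomplete,
since the identity is informational only.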
Having something equivalent for an SVN import would be nice
(git-svnimport does the same thing, and uses ".git/svn-authors" as the
default source of author name conversion data).

> b. CVS tag/branch creation events do not even include a username.
> Any suggestions for what to use here?

Git tags and branch creation don't do that either (unless you use
signed tags): only when you create the first commit on a branch does
the user matter.

But if there really is data that doesn't have any user information at
all (for real *changes*), then I'd just make one up. Again, the user
information really doesn't have any *semantics* in git, it's just meant
to be informational for showing the logs. It's nothing more than a
structured part of the commit (or tag) message.

> 3. I expect we should set 'committer' to the value determined from CVS
> and leave 'author' unused.  But I suppose another possibility would be
> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
> author.  Which one makes sense?

Just make them be the same. Git-fast-import will default to that, if
you only give a committer date/name.

That's what git itself does if you just do a "git commit": the
committer will be the same as the author.

> 4. It appears that a commit can only have a single 'from'

No, commits can have an arbitrary number of parents, and if you create
a tag where the data comes from several sources, you could literally do
that as a really strange merge, and that would probably be the most
"correct" thing to do, even if it might end up looking *really* odd.

[ To be strictly technically correct, I have to admit that I think we
limit the number of parents to 16, but that's not a fundamental limit,
that's just because nobody has ever been so crazy as to need more than
that.
However, there is no "data structure limit" in that number, it's just
an arbitrary "you'd be crazy to generate a merge of that many parents"
kind of thing, and we could lift the limit if you actually think it's
worth it. I think the most we have ever seen in practice is a merge of
12 parents, and the people who did that were told to please not do it
again, because it really does make the graph look extremely "cool". ]

> What would be the most git-like way to handle this situation?  Should
> the branch be created in one commit, then have files from other sources
> added to it in other commits?  Or should (is this even possible?) all
> files be added to the branch in a single commit, using multiple "merge"
> sources?

Using multiple parents and just generating a single commit (it will be
called a "merge", but really, in git terms a commit is just a commit,
and the difference in number of parents is really not a _technical_
difference, it's just a difference for how these things get
visualized).

It would be extremely interesting to see how this works in practice,
but I _think_ it would work really well. The possible downsides might
be:

 - it *may* just end up looking so confusing that people would prefer
   some alternate model.

 - we might have some performance issues with lots and lots of parents,
   and maybe we'd need to fix something. In particular, I can well
   imagine that showing the diff for the end result would be
   "interesting" (read: "totally useless")

> 5. Is there any significance at all to the order that commits are output
> to git-fast-import?  Obviously, blobs have to be defined before they are
> used, and '<committish>'s have to be defined before they are referenced.
> But is there any other significance to the order of commits?

Not afaik. Git internally very fundamentally simply doesn't care (there
simply _is_ no object ordering, there are just objects that point to
other objects), and I don't think git-fast-import could possibly care
either.
You do need to be "topologically" sorted (since you cannot even point
to commits without having their SHA1's), but that should be it.

		Linus
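
As an illustration of points 3 to 5 above, here is a rough Python sketch
that formats git-fast-import "commit" commands with only a committer
line, puts the first parent on a "from" line and further parents on
"merge" lines, and emits commits parents-first. The helper names, the
refs, the mark numbers, the identity and the timestamp are all invented
for the example; file changes are left out to keep it short.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def commit_record(ref, mark, committer, when, message, parents):
    """Format one fast-import 'commit' command as text.

    Only 'committer' is written, so the author defaults to it.
    The first parent goes on a 'from' line, the rest on 'merge' lines.
    """
    body = message.encode("utf-8")
    lines = [
        f"commit {ref}",
        f"mark :{mark}",
        f"committer {committer} {when} +0000",
        f"data {len(body)}",  # byte count of the commit message
        message,
    ]
    if parents:
        lines.append(f"from :{parents[0]}")
        lines.extend(f"merge :{p}" for p in parents[1:])
    return "\n".join(lines) + "\n\n"


def stream(parents_by_mark, ref_by_mark, committer, when):
    """Emit every commit after all of its parents (topological order)."""
    order = TopologicalSorter(parents_by_mark).static_order()
    return "".join(
        commit_record(ref_by_mark[m], m, committer, when,
                      f"commit {m}", parents_by_mark[m])
        for m in order
    )


# Hypothetical history: marks 1 and 2 are roots, mark 3 merges both.
PARENTS = {1: [], 2: [], 3: [1, 2]}
REFS = {1: "refs/heads/a", 2: "refs/heads/b", 3: "refs/heads/merged"}

out = stream(PARENTS, REFS, "Jane Doe <jdoe@example.org>", 1184500000)
```

Printing `out` gives text in the shape git-fast-import reads on stdin;
a real converter would also emit blobs and file-change lines.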