On 8/7/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to
> > modify it without rewriting everything. I may just leave it all
> > alone and build a table with cvs_file:rev to sha-1 mappings. It
> > would be much more efficient to carry sha-1 throughout the stages
> > but that may require significant rework.
>
> Does it matter? How long does the cvs2svn processing take, excluding
> the GIT blob processing that's now known to take 2 hours? What's your
> target for an acceptable conversion time on the system you are
> working on?
As is, it takes the code about a week to import MozCVS into Subversion, but I've already addressed the core of why it was so slow. The original code forks off a copy of cvs for each revision to extract the text; doing that 1M times takes about two days. The version with fast-import takes two hours. At the end of the process cvs2svn forks off svn 250K times to import the change sets, which takes another four days to finish. Doing a fast-import backend should fix that.
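The win is almost entirely fork/exec overhead: at 1M revisions, even a fraction of a second of process startup per file is days of wall time, while one long-lived process fed over a pipe pays that cost exactly once. Roughly, the two shapes look like this in (modern) Python; store_blob, read_revision, and the length-prefixed framing are invented here for illustration, not the real cvs2svn or fast-import interfaces:

    import struct, subprocess

    def extract_with_forks(revisions):
        # old way: one cvs fork/exec per file revision -- even ~0.17s
        # of overhead per call is ~2 days over 1M revisions
        for path, rev in revisions:
            text = subprocess.check_output(
                ["cvs", "-Q", "co", "-p", "-r", rev, path])
            store_blob(text)                # hypothetical blob writer

    def extract_with_pipe(revisions):
        # new way: a single fast-import process, every blob streamed
        # down one pipe, so the fork/exec cost is paid only once
        fi = subprocess.Popen(["fast-import"], stdin=subprocess.PIPE)
        for path, rev in revisions:
            data = read_revision(path, rev) # hypothetical RCS reader
            fi.stdin.write(struct.pack(">I", len(data)) + data)
        fi.stdin.close()
        fi.wait()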
> Any thoughts yet on how you might want to feed trees and commits to
> a fast pack writer? I was thinking about doing a stream into
> fast-import such as:
The data I have generates an output that indicates add/change/delete for each file name. Add/change should have an associated sha-1 for the new revision. cvs/svn have no concept of trees. How about sending out a stream of add/change/delete operations interspersed with commits? That would let fast-import track the tree and only generate tree nodes when they change. The protocol may need some thought. I need to be able to handle branches and labels too.
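Something like the following on the producing side; the op records, the changeset objects, and the textual syntax are all invented here just to illustrate the shape of the stream:

    def emit_stream(changesets, out):
        # one add/change/delete record per file name, then a commit
        # record that closes the changeset; fast-import would fold the
        # ops into its in-memory tree and emit new tree objects only
        # for the directories that actually changed
        for cs in changesets:
            for op in cs.ops:
                if op.kind in ("add", "change"):
                    # add/change carry the sha-1 of the new revision
                    out.write("%s %s %s\n" % (op.kind, op.sha1, op.path))
                else:
                    out.write("delete %s\n" % op.path)
            out.write("commit\n%s\n" % cs.commit_body)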
>   <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line,
> and <treeent> is:
>
>   <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values.
>
> You would feed all tree entries and the pack writer would split the
> stream up into the individual tree objects. fast-import would
> generate the tree(s), delta'ing them against the prior tree of the
> same path, prefix "tree nnn\n" to the commit blob you supplied,
> generate the commit, and print out its ID.
>
> By working from the first commit up to the most recent, each tree
> delta would use the older tree as its base, which may not be ideal
> if a large number of items get added to a tree, but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would however mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep the
> next commit's parent(s), as you can't produce that in Python.
>
> --
> Shawn.
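The producing side of that record is simple enough. A minimal sketch, assuming a big-endian length, bytes in and out, and the commit id coming back as one hex line per record (the framing details are assumptions; the actual prototype may differ):

    import struct

    def send_commit(fi_in, fi_out, commit_body, entries):
        # <4 byte length of commit><commit>
        fi_in.write(struct.pack(">I", len(commit_body)))
        fi_in.write(commit_body)      # raw commit minus "tree nnn\n"
        # <treeent>*: <type><sp><sha1><sp><path><null>
        for type_, sha1, path in entries:   # all bytes; type_ is B/L/X
            fi_in.write(type_ + b" " + sha1 + b" " + path + b"\0")
        fi_in.write(b"\0")            # trailing <null> ends the record
        fi_in.flush()
        # this is the "monitor the output pipe" part: the commit id has
        # to come back from fast-import before we can build the next
        # commit's parent line
        return fi_out.readline().strip()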
--
Jon Smirl
jonsmirl@xxxxxxxxx