On 6/5/06, Alec Warner <antarus@xxxxxxxxxx> wrote:
Ok the box this was running on had issues, so I switched to using pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of ram and plenty of disk. The "problem" now is just conversion time...30 hours and I'm into 2004-09-17...but it's been in 2004 all day, seems like most of the commits are in the last three years. Are there architectural issues with doing this in parallel?
I don't think you can do this in parallel. What I would do is remove the -a from the git-repack invocation. The -a hurts import times quite a bit -- just do a git-repack -a -d when the import is done. And... having said that, there is still a memory leak somehow, somewhere. It's been evading me for 2 weeks now, so I feel like an idiot. It's not too bad in general, but it shows clearly in the gentoo and mozilla imports.
Since the repository commits are all in cvs, it should be possible to do the work in parallel, since you know what all the commits touch. The concern would be ordering of nodes in the tree; you'd end up building a bunch of subtrees and patching them together?
Well... parsecvs does a bit of this but in sequential fashion... it imports all the files first, and then runs through the history building the tree+commits in order, committing them. It saves a lot of time in the file imports by parsing the RCS file directly. The downside is that it must keep a filename+version=>sha1 mapping -- which I think is why parsecvs won't fit in memory until it's changed to store it on disk somehow ;-) You are forced to do it in a sequence because cvsps only tells you about the files added/removed/changed in a commit -- you need the ancestor to have a view of what the whole tree looked like. The only room for parallelism I see is to fork off new processes to work on branches in parallel.

martin
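
To make the ordering constraint concrete, here is a rough Python sketch of the two ideas above: a filename+version=>sha1 mapping kept on disk rather than in memory, and a sequential replay in which each commit only lists the files it touched. This is not how parsecvs or git-cvsimport is actually written. BlobIndex, replay and the patchset layout are hypothetical names made up for illustration, and sqlite3 is just one assumed way of putting the mapping on disk.

```python
import sqlite3


class BlobIndex:
    """Hypothetical on-disk (path, revision) -> sha1 mapping.

    Assumption: sqlite3 stands in for "store it on disk somehow".
    """

    def __init__(self, db_path=":memory:"):
        # Pass a real file name to keep the mapping out of RAM.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS blobs ("
            "path TEXT, rev TEXT, sha1 TEXT, PRIMARY KEY (path, rev))"
        )

    def put(self, path, rev, sha1):
        # "with" commits the transaction so the row reaches the disk.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
                (path, rev, sha1),
            )

    def get(self, path, rev):
        row = self.db.execute(
            "SELECT sha1 FROM blobs WHERE path = ? AND rev = ?",
            (path, rev),
        ).fetchone()
        return row[0] if row else None


def replay(patchsets, index):
    """Rebuild the full tree for every commit, strictly in order.

    Each patchset is a list of (path, rev) pairs for the files it
    touched; rev is None when the file was removed. Because only the
    changes are listed, the tree state has to be carried over from
    the ancestor commit.
    """
    tree = {}
    snapshots = []
    for patchset in patchsets:
        for path, rev in patchset:
            if rev is None:
                tree.pop(path, None)
            else:
                tree[path] = index.get(path, rev)
        snapshots.append(dict(tree))
    return snapshots
```

Each snapshot depends on the one before it, which is why a single line of history cannot simply be split across workers. Separate branches could each run their own replay starting from the tree at their branch point.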