On 8/7/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to
> > modify it without rewriting everything. I may just leave it all
> > alone and build a table with cvs_file:rev to sha-1 mappings. It
> > would be much more efficient to carry sha-1 throughout the stages
> > but that may require significant rework.
>
> Does it matter? How long does the cvs2svn processing take, excluding
> the GIT blob processing that's now known to take 2 hours? What's your
> target for an acceptable conversion time on the system you are
> working on?
As is, it takes the code about a week to import MozCVS into Subversion, but I've already addressed the core of why it was so slow. The original code forks off a copy of cvs for each revision to extract the text; doing that 1M times takes about two days. The version with fast-import takes two hours. At the end of the process cvs2svn forks off svn 250K times to import the change sets, which takes another four days to finish. Doing a fast-import backend should fix that.
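The win is almost entirely fork/exec overhead: at 1M revisions, even a fraction of a second of process startup per file is days of wall time, while one long-lived process fed over a pipe pays that cost exactly once. Roughly, the two shapes look like this in (modern) Python; store_blob, read_revision, and the length-prefixed framing are invented here for illustration, not the real cvs2svn or fast-import interfaces:

    import struct, subprocess

    def extract_with_forks(revisions):
        # old way: one cvs fork/exec per file revision -- even ~0.17s
        # of overhead per call is ~2 days over 1M revisions
        for path, rev in revisions:
            text = subprocess.check_output(
                ["cvs", "-Q", "co", "-p", "-r", rev, path])
            store_blob(text)                # hypothetical blob writer

    def extract_with_pipe(revisions):
        # new way: a single fast-import process, every blob streamed
        # down one pipe, so the fork/exec cost is paid only once
        fi = subprocess.Popen(["fast-import"], stdin=subprocess.PIPE)
        for path, rev in revisions:
            data = read_revision(path, rev) # hypothetical RCS reader
            fi.stdin.write(struct.pack(">I", len(data)) + data)
        fi.stdin.close()
        fi.wait()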
> Any thoughts yet on how you might want to feed trees and commits to
> a fast pack writer? I was thinking about doing a stream into
> fast-import such as:
The data I have generates an output that indicates add/change/delete for each file name. Add/change should have an associated sha-1 for the new revision. cvs/svn have no concept of trees. How about sending out a stream of add/change/delete operations interspersed with commits? That would let fast-import track the tree and only generate tree nodes when they change. The protocol may need some thought. I need to be able to handle branches and labels too.
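Something like the following on the producing side; the op records, the changeset objects, and the textual syntax are all invented here just to illustrate the shape of the stream:

    def emit_stream(changesets, out):
        # one add/change/delete record per file name, then a commit
        # record that closes the changeset; fast-import would fold the
        # ops into its in-memory tree and emit new tree objects only
        # for the directories that actually changed
        for cs in changesets:
            for op in cs.ops:
                if op.kind in ("add", "change"):
                    # add/change carry the sha-1 of the new revision
                    out.write("%s %s %s\n" % (op.kind, op.sha1, op.path))
                else:
                    out.write("delete %s\n" % op.path)
            out.write("commit\n%s\n" % cs.commit_body)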
>   <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line,
> and <treeent> is:
>
>   <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values.
>
> You would feed all tree entries and the pack writer would split the
> stream up into the individual tree objects. fast-import would
> generate the tree(s), delta'ing them against the prior tree of the
> same path, prefix "tree nnn\n" to the commit blob you supplied,
> generate the commit, and print out its ID.
>
> By working from the first commit up to the most recent, each tree
> delta would use the older tree as its base, which may not be ideal
> if a large number of items get added to a tree, but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would however mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep the
> next commit's parent(s), as you can't produce that in Python.
>
> --
> Shawn.
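The producing side of that record is simple enough. A minimal sketch, assuming a big-endian length, bytes in and out, and the commit id coming back as one hex line per record (the framing details are assumptions; the actual prototype may differ):

    import struct

    def send_commit(fi_in, fi_out, commit_body, entries):
        # <4 byte length of commit><commit>
        fi_in.write(struct.pack(">I", len(commit_body)))
        fi_in.write(commit_body)      # raw commit minus "tree nnn\n"
        # <treeent>*: <type><sp><sha1><sp><path><null>
        for type_, sha1, path in entries:   # all bytes; type_ is B/L/X
            fi_in.write(type_ + b" " + sha1 + b" " + path + b"\0")
        fi_in.write(b"\0")            # trailing <null> ends the record
        fi_in.flush()
        # this is the "monitor the output pipe" part: the commit id has
        # to come back from fast-import before we can build the next
        # commit's parent line
        return fi_out.readline().strip()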
--
Jon Smirl
jonsmirl@xxxxxxxxx