Re: Some tips for doing a CVS importer

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Mon, 27 Nov 2006 10:20:09 -0500

On 11/27/06, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
I am currently the main (and pretty much the only) cvs2svn maintainer.
Development has been proceeding more slowly lately because (1) I'm very
busy with my day job, and (2) nobody has stepped forward to help.

Jon Smirl wrote:
> #1) There needs to be a tool that can accurately import the
> repository. cvs2svn does not do this. The good programmers working on
> git could probably whip this out in a week or two if they wanted to.
> cvs2svn is very close but they refuse to solve the symbol dependency
> problem.

Jon, I wish you wouldn't portray as obstinacy what is simply a lack of
resources.  I would like very much to support other cvs2svn output
formats.  I think it would be great if other projects could benefit from
our work.  Most of the work I've been doing on cvs2svn lately has been
towards supporting other output SCMs.

cvs2svn is a nice piece of code, and a universal conversion tool is a
worthy goal.

Jon Smirl wrote:
> I gave up on my cvs2git code, cvs2svn has been refactored so badly
> that it was too much trouble tracking. It would be easier to write it
> again. Most of the smarts from the import process is in the
> git-fastimport code which Shawn has. cvs2svn underwent a major
> algorithm change after I wrote the first version of git2svn.

I hope that by "badly" you mean "extensively" and not "poorly" :-\  If
you mean "poorly", then I'd like to hear your feedback/suggestions.

Extensively. The dependency rewrite changed things so much that my
patches were basically worthless. I tried merging them and gave up; it
would be more efficient to rewrite them or build hooks in the right
places.

A large amount of refactoring has been needed to make the change to
dependency-based conversion possible, and a lot more to help support
different output formats.  I understand that this causes difficulties
for people trying to do parallel development, but most of the
refactoring was done before your first appearance on the cvs2svn mailing
lists.  If you had let us know what you were working on, I would have
avoided making conflicting changes (as I did with Oswald Buddenhagen's
commit-dependencies changes).

Jon Smirl wrote:
> I have tried all of the available CVS importers. None of them are
> without problems. If anyone is interested in writing one for git here
> are some ideas on how to structure it.
>
> 1) there is a working lex/yacc for CVS in the parsecvs source code
> 2) The first time you parse a CVS file record everything and don't
> parse it again.
> 3) When the file is first parsed use the deltas to generate the
> revisions and feed them to git-fastimport, just remember the SHA1 or
> an id in the import code. This is a critical step to getting decent
> performance.
> 4) If you do #1 and #2 you don't need to store CVS revision numbers
> and file names in memory. Because of that you can easily do a
> Mozilla import in 2GB, probably 1GB.
> 5) When comparing CVS revisions only use the CVS timestamps as a last
> resort, instead use the dependency information in the CVS file
> 6) Match up commits by using an sha1 of the author and commit message
> 7) After all files are loaded, match up the symbols and insert them
> into the dependency chains, if any of the symbols depend on a branch
> commit the symbol lies on the branch, otherwise the symbol is on the
> trunk,
> 8) Do a topological sort to build the change set commit tree
> 9) when you hit a loop in the tree break up delta change sets until
> the loop can be removed, don't break up symbol change sets.
> 10) Mozilla has some large commits that were made over dial up. Commit
> change sets can span hours. All of these commits need to be merged
> into a single change set.
> 11) An algorithm needs to be developed for detecting branches merging
> back into the trunk
> 12) cvs2svn has excellent test cases, use them to test the new
> importer. The cvs2svn code is quite nice but it doesn't handle #7

Most of this is possible now using cvs2svn, but it is not enough.

But first there is a problem with your point #9.  It is in general not
possible to avoid breaking up symbol changesets, even if you are willing
to massacre the revision changesets.  CVS allows cases like this:

We won't know how often this case occurs until more algorithms are
tried. All I know is that 60% of the Mozilla symbols end up needing
copies. In the few cases I decoded by hand, I was able to rearrange
things so that copies were not needed. It is likely that some symbols
in Mozilla will need copies to construct them; it is a question of
degree. I don't believe copies are required for 60% of the symbols.

file1:

    1.1
    1.2 ----> branch "A"
              1.2.0.1
              1.2.0.2 ----> branch "B"

file2:

    1.1
    1.2 ----> branch "B"
              1.2.0.1
              1.2.0.2 ----> branch "A"

Clearly there is no way to create symbols "A" and "B" both in a single
changeset.
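The conflict in this example can be checked mechanically. The sketch below is only an illustration under a simplifying assumption: each file contributes "symbol X sprouts from a revision on symbol Y" edges, and a symbol can be created in one changeset only if the merged graph has a topological order. The dictionaries and the function name are hypothetical and do not reflect cvs2svn's internal data structures.

```python
def creation_order(depends_on):
    """Return an order in which symbols can be created, or None if cyclic.

    depends_on maps each symbol to the set of symbols that must already
    exist before it can be created (Kahn's algorithm).
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)
    remaining = {n: set(depends_on.get(n, ())) for n in nodes}
    order = []
    ready = sorted(n for n, deps in remaining.items() if not deps)
    while ready:
        n = ready.pop(0)
        order.append(n)
        del remaining[n]
        for m in sorted(remaining):
            deps = remaining[m]
            if n in deps:
                deps.discard(n)
                if not deps:
                    ready.append(m)
    # Anything left over sits on a dependency cycle.
    return order if not remaining else None


# file1: branch "B" sprouts from a revision on branch "A".
file1 = {"B": {"A"}}
# file2: branch "A" sprouts from a revision on branch "B".
file2 = {"A": {"B"}}

merged = {"A": set(file2["A"]), "B": set(file1["B"])}

order_file1_only = creation_order(file1)
order_merged = creation_order(merged)
```

Each file on its own yields a valid order; only the merged graph has none, which is why at least one of the two symbols has to be created in more than one step.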

But even disallowing cases like the one above, it is often very
questionable whether you want to avoid breaking up symbol commits at all
costs.  For example, CVS allows

January:     file1<1.1>               file2<1.1>
February:    file1<1.1> tagged "T"
March:       file1<1.2>
November:                             file2<1.2>
December:                             file2<1.2> tagged "T"

In such a case, the only way to avoid splitting up the creation of tag
"T" would be to pretend that the commit file1<1.2> didn't occur in March
but rather in November.
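One way to picture this example is as a set of time windows, assuming (purely for illustration) that a tag on a file can only be created while the tagged revision is the newest one on its line of development. The month numbers and the function name below are hypothetical.

```python
import math


def common_window(windows):
    """Return the (start, end) interval shared by all windows, or None.

    Each window is a half-open interval [start, end) during which the
    tagged revision of one file is the current revision.
    """
    start = max(lo for lo, hi in windows)
    end = min(hi for lo, hi in windows)
    return (start, end) if start < end else None


# Months as numbers: January = 1, March = 3, November = 11.
windows = {
    "file1": (1, 3),          # 1.1 is current from January until 1.2 in March
    "file2": (11, math.inf),  # 1.2 is current from November onwards
}

shared = common_window(list(windows.values()))
```

Here `shared` is `None`: no single point in time has both tagged revisions current, so tag "T" cannot be created in one changeset without moving a commit in time.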

The bottom line is that cvs2svn should do a better job of handling
symbols, but even then the git importer will necessarily have to deal
with some unusual CVS cases.

The unusual cases can be made into branches. If I remember correctly
Mozilla has about 300 symbols with "BRANCH" in the name. But the
converted repositories are ending up with over 2,000 branches. When
you load this into the git visualization tools it is obvious that the
bowl of spaghetti caused by 2,000 branches is not a repository a human
would have created.

> Processing the symbols is integral to deciding how to build the change
> sets. Right now cvs2svn ignores the symbol dependency information and
> builds the change sets in a way that forces the mini-branches. That
> causes 60% of the 2,000 symbols in Mozilla CVS to end up as little
> branches. Look at the three commit example in the other thread to see
> exactly what the problem is.
>
> SVN hides the mini branch by creating a symbol like this:
>
> Symbol XXX, change set 70
> copy All from change set 50
> copy file A from change set 55
> copy file B,C from change set 60
> copy file D from change set 61
> copy file E,F,G from change set 63
> copy file H from change set 67
>
> It has to do all of those copies because the change sets weren't
> constructed while taking symbol dependency information into account.
>
> Symbol XXX can't copy from change set 69 because commits from after
> the symbol was created are included in change sets 51-69.

The vast majority of the mixed-source symbol creations have nothing to
do with honoring symbol dependencies, but rather with the fact that
cvs2svn is not so clever about deducing which branch should be used as
the source for a symbol (CVS often does not record this information
unambiguously).

Changes needed for git import:

The symbol dependency problem that Jon has focused on is IMO just the
least significant of three main changes that have to be made to support
git output from cvs2svn:

1. The symbol dependency problem.  Occasionally symbols are created in
an order that is inconsistent with the CVS dependency graph.  We want to
fix this in any case (even for SVN).  Work done so far: the symbol
dependency graph is already generated and recorded when the repository
is parsed, and the symbol dependencies are carried through the
conversion (though not yet used).

2. Symbols are often created using multiple branches as sources, when
they could be created from a single branch.  This happens because in
many cases CVS doesn't record unambiguously which branch was tagged, and
cvs2svn's heuristics are not especially clever.  A patch has been
submitted to fix this problem, but unfortunately it doesn't apply to
HEAD anymore.  See

http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1441

for a discussion.  (The main difficulty with picking better sources for
symbols is that the obvious approaches all require tons of intermediate
storage.)  I am currently trying to understand symbol handling in
cvs2svn well enough that I can port the patch to trunk.
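To make the ambiguity concrete: a revision at a branch point is the content of both the parent line and the new branch, so a tag on it could have come from either. The sketch below shows one simple voting heuristic for choosing a single source. It is not the algorithm from the patch referenced above and not what cvs2svn does; the input format and names are assumptions, and it holds everything in memory, so it sidesteps the intermediate-storage problem mentioned here.

```python
from collections import Counter


def pick_symbol_source(candidates_by_file):
    """Pick the branch that can serve as the source for the most files.

    candidates_by_file maps a file name to the set of branches whose
    content matches the tagged revision of that file.  Returns the
    chosen branch and the sorted list of files it cannot supply.
    """
    votes = Counter()
    for branches in candidates_by_file.values():
        votes.update(branches)
    if not votes:
        raise ValueError("no candidate source branches")
    # Most votes wins; ties are broken by name for determinism.
    best = min(votes, key=lambda b: (-votes[b], b))
    leftovers = sorted(
        f for f, branches in candidates_by_file.items() if best not in branches
    )
    return best, leftovers


# Hypothetical tag spanning three files.
candidates = {
    "a.c": {"trunk", "B1"},
    "b.c": {"B1"},
    "c.c": {"trunk", "B1"},
}
source, leftovers = pick_symbol_source(candidates)
```

With this input the tag can be built from branch `B1` alone; choosing `trunk` would have forced an extra copy for `b.c`.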

I'm happy to give new algorithms a try as they are developed.

3. The default current output format of cvs2svn is a single dump file
with file revisions in commit order.  For the distributed SCMs, it is
usually far more efficient to generate the file revisions file-by-file
(non-chronologically) during the initial parse of the CVS files, and
refer to the revisions by hash for the rest of the conversion.  In
October I added a bunch of hooks to cvs2svn to make this possible.  Work
remaining: code to reconstruct file text from CVS text + deltas,
including proper handling of line-end conventions and keyword
expansion/unexpansion, and of course the code to output the
reconstructed snapshots in a git-consumable format.
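For the "CVS text + deltas" part, RCS files store one revision in full and the others as edit scripts made of `aN M` (add M lines after line N) and `dN M` (delete M lines starting at line N) commands, with line numbers referring to the text the delta is applied to. The sketch below applies one such script to a list of lines. It is a minimal illustration with a hypothetical function name, not code from cvs2svn, and it does nothing about line-end conventions or keyword expansion.

```python
def apply_rcs_delta(lines, delta):
    """Apply an RCS edit script to `lines` and return the new line list.

    lines: the source text as a list of lines.
    delta: the edit script as a list of lines; each "aN M" command is
           followed by the M lines to insert.
    """
    out = []
    src = 0  # index of the next source line not yet copied or deleted
    i = 0
    while i < len(delta):
        command = delta[i].strip()
        i += 1
        op = command[0]
        start, count = (int(x) for x in command[1:].split())
        if op == "d":
            # Copy everything before 1-based line `start`, then skip `count`.
            out.extend(lines[src:start - 1])
            src = max(src, start - 1 + count)
        elif op == "a":
            # Copy everything up to and including line `start`, then insert.
            out.extend(lines[src:start])
            src = max(src, start)
            out.extend(delta[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % command)
    out.extend(lines[src:])
    return out


text = ["a\n", "b\n", "c\n"]
script = ["d2 1\n", "a2 1\n", "B\n", "a3 1\n", "d\n"]
result = apply_rcs_delta(text, script)
```

In this example the script replaces line 2 and appends a line at the end, so `result` is `["a\n", "B\n", "c\n", "d\n"]`. Walking a file's delta chain once and handing each reconstructed revision to the importer is what makes the file-by-file approach cheap.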

This is a major benefit for git conversion, but it hasn't been a big
issue with the cvs2svn code. Hooks will be helpful.

--
Jon Smirl
jonsmirl@xxxxxxxxx