Re: [PATCH 0/4] Add more tests of cvsimport

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Sat, 21 Feb 2009 07:32:09 +0100

Samuel Lucas Vaz de Mello wrote:
> Michael Haggerty wrote:
>> BTW, I don't want to trash "git cvsimport".  I'm not brave enough even
>> to try to implement incremental conversions in cvs2git.  So the fact
> 
> If I run cvs2git several times against a live cvs repo (using the
> same configuration), wouldn't it perform an incremental import?
> Is there anything that would make it produce different commits for
> the history?
> 
> I've just made a simple test here performing 2 imports (the 2nd with a
> dozen of new commits not in the 1st) and it seemed to work fine.
> 
> I know that it will take the same time/memory as the first import,
> but is there something that can break the repository or produce wrong
> data?

Cool, I'd never thought of that.  It's certainly not by design, but as
you've discovered, cvs2git and git *almost* combine to give you an
incremental import.

Alas, it is only "almost".  There are many things that can happen in a
CVS repository that would cause the overlapping part of the history to
disagree between runs of cvs2git.  The nastiest are things that a VCS
shouldn't really even allow, but are common in CVS, like

- Retroactively adding a file to a branch or tag.  (This is a
much-beloved feature of CVS.)  Since CVS doesn't record the timestamp
when a symbol is added to a file, cvs2git tries (subject to the
constraints of other timestamps) to group all such changes into a single
changeset.  So the creation of the symbol would look different in runs N
vs N+1 of cvs2git--containing different files and likely with a
different timestamp.

- Renaming a file "with history" by renaming or copying the associated
*,v file in the repository.  This retroactively changes the entire
history of that file and thus of all changesets that involved changes to
that file.

- Changing the "text vs binary" or keyword expansion mode of a file.
These properties apply to all revisions of a file, and therefore also
have a retroactive effect.
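
To make the consequence concrete, here is a rough sketch (not cvs2git
code) of why one retroactive change spoils everything after it.  It
assumes a toy model in which a commit id is a hash of the changeset plus
its parent's id, which is loosely how git commit ids behave.  The
function names, the changeset dictionaries, and the file names are all
made up for illustration.

```python
import hashlib


def commit_id(parent_id, changeset):
    """Hash a changeset together with its parent's id (toy model of a git id)."""
    files = ",".join(sorted(changeset["files"]))
    payload = f"{parent_id}|{changeset['timestamp']}|{changeset['log']}|{files}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def convert(changesets):
    """Return the chain of commit ids for a linear history, oldest first."""
    ids, parent = [], ""
    for changeset in changesets:
        parent = commit_id(parent, changeset)
        ids.append(parent)
    return ids


def shared_prefix_length(ids_a, ids_b):
    """Count how many leading commits two conversions have in common."""
    count = 0
    for a, b in zip(ids_a, ids_b):
        if a != b:
            break
        count += 1
    return count


# Hypothetical output of conversion run N.
run_n = [
    {"timestamp": 100, "log": "initial import", "files": ["a.c", "b.c", "c.c"]},
    {"timestamp": 200, "log": "create tag REL_1", "files": ["a.c", "b.c"]},
    {"timestamp": 300, "log": "fix bug", "files": ["a.c"]},
]

# Before run N+1, somebody retroactively adds tag REL_1 to c.c and makes
# one new commit.  The tag changeset now contains a different set of files.
run_n1 = [
    {"timestamp": 100, "log": "initial import", "files": ["a.c", "b.c", "c.c"]},
    {"timestamp": 200, "log": "create tag REL_1", "files": ["a.c", "b.c", "c.c"]},
    {"timestamp": 300, "log": "fix bug", "files": ["a.c"]},
    {"timestamp": 400, "log": "new feature", "files": ["b.c"]},
]

ids_n = convert(run_n)
ids_n1 = convert(run_n1)
print(shared_prefix_length(ids_n, ids_n1))
```

In this model only the first commit is shared between the two runs.  The
"fix bug" changeset is identical in both, but it gets a different id in
run N+1 because its parent changed.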

But even aside from these retroactive changes, the output of cvs2git is
not deterministic in any practical sense (though I've tried to make it
deterministic given *identical* input).  The problem is that there are
so many ambiguities in a CVS history (because CVS doesn't record enough
information) that cvs2git has to use heuristics to decide what
individual file events should be grouped together as commits.  The
trickiest part is that the graph of naively inferred changesets can have
cycles in it, and cvs2git uses several heuristics to decide how to split
up changesets so as to remove the cycles.  (See our design notes [1] for
all the hairy details.)  The CVS commits made between runs N and N+1
could easily change some of the heuristics' decisions, giving different
results even for the overlapping part of the history.
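
A small sketch of how such a cycle can arise may help.  This is not
cvs2git's actual algorithm (the design notes describe that); it assumes
a deliberately naive rule that groups file revisions into changesets by
log message alone.  The revision data and function names are invented
for the example.

```python
from collections import defaultdict

# (file, revision number within that file, log message); hypothetical data.
revisions = [
    ("a.c", 1, "msg X"),
    ("a.c", 2, "msg Y"),
    ("b.c", 1, "msg Y"),
    ("b.c", 2, "msg X"),
]


def naive_changesets(revs):
    """Group file revisions into changesets purely by log message."""
    groups = defaultdict(list)
    for file, number, log in revs:
        groups[log].append((file, number))
    return dict(groups)


def ordering_edges(revs):
    """The changeset holding revision n of a file must precede the changeset
    holding revision n+1 of the same file."""
    owner = {(file, number): log for file, number, log in revs}
    edges = set()
    for (file, number), log in owner.items():
        successor = owner.get((file, number + 1))
        if successor is not None and successor != log:
            edges.add((log, successor))
    return edges


def has_cycle(nodes, edges):
    """Depth-first search for a cycle in the changeset ordering graph."""
    graph = defaultdict(list)
    for before, after in edges:
        graph[before].append(after)
    WHITE, GREY, BLACK = 0, 1, 2
    state = {node: WHITE for node in nodes}

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in nodes)


changesets = naive_changesets(revisions)
cyclic = has_cycle(changesets, ordering_edges(revisions))

# Splitting one changeset (here: giving b.c revision 2 its own changeset)
# removes the cycle.  Which changeset to split is a heuristic choice.
split = revisions[:3] + [("b.c", 2, "msg X (split)")]
acyclic = not has_cycle(naive_changesets(split), ordering_edges(split))
print(cyclic, acyclic)
```

Here "msg X" must come before "msg Y" because of a.c, and "msg Y" before
"msg X" because of b.c, so one of the two has to be split.  A new CVS
commit between runs can tip the choice of which one, and that changes
commits that already existed in the earlier conversion.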

To add robust support for incremental conversions to cvs2git would require
run N+1 to know about the decisions made in run N, to avoid
contradicting them.

I wonder what would happen if one treated the results of cvs2git
conversions N and N+1 as two separate repositories and merge them using
git.  In many cases the merge would probably be trivial, and most
conflicts (except retroactive file renaming!) would probably tend to be
in the recent past and therefore resolvable manually.  At least the
repository shouldn't silently become corrupted, which can happen with
other incremental conversion tools.

The final problem is that cvs2git conversions of large CVS repositories
are quite time-consuming, so using it for incremental conversions of
large repositories would be painful.  No doubt it could be sped up
considerably, especially if conversion N+1 were privy to the results of
conversion N.

These are all challenging problems and I would welcome volunteers and be
happy to get them started.

Michael

[1] http://cvs2svn.tigris.org/svn/cvs2svn/trunk/doc/design-notes.txt