Re: I have end-of-lifed cvsps

On Tue, Dec 17, 2013 at 7:47 PM, Eric S. Raymond <esr@xxxxxxxxxxx> wrote:
> I'm working with Alan Barret now on trying to convert the NetBSD
> repositories. They break cvs-fast-export through sheer bulk of
> metadata, by running the machine out of core.  This is exactly
> the kind of huge case that you're talking about.
>
> Alan and I are going to take a good hard whack at modifying cvs-fast-export
> to make this work. Because there really aren't any feasible alternatives.
> The analysis code in cvsps was never good enough. cvs2git, being written
> in Python, would hit the core limit faster than anything written in C.

Depends on how it organizes its data structures. Have you actually
tried running cvs2git on it? I'm not saying you are wrong, but I had
similar problems with my custom converter (also written in Python),
and solved them by splitting the work into more passes/phases instead
of trying to do too much in each pass. In the end I stored the largest
inter-phase data structures outside of Python (sqlite in my case) to
save memory. Obviously that cost a lot in runtime, but it meant that I
could actually chew through our largest CVS modules without running
out of memory.
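
(Very roughly, and with made-up table/column names -- this is not my
actual converter code -- the idea is just to keep the big per-file
metadata table in sqlite between phases instead of in a Python dict:)

import sqlite3

# Keep per-file revision metadata on disk between phases, so the whole
# mapping never has to fit in memory at once.
db = sqlite3.connect("interphase.sqlite")
db.execute("""
    CREATE TABLE IF NOT EXISTS revisions (
        path     TEXT NOT NULL,
        revision TEXT NOT NULL,
        author   TEXT,
        log      TEXT,
        date     INTEGER,
        PRIMARY KEY (path, revision)
    )
""")

def store_revision(path, revision, author, log, date):
    # One phase writes its results here (batch the commits in practice)...
    db.execute("INSERT OR REPLACE INTO revisions VALUES (?, ?, ?, ?, ?)",
               (path, revision, author, log, date))

def lookup_revision(path, revision):
    # ...and a later phase reads them back on demand.
    cur = db.execute("SELECT author, log, date FROM revisions "
                     "WHERE path = ? AND revision = ?", (path, revision))
    return cur.fetchone()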

> It is certainly the case that a sufficiently large CVS repo will break
> anything, like a star with a mass over the Chandrasekhar limit becoming a
> black hole :-)

:) True, although it's not the sheer size of the files themselves that
is the actual problem. Most of those bytes are (deltified) file data,
which you can pretty much stream through and convert to a
corresponding fast-export stream of blob objects. The code for that
should be fairly straightforward (and should also be eminently
parallelizable, given enough cores and available I/O), resulting in a
table mapping CVS file:revision pairs to corresponding Git blob SHA1s,
and an accompanying (set of) packfile(s) holding said blobs.
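
(As a rough illustration -- again not actual converter code -- emitting
such a blob stream and recording the mapping could look something like
this, where 'out' is assumed to be a binary stream feeding git
fast-import, and 'data' is the fully expanded contents of the file at
that CVS revision, i.e. the RCS deltas have already been applied:)

import hashlib

def emit_blob(out, path, revision, data, mark, blob_table):
    # A Git blob's SHA1 is just sha1("blob <len>\0" + contents), so we
    # can record it up front without talking to git at all.
    sha1 = hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
    blob_table[(path, revision)] = (mark, sha1)
    # Standard fast-import blob command: blob / mark / data <count>.
    out.write(b"blob\n")
    out.write(b"mark :%d\n" % mark)
    out.write(b"data %d\n" % len(data))
    out.write(data)
    out.write(b"\n")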

The hard part comes when trying to correlate the metadata for all the
per-file revisions, and distill that into a consistent sequence/DAG of
changesets/commits across the entire CVS repo. And then, of course,
trying to fit all the branches and tags into that DAG of commits is
what really drives you mad... ;-)
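
(For the simple part of that, the classic heuristic is to group
per-file revisions by author + log message + time window; a toy sketch,
deliberately ignoring branches, tags and clock skew:)

from itertools import groupby

def group_into_changesets(revisions, window=300):
    # 'revisions' is an iterable of (author, log, date, path, revnum)
    # tuples, with 'date' as a Unix timestamp. Revisions sharing author
    # and log message, and lying within 'window' seconds of each other,
    # become one candidate commit.
    changesets = []
    keyed = sorted(revisions, key=lambda r: (r[0], r[1], r[2]))
    for (author, log), revs in groupby(keyed, key=lambda r: (r[0], r[1])):
        current, last_date = [], None
        for _, _, date, path, revnum in revs:
            if last_date is not None and date - last_date > window:
                changesets.append((author, log, current))
                current = []
            current.append((path, revnum))
            last_date = date
        if current:
            changesets.append((author, log, current))
    # Ordering these changesets into a consistent DAG, and fitting the
    # branches and tags onto it, is the part the sketch ignores.
    return changesets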

> The question is how common such supermassive cases are. My own guess is that
> the *BSD repos and a handful of the oldest GNU projects are pretty much the
> whole set; everybody else converted to Subversion within the last decade.

You may be right. At least for the open-source cases. I suspect
there's still a considerable number of huge CVS repos within
companies' walls...

> I find the very idea of writing anything that encourages
> non-history-correct conversions disturbing and want no part of it.
>
> Which matters, because right now the set of people working on CVS lifters
> begins with me and ends with Michael Rafferty (cvs2git),

s/Rafferty/Haggerty/?

> who seems even
> less interested in incremental conversion than I am.  Unless somebody
> comes out of nowhere and wants to own that problem, it's not going
> to get solved.

Agreed. It would be nice to have something to point to for people who
want something similar to git-svn for CVS, but without a motivated
owner, it won't happen.

...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net