On Tue, Dec 17, 2013 at 3:58 PM, Eric S. Raymond <esr@xxxxxxxxxxx> wrote:
> Johan Herland <johan@xxxxxxxxxxx>:
>> HOWEVER, this only solves the "cheap" half of the problem. The reason
>> people want incremental CVS import is to avoid having to repeatedly
>> convert the ENTIRE CVS history. This means that the CVS exporter must
>> learn to start from a given point in the CVS history (identified by
>> the above mapping) and then quickly and efficiently convert only the
>> "new stuff" without having to consult/convert the rest of the CVS
>> history. THIS is the hard part of incremental import. And it is much
>> harder for systems like CVS - where the starting point has a broken
>> concept of history...
>
> I know of *no* importer that solves what you call the "deep" part of
> the problem. cvsps didn't, cvs-fast-import doesn't, cvs2git doesn't.
> All take the easy way out; parse the entire history, and limit what
> is emitted in the output stage.

Yes, and starting from a non-incremental importer, that's probably the
only viable way to approach incrementalism.

> Actually, given what I know about delta-file parsing I'd say a "true"
> incremental CVS exporter would be so hard that it's really not worth the
> bother. The problem is the delta-based history representation.
> Trying to interpret that without building a complete set of history
> states in the process (which is most of the work a whole-history
> exporter does) would be brutally difficult - barely possible in
> principle maybe, but I wouldn't care to try it.

Agreed. You would either have to re-parse the entire ,v file, or you
would have to store some (probably a lot of) intermediate state that
would allow you to resolve the deltas of new revisions without having
to parse all the old revisions.

> It's much more practical to tune up a whole-history exporter so it's
> acceptably fast, then do incremental dumping by suppressing part of
> the conversion in the output stage.
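To make the "delta-based history representation" problem concrete, here is a sketch (a simplified model of RCS ,v storage, not code from any actual converter) of why reconstructing an old revision forces you to walk the delta chain. RCS keeps only the head revision as full text; each older trunk revision is a reverse delta (an ed-style edit script) against its successor, so reaching revision 1.1 means applying every delta from the head downwards:

```python
# Simplified model of RCS/,v trunk storage: head revision as full text,
# older revisions as reverse deltas. Edit commands are 'dN M' (delete M
# lines starting at original line N) and 'aN M' followed by M lines
# (append them after original line N), in ascending order of N.

def apply_delta(text, script):
    """Apply one RCS-style edit script to a revision's list of lines."""
    out = list(text)
    offset = 0  # how earlier commands have shifted line numbers
    i = 0
    while i < len(script):
        cmd = script[i]
        i += 1
        op = cmd[0]
        n, m = (int(x) for x in cmd[1:].split())
        if op == "d":
            start = n - 1 + offset
            del out[start:start + m]
            offset -= m
        else:  # "a": insert the next m script lines after original line n
            pos = n + offset
            out[pos:pos] = script[i:i + m]
            i += m
            offset += m
    return out

# Head revision (say, 1.3) is stored whole; everything older is a delta.
head = ["line one", "line two (new)", "line three"]
delta_1_2 = ["d2 1", "a2 1", "line two"]  # 1.3 -> 1.2
delta_1_1 = ["d3 1"]                      # 1.2 -> 1.1

# To get 1.1 you MUST materialize 1.2 first -- this chain-walking is the
# "complete set of history states" work quoted above.
rev_1_2 = apply_delta(head, delta_1_2)
rev_1_1 = apply_delta(rev_1_2, delta_1_1)
print(rev_1_1)  # -> ['line one', 'line two']
```

An incremental exporter that wants to emit only the newest revision still has to either redo this walk or cache the intermediate revision texts, which is exactly the trade-off described above.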
> cvs-fast-export's benchmark repo is the history of GNU troff. That's
> 3057 commits in 1549 master files; when I reran it just now the
> whole-history conversion took 49 seconds. That's 3.7K commits a
> minute, which is plenty fast enough for anything smaller than (say)
> one of the *BSD repositories.

Those are impressive numbers, and in that scenario, using a
"repurposed" converter (i.e. a whole-history converter that has been
taught to do incremental output) is undoubtedly the best solution.

However, I fear that you underestimate the number of users who want to
use Git against CVS repos that are orders of magnitude larger (in both
dimensions: #commits and #files) than your example repo. For these
repos, running a proper whole-history conversion takes hours - or even
days - and working incrementally on top of that is simply out of the
question. Obviously, they still need the whole-history converter for
the future point in time when they have collected enough
motivation/buy-in to migrate the entire project/company to a better
VCS, but until then, they want to use Git locally, while enduring CVS
on the server.

At my previous $DAYJOB, I was one of those people, and I ended up with
a two-pronged "solution" to the problem (this was ~5 years ago now, so
I'm somewhat fuzzy on the details):

1. Adopt an ad hoc incremental approach for working against the CVS
   server: Keep a CVS checkout next to my git repo, and maintain a map
   between corresponding states/commits in CVS and git. When I update
   from CVS, apply the corresponding patch to the "cvs" branch in my
   git repo, rebase my git-based work on top of that, and use
   "git cvsexportcommit" to propagate my Git work back to CVS. This is
   crude and hacky as hell, but it provides me with a local git-based
   workflow.

2. Start convincing fellow developers and lobbying management about
   switching away from CVS.
We got a discussion started, gained momentum, and eventually I got to
spend most of my time preparing and performing the full-history
conversion from CVS to git. This happened mostly before cvs2svn grew
its cvs2git sibling, so I ended up writing a custom converter for our
particular variation of insane and demented CVS practices. Today, I
would probably have gone for cvs2git, or your more recent work.

But back to my main point: I believe there are two classes of CVS
converters, and I have slowly come to believe that they solve two
fundamentally different problems.

The first problem is "how to faithfully recreate the project history
in a different VCS", which is solved by the full-history converters.
Case closed.

The second problem is somewhat harder to define, but I'll try: "how to
allow me to work productively against a CVS server, without having to
deal with the icky CVS bits". Compared to the first problem, the
parameters differ somewhat:

- Conversion/synchronization time must be short, to allow me to stay
  productive and up-to-date with my colleagues.

- Correctness of the "current state" is very important. I must be sure
  that my git working tree is identical to its CVS counterpart, so
  that my git changes can be reproduced in CVS as faithfully as
  possible.

- Correctness of "history" is less important. I can accept a
  messy/incorrect Git history, since I can always query the CVS server
  for the "correct" history (whatever that means in a CVS context...).

- As a generic CVS user (not the CVS admin) I don't necessarily have
  direct access to the ,v files stored on the CVS server.

Although a full-history converter with fairly stable output can be
made to support this second problem for repos up to a certain size,
there will probably still be users who want to work incrementally
against much bigger repos, and I don't think _any_
full-history-gone-incremental importer will be able to support the
biggest repos.
Consequently, I believe that for these big repos it is _impossible_ to
get both fast incremental workflows and a high degree of (historical)
correctness.

cvsps tried to be all of the above, and failed badly at the
correctness criterion. Therefore I support your decision to "shoot it
through the head". I certainly also support any work towards making a
full-history converter work in an incremental manner, as it will be
immensely useful for smaller CVS repos. But at the same time we should
realize that it won't be a solution for incrementally working against
_large_ CVS repos.

Although it should have been obvious a long time ago, the removal of
cvsps has now made it abundantly clear that Git currently provides no
way to support the incremental workflow against large CVS repos. Maybe
that is OK, and we can ignore that niche, waiting for the few
remaining large CVS repos to die? Or maybe we need a new effort to
fill it? Something that is NOT based on a full-history converter, and
does NOT try to guarantee a history-correct conversion, but that DOES
try to guarantee fast and relatively worry-free two-way
synchronization against a CVS server.

Unfortunately (or fortunately, depending on your POV) I have not had
to touch CVS in a long while, and I don't see that changing soon, so
it is not my itch to scratch.

...Johan

--
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net