Re: cvsps, parsecvs, svn2git and the CVS exporter mess

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Sun, 06 Jan 2013 12:15:24 +0100

On 01/05/2013 04:11 PM, Eric S. Raymond wrote:
> Perhaps I was unclear.  I consider the interface design error to
> be not in the fact that all the blobs are written first or detached,
> but rather that the implementation detail of the two separate journal
> files is ever exposed.
> 
> I understand why the storage of intermediate results was done this
> way, in order to decrease the tool's working set during the run, but
> finishing by automatically concatenating the results and streaming
> them to stdout would surely have been the right thing here.

cvs2svn/cvs2git is built to be able to handle very large CVS
repositories, not only those that can fit in RAM.  This goal influences
a lot of its design, including the pass-by-pass structure with
intermediate databases and the resumability of passes.

The blobfile necessarily contains every version of every file, with no
delta-encoding and no compression.  Its size can be a large multiple of
the on-disk size of the original CVS repository.  If the "save to
tempfiles then cat tempfiles at end of run" behavior were hard-coded
into cvs2git, then there would be no way to avoid requiring enough
temporary space to hold the whole blobfile.

Writing the blobfile into a separate file, on the other hand, means that
for example the blobfile could be written into a named pipe connected to
the standard input of "git fast-import" [1].  "git fast-import" could
even be run on a remote server.
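
As a rough illustration of that arrangement, here is a minimal sketch with
stand-in commands.  It is not cvs2git code: the function name and the
"{fifo}" placeholder are made up for the example, and it assumes a Unix
system where os.mkfifo is available.

```python
import os
import subprocess
import tempfile


def stream_through_fifo(producer_argv, consumer_argv):
    """Run a producer that writes to a named pipe and a consumer that reads it.

    Every "{fifo}" in producer_argv is replaced by the pipe's path (a
    convention of this sketch only).  The consumer gets the pipe as stdin.
    Returns (producer_exit_code, consumer_exit_code, consumer_stdout).
    """
    workdir = tempfile.mkdtemp()
    fifo = os.path.join(workdir, "blobs.fifo")
    os.mkfifo(fifo)  # Unix only
    try:
        producer = subprocess.Popen(
            [arg.replace("{fifo}", fifo) for arg in producer_argv]
        )
        # Opening a FIFO for reading blocks until the producer opens it for
        # writing, so this hangs if the producer never opens the pipe.
        with open(fifo, "rb") as reader:
            consumer = subprocess.run(
                consumer_argv, stdin=reader, stdout=subprocess.PIPE
            )
        return producer.wait(), consumer.returncode, consumer.stdout
    finally:
        os.unlink(fifo)
        os.rmdir(workdir)
```

In the cvs2git case the producer would be a cvs2git invocation whose
--blobfile argument is the pipe and the consumer would be "git fast-import";
the dumpfile would still have to be fed to the importer afterwards.  A pipe
cannot be seeked, so this only works for a producer that writes its output
strictly sequentially.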

I consider these bigger advantages than the ability to pipe the output
of cvs2git directly into another command.

> The downstream cost of letting the journalling implementation be
> exposed, instead, can be seen in this snippet from the new git-cvsimport
> I've been working on:
> 
>     def command(self):
>         "Emit the command implied by all previous options."
>         return "(cvs2git --username=git-cvsimport --quiet --quiet --blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})".format(tempfile.mkstemp()[1], tempfile.mkstemp()[1], self.opts, self.modulepath)
> 
> According to the documentation, every caller of cvs2git must go
> through analogous contortions!  This is not the Unix way; if Unix
> design principles had been minimally applied, that second line would
> just read like this:
> 
>      return "cvs2git --username=git-cvsimport --quiet --quiet"

Never in my worst nightmares did I imagine that my terrible design taste
would force you to type an extra two lines of code.  Oh the humanity!

By the way, patches are welcome.  And you don't need to trumpet their
imminent arrival [2] or malign the existing code beforehand.  Moreover,
it would be adequate if you just demonstrate working code and *then* ask
for "sign-in", rather than the other way around.

> If Unix design principles had been thoroughly applied, the "--quiet
> --quiet" part would be unnecessary too - well-behaved Unix commands
> *default* to being completely quiet unless either (a) they have an
> exceptional condition to report, or (b) their expected running time is
> so long that tasteful silence would leave users in doubt that they're
> working.

cvs2git is not a command that one uses 100 times a day.  It is a tool
for one-shot conversions of CVS repositories to git.  These conversions
can take hours or even days of processing time (not to mention the time
for configuring the conversion and changing the rest of a project's
infrastructure from CVS to git).  So yes, I think we would like to
appeal to (b) and humbly ask for your permission to give the user some
feedback during the conversion.

> (And yes, I do think violating these principles is a lapse of taste when
> git tools do it, too.)
> 
> Michael Haggerty wants me to trust that cvs2git's analysis stage has
> been fixed, but I must say that is a more difficult leap of faith when
> two of the most visible things about it are still (a) a conspicuous
> instance of interface misdesign, and (b) documentation that is careless and
> incomplete.

The cvs2git documentation is lacking; I admit it (as opposed to the
cvs2svn documentation, which I think is quite complete).  And the
program itself also has a lot of rough edges, for example its inability
to convert .cvsignore files into .gitignore files.  Patches are welcome.
I haven't used cvs2svn for my own purposes in many years and I've
*never* once had a need to use cvs2git; I maintain these programs purely
as a service to the community.  Most of the community seems satisfied
with the programs as they are, and those who are not usually submit
courteous and concrete bug reports or patches.

I request that you follow their example.  I especially ask that you
refrain from spreading public FUD about imagined problems based on
speculation.  Please do your tests and *then* report any problems that
you find.

Yours,
Michael

[1] In fact, the current implementation of generate_blobs.py sometimes
seeks back to earlier parts of the blob file when it needs the fulltext
of a revision that has already been output, but this would be easy to
change as soon as somebody needs it.

[2] http://comments.gmane.org/gmane.comp.version-control.git/212340

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/