Re: GSOC Proposal draft: git-remote-svn

Andrew Sayers <andrew-git@xxxxxxxxxxxxxxx> · Tue, 10 Apr 2012 23:30:21 +0100

On 10/04/12 18:17, Jonathan Nieder wrote:
<snip>
> Given the goal described here of an import with support for
> automatically detecting branches, here are some rough steps I imagine
> would be involved:

Just to be clear, my understanding is that this project will take SBL
created by another program (that I'm writing) and create branches as
specified.  This frees Florian from having to deal with the maze of edge
cases involved in that part of the problem.

> 
>  . baseline: remote helper in C
> 
>  . option to import starting with a particular numbered revision.
>    This would be good practice for seeing how options passed to
>    "git clone -c" can be read from the config file.
> 
>  . option or URL schema to import a single project from a large
>    Subversion repository that houses several projects.  This would
>    already be useful in practice since importing the entire Apache
>    Software Foundation repository takes a while which is a waste
>    when one only wants the history of the Subversion project.
> 
>    How should the importer handle Subversion copy commands that
>    refer to other projects in this case?

This is a good point.  I've just tried svnadmin and svnrdump, and it turns out
svnadmin doesn't allow you to dump a subtree while svnrdump strips out
the offending copy commands, so either way there's nothing to be done.

>  . automatically detecting trunk when importing a project with the
>    standard layout.  The trunk usually is not branched from elsewhere
>    so this does not require copyfrom info.  Some design questions
>    come up here: should the remote helper import the entire project
>    tree, too?  (I think "yes", since copy commands that copy from
>    other branches are very common and that would ensure the relevant
>    info is available to git.)  What should the mapping of git commit
>    names to Subversion revision numbers that is stored in notes say
>    in this case?
> 
>  . detecting trunk and branches and exposing them as different remote
>    branches.  This is a small step that just involves understanding
>    how remote helpers expose branches.

After last week's discussion about branch absorption, I tried writing
another algorithm over the weekend.  I plan to test it during the week,
but online detection of branches and trunks looks fairly practical in
most real-world cases (even those that are sanity-challenged).

>  . storing path properties and copyfrom information in the commits
>    produced by the vcs-svn/ library.  How should these be stored?
>    For example, there could be a parallel directory structure
>    in the tree:

Yes, this is an important problem.  It became apparent over the weekend
that my code was I/O bound, so I started caching the metadata I need
(without e.g. file contents) in a gzipped file containing a list of JSON
blobs (one blob per revision).  That immediately caused the script to
jump from about a hundred revisions/second to a few thousand(!), and
each further size optimisation caused it to jump by another few thousand
per second.
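
To illustrate the cache format, here is a minimal Python sketch.  It is
not the script described above: the function names, the field names and
the one-JSON-object-per-line layout are all assumptions made for the
example.

```python
import gzip
import json


def write_cache(path, revisions):
    """Write one compact JSON object per revision, one per line (assumed layout)."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for rev in revisions:
            # Compact separators keep the file small; only metadata is stored,
            # never file contents.
            out.write(json.dumps(rev, separators=(",", ":")) + "\n")


def read_cache(path):
    """Yield revisions one at a time, so the whole cache never sits in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield json.loads(line)
```

Reading the cache back is then a single streaming pass, e.g.
"for rev in read_cache('repo.json.gz'): ..." (hypothetical file name).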

This sort of speed is useful for the initial SVN->git conversion,
because it means even people with very large repositories can have a
quick edit/compile/test loop when they're looking for mis-detected branches.

Having said all that, a git directory is easier to examine and update
than a gzipped file.  I have no idea what the performance would be like,
but even if a directory was slower we could use gzipped JSON as a cache
layer during the initial import, then throw it away and read straight
from a git directory on update.

>  . tracing history past branch creation events, using the now-saved
>    copyfrom information.

I'm not sure if I understand correctly, but I think you're referring to
this edge case:

mkdir tronk brunches
svn add tronk brunches
svn ci -m "Initial commit, with typos to evade stdlayout detection"

mkdir tronk/libfoo
touch tronk/libfoo/main.c
svn add tronk/libfoo
svn ci -m "Created libfoo - no way to know this isn't a branch"

svn up # so the 'svn cp' works correctly below

svn cp tronk brunches/copy_of_tronk
touch brunches/copy_of_tronk/main.c
svn add brunches/copy_of_tronk/main.c
svn ci -m "Marking the copy as a branch, but what about the original?"

I'm not actually sure what the right behaviour is here.  You could argue
that once we know "copy_of_tronk" is a branch, it follows that "tronk"
itself is a branch.  On the other hand, these directories have diverged,
and who's to say it wasn't because of a disagreement about which
directory was the branch?

Branch absorption makes this problem less important - the "tronk/libfoo"
branch will be deleted and merged into the new "tronk" branch the moment
someone creates "tronk/main.c", which tends to happen pretty quickly in
the real world.
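
A toy Python sketch of the absorption rule as described here, for
illustration only.  It is not the real algorithm: the rule "a directory
becomes a branch when a file is created directly inside it" and the
names is_inside and add_file are assumptions made for the example.

```python
def is_inside(path, ancestor):
    """True if path equals ancestor or lies below it ('' is the repository root)."""
    return ancestor == "" or path == ancestor or path.startswith(ancestor + "/")


def add_file(branches, file_path):
    """Return (new_branches, absorbed) after a file is created at file_path."""
    directory = file_path.rpartition("/")[0]
    # A file created inside an existing branch changes nothing.
    if any(is_inside(directory, branch) for branch in branches):
        return set(branches), set()
    # Otherwise the directory becomes a branch and absorbs any branches below it.
    absorbed = {branch for branch in branches if is_inside(branch, directory)}
    return (set(branches) - absorbed) | {directory}, absorbed
```

Under these assumptions, creating "tronk/libfoo/main.c" makes
"tronk/libfoo" a branch, and later creating "tronk/main.c" replaces it
with "tronk".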

I'm open to suggestions, but my instinct right now is to say that
communicating branchiness back through a copyfrom should at least
require confirmation by the user.

>  . tracing second-parent history using svn:mergeinfo properties.

My old POC code did this, and I plan to include it in the work I'm doing
now.  I expect this to be the hardest single part of the project to
solve in the general case, because of SVN's troubled approach to merge
handling.
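
For reference, an svn:mergeinfo value is a list of lines of the form
"/path:revisions", where revisions is a comma-separated list of single
revisions or ranges, optionally suffixed with "*" for non-inheritable
ranges.  A minimal Python sketch of parsing one such value follows; the
function name is made up for the example, and it does none of the
harder work of deciding which merges deserve a second parent.

```python
def parse_mergeinfo(value):
    """Map each merge source path to a list of (first, last) revision ranges."""
    merged = {}
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last colon, in case the path itself contains one.
        path, _, ranges = line.rpartition(":")
        revs = []
        for item in ranges.split(","):
            # A trailing '*' marks a non-inheritable range; this sketch drops it.
            item = item.strip().rstrip("*")
            first, _, last = item.partition("-")
            revs.append((int(first), int(last or first)))
        merged[path] = revs
    return merged
```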

<snip>
> Another question is: what is the design for this?

Here's my part of the equation:

Right now I have a script that first takes an SVN dump and produces
gzipped JSON as output, then takes the gzipped JSON as input and
produces an SBL file as output.  The first round will generally only
need to be run once (and is comparable to svn-fe in speed), whereas the
second round might need to be run an arbitrary number of times (but is
very fast).
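
As an illustration of what the first round involves, here is a minimal
Python sketch that pulls per-revision metadata out of an SVN dump
stream and skips property and file bodies.  It is not the script
described above: the function names and the output field names are
assumptions, and it ignores deltas, properties and malformed input.

```python
def read_headers(stream):
    """Read one 'Key: value' header block; return None at end of input."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:                      # end of input
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                   # a blank line ends the block
                return headers
            continue                      # blank padding between records
        key, _, value = line.partition(b": ")
        headers[key.decode()] = value.decode()


def dump_metadata(stream):
    """Return a list of revisions, each with the nodes it touched."""
    revisions = []
    while True:
        headers = read_headers(stream)
        if headers is None:
            return revisions
        # Skip the body (properties and file text) without parsing it.
        stream.read(int(headers.get("Content-length", 0)))
        if "Revision-number" in headers:
            revisions.append({"rev": int(headers["Revision-number"]), "nodes": []})
        elif "Node-path" in headers and revisions:
            copy_rev = headers.get("Node-copyfrom-rev")
            revisions[-1]["nodes"].append({
                "path": headers["Node-path"],
                "kind": headers.get("Node-kind"),
                "action": headers.get("Node-action"),
                "copyfrom_path": headers.get("Node-copyfrom-path"),
                "copyfrom_rev": int(copy_rev) if copy_rev else None,
            })
```

The stream must be opened in binary mode, e.g.
"dump_metadata(open('repo.dump', 'rb'))" (hypothetical file name).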

Incidentally, the initial cache generation is the only part that's still
tied to the SVN dump format, and I doubt it would be that hard for
someone to rewrite it inside svn-fe or to make it read from git metadata
in future.

I'm currently focussing on bringing all the modules up to release
quality, so that I can have something for Florian to play with in the
near future.  The aim is an interface that is mature but flexible: I
should be free to change it to make his life easier, without being
forced to change it because I missed something.  After that, I'll
concentrate on improving the quality of the SBL output.

	- Andrew