Re: Approaches to SVN to Git conversion

Stephen Bash <bash@xxxxxxxxxxx> · Tue, 06 Mar 2012 09:36:31 -0500 (EST)

----- Original Message -----
> From: "Andrew Sayers" <andrew-git@xxxxxxxxxxxxxxx>
> Sent: Monday, March 5, 2012 6:27:32 PM
> Subject: Re: Approaches to SVN to Git conversion
> 
> > My current thinking (and this is very much open for discussion) is
> > that as long as the SVN properties are available (especially the
> > copyfrom information) Git has just as much information (if not more)
> > to reconstruct the SVN history as SVN does.  (And going through our
> > messy history I haven't found any counterpoint to this yet)
> 
> I agree that git can be taught a superset of the information in SVN,
> but you'll need absolutely all SVN properties available...

I'm pretty sure Jonathan won't be happy with anything less ;)

> I wrote my SVN exporter based on SVN dumps for three reasons - I
> figured people switching from SVN would be more comfortable
> customising a solution that only used technologies they understood, I
> figured it might be useful to Mercurial or Bazaar some day if it was
> DVCS-neutral, and I have to use SVN for my day job so I'm more
> interested in getting a good migration story today than a great one
> tomorrow.

The multiple systems argument is a good one.

> >   my %branch_spec = ( '/trunk/projname' => 'master',
> >                       '/branches/*/projname' => '/refs/heads/*' );
> >   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> > 
> > Now I know this simple mapping will fail as I get further in our
> > history -- in particular we have one branch that came from:
> > 
> >   svn cp $SVN_REPO/trunk/ $SVN_REPO/foo  # OOPS! not in branches!
> >   svn mv $SVN_REPO/foo $SVN_REPO/branches/foo
> > 
> > It's then up to the user to modify the branch
> > map to something that accounts for this behavior:
> > 
> >   my %branch_spec = ( '/trunk/projname' => 'master',
> >                       '/branches/*/projname' => '/refs/heads/*',
> >                       '/foo' => '/refs/heads/foo' );
> >   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> 
> I started with an approach like you describe, but as you say it winds
> up in a mess of special cases.  A friend pointed me to Perl's catalyst
> repository[2], which is a wonderful haven of every mad SVN thing ever
> dreamt up.  That got me playing with more general heuristics, and
> while writing this e-mail I think I've finally nailed it.  What do you
> say to defining SVN branches like this:
> 
> A directory is a branch if...
> 1. it is not a subdirectory of an existing branch; and
> 2. either:
> 2a. it is in a list of branches specified by the user, or
> 2b. it is copied from a (subdirectory of a) branch

I think I started with a very similar set of rules...  Looking at my code now, I'm having a hard time summarizing them (probably because they evolved with the code, so what started simple morphed into something pretty complicated).  I guess these rules would be okay as long as the user has the option to say "no, don't treat this copy as a branch" (or, equivalently, the Git side of things has a way to say "ignore this branch").  But at that point we're back to a list of exceptions; really we're arguing white-list vs. black-list.  For our conversion I started with a black-list and eventually went the white-list route (and even that white-list still required a few manual edits before manipulating the Git history).  So take that single data point for what it's worth.
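
To make the rules quoted above concrete, here is a rough sketch in Python.  It is not code from either of our converters: the function names, the parameters, and the `ignored` set (standing in for the "don't treat this copy as a branch" escape hatch) are all made up for illustration, and paths are assumed to be plain absolute SVN paths with no wildcards.

```python
def is_within(path, root):
    """True if path is root itself or lies somewhere below it."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def is_branch(path, copied_from, known_branches, user_branches,
              ignored=frozenset()):
    """Decide whether a newly added SVN directory should become a branch.

    path           -- directory added in this revision
    copied_from    -- its copyfrom path, or None if it was not a copy
    known_branches -- directories already treated as branches
    user_branches  -- directories the user listed explicitly (rule 2a)
    ignored        -- directories the user said must never be branches
    """
    # Hypothetical escape hatch: the user's "no" always wins.
    if path in ignored:
        return False
    # Rule 1: a subdirectory of an existing branch is not a branch.
    if any(path != b and is_within(path, b) for b in known_branches):
        return False
    # Rule 2a: the user listed it explicitly.
    if path in user_branches:
        return True
    # Rule 2b: it was copied from a branch or a subdirectory of one.
    return copied_from is not None and any(
        is_within(copied_from, b) for b in known_branches
    )
```

With `/trunk` as the only known branch, the stray `svn cp trunk foo` from my example would be picked up by rule 2b without the user listing `/foo`, and putting `/foo` in `ignored` switches that off again.  Whether the list of exceptions ends up shorter than a plain white-list is exactly the open question.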

> > > Once the format is defined, git import is fairly straightforward.
> > > Proof-of-concept code to follow, but it's really just a wrapper
> > > around git-commit-tree, git-mktag etc.  I wrote this in Perl
> > > thinking it would relate somehow to git-svn, but eventually
> > > realised it didn't and that a few hundred calls to (plumbing)
> > > processes per second isn't so good for performance.  The only
> > > interesting part of the problem is how to tackle SVN tags.  I went
> > > for an ambitious approach, making normal tags where possible and
> > > downgrading them to lightweight tags when necessary.  This does
> > > involve managing something that is effectively a branch in
> > > refs/tags/, but what else is an SVN tag but a branch in the wrong
> > > namespace?
> > 
> > I don't understand how "normal" and "lightweight" apply in this
> > situation? ... In the case of actual content changes in a tag's
> > life, I think it's up to the user to decide between three options:
> > 
> >   1) only retain the last SVN tag
> >   2) tag using the git-svn-style 'tagname@rev' for all but the last
> >   3) Do (2), but move older tags to some hidden namespace
> >      (refs/hidden/tags or the like)
> > 
> > ... In the bidirectional case things get murky (maybe always tag
> > with tagname@rev and hope for tab completion?).
> 
> I didn't explain this particularly well, as it's based largely on the
> vague desire to make update work some day.  Imagine the user does
> this:
> 
> * git svn-pull # get tags/foo, a candidate for an annotated tag
> ... time passes ...
> * git svn-pull # tags/foo has now been updated in another revision
> 
> If we create an annotated tag in step 1, what do we do in step 2?  You
> can't make the tag object the parent of a new revision, so you need to
> do something unpleasant.  The solution I proposed was to convert the
> tag message to a commit message (i.e. pretend a lightweight tag had
> been created all along), then add another commit on top of it and make
> a lightweight tag from the new commit (i.e. treat it like a branch).
> In retrospect that's far too much magic without user involvement - a
> better solution would be to give the user this option along with the
> ones you outlined, and let git-config remember their preference if
> they want.

Okay, that's what I thought you meant.  I had classified it as a bidirectional problem, but I guess it's not strictly bidirectional: it shows up whenever Git keeps updating from SVN, and only a one-time migration avoids it.  If you want to continue to update Git from SVN there are two cases to consider:

  1) Each Git repository *only* talks to SVN
  2) The Git repository is cloned for further use 
     (So the chain is something like SVN->Git->Git)

In (1) your lightweight tag solution is probably okay (but I'm pretty sure creating/deleting annotated tags would behave the same way because no one else sees the Git tag object).  In (2) I think there would still be a tag conflict when the upstream Git repo replaces a lightweight tag and the downstream repo attempts to fetch it.  I don't know what the fetch/pull machinery does when there's a lightweight tag conflict (I'm guessing it either bails out or keeps the local one?).  Case (2) motivates me to say always generate (annotated?) tags named tagname@rev so there can be no conflicts.  In that case the only remaining choice I see is whether we create an empty Git commit carrying the tag message plus a lightweight tag, or tag the original commit with an annotated tag (I think it's fairly obvious I'm a fan of the latter).
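
Here is a rough Python sketch of the naming options discussed above, just to show why always using tagname@rev avoids the conflict.  The function and policy names are hypothetical, `refs/hidden/tags` is only the example namespace from my earlier list, and the sketch plans ref names only; it does not create any Git objects.

```python
def plan_tag_refs(name, revisions, policy="always-at-rev"):
    """Map each SVN revision that touched tags/<name> to a Git ref name.

    Policies (names are illustrative only):
      last-only        -- keep just the newest revision as refs/tags/<name>
      at-rev-for-older -- older revisions become <name>@<rev>
      hidden-older     -- as above, but older refs go under refs/hidden/tags
      always-at-rev    -- every revision becomes <name>@<rev>
    """
    revs = sorted(revisions)
    if not revs:
        return {}
    *older, last = revs

    if policy == "always-at-rev":
        return {r: f"refs/tags/{name}@{r}" for r in revs}
    if policy == "last-only":
        return {last: f"refs/tags/{name}"}
    if policy == "at-rev-for-older":
        plan = {r: f"refs/tags/{name}@{r}" for r in older}
    elif policy == "hidden-older":
        plan = {r: f"refs/hidden/tags/{name}@{r}" for r in older}
    else:
        raise ValueError(f"unknown policy: {policy}")
    # In the two mixed policies the plain name follows the newest revision,
    # so it moves every time the SVN tag is updated.
    plan[last] = f"refs/tags/{name}"
    return plan
```

Under "always-at-rev", a later pull that sees a new revision of tags/foo only adds a ref; nothing a downstream clone has already fetched is renamed or moved.  Under the other policies `refs/tags/foo` points somewhere new after the second pull, which is the conflict in case (2).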

> [1] http://en.wikipedia.org/wiki/Full_employment_theorem
> [2] http://dev.catalyst.perl.org/repos/bast/

Thanks,
Stephen