Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)

Stephen Bash <bash@xxxxxxxxxxx> · Tue, 19 Oct 2010 09:33:16 -0400 (EDT)

----- Original Message -----
> From: "Ramkumar Ramachandra" <artagnon@xxxxxxxxx>
> To: "Stephen Bash" <bash@xxxxxxxxxxx>
> Sent: Tuesday, October 19, 2010 2:42:15 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
> 
> Stephen Bash writes:
> > I'm going to collapse all these comments because I think we're
> > coming at this from different angles. I agree, discovering the
> > copies in git is "easy" (albeit an n^2 operation), and git will
> > correctly identify file content. But when I was asked to preserve
> > the SVN history, I decided to extract a DAG from SVN and migrate
> > that DAG to Git. Thus the history itself is preserved (sans
> > merges), not just the contents of the files. This is the purpose of
> > buildSVNTree. I can elaborate further if requested.
> 
> Yep, they're certainly two different ways to approach the problem: I'd
> be interested in investigating why it will produce different
> results. Since we both agree that it's easier (and faster) to do it in
> Git-land, I'm looking into the areas where it falls short.

Ack!  I left my example at home this morning... I'll explain it here, but perhaps I can actually send out a test script tonight or tomorrow (if there's need).  The basic premise is that git's copy detection finds files with the same content, and those files are not necessarily the source of the SVN copy.

It's also possible you can do this in svn-fe or in fast-import -- there may be more information there.  I was looking strictly pre-svn-fe or post-fast-import...

Here's how I created a discrepancy between SVN and Git:
  1) Create a new svn repo
  2) Create the standard layout (trunk, branches, tags)
  3) Create multiple files on the trunk
  4) Create a branch (svn cp trunk branches/branchName)
  5) Edit a file on the branch (leave some of the others alone)
  6) (optional) edit a file on the trunk
  7) Merge the branch back to the trunk
  8) Create a tag from the trunk (svn cp trunk tags/tagName)
  9) git fast-import the repo

Now "svn log -v svn://svnrepo/tags/tagName" will show something like
  A /tags/tagName (from /trunk:rev)
OTOH "git log --name-status --find-copies-harder" will show something like
  C100 /tags/tagName/foo (from /trunk/foo)
  C100 /tags/tagName/bar (from /branches/branchName/bar)
  C100 /tags/tagName/baz (from /trunk/baz)
assuming bar is the file edited on the branch and then merged back to the trunk (this is all from memory, so please forgive me if the output isn't quite right).  I think from Git's point-of-view, this copy information is correct, but it doesn't describe SVN's history -- and I'm not entirely sure how a Git-only solution could identify precisely what's going on there... (hopefully I'm just being naive)
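
To make the ambiguity concrete, here is a minimal Python sketch of the scenario above. It is only a toy model, not Git's actual copy detection (which scores similarity and reports a single source); it looks for exact content matches only. The file contents, the `before_tag` snapshot, and the helper names are all hypothetical, chosen to mirror steps 3 to 8 with the optional trunk edit applied to baz.

```python
import hashlib

def blob_id(content: str) -> str:
    """Content hash standing in for a blob id (not Git's real object format)."""
    return hashlib.sha1(content.encode()).hexdigest()

# Hypothetical snapshot just before the tag is created. "bar" was edited on
# the branch and merged back, so trunk and branch hold identical copies.
# "baz" was edited on the trunk only (the optional step 6).
before_tag = {
    "trunk/foo": "foo v1",
    "trunk/bar": "bar v2 (edited on branch, merged to trunk)",
    "trunk/baz": "baz v2 (edited on trunk)",
    "branches/branchName/foo": "foo v1",
    "branches/branchName/bar": "bar v2 (edited on branch, merged to trunk)",
    "branches/branchName/baz": "baz v1",
}

# "svn cp trunk tags/tagName": SVN records one directory-level copy source.
svn_copyfrom = {"tags/tagName": "trunk"}

# The files that appear under the tag are exactly the trunk files.
tag_files = {
    path.replace("trunk/", "tags/tagName/", 1): content
    for path, content in before_tag.items()
    if path.startswith("trunk/")
}

def content_candidates(new_files, old_files):
    """For each new file, list every old path with byte-identical content."""
    by_blob = {}
    for path, content in old_files.items():
        by_blob.setdefault(blob_id(content), []).append(path)
    return {
        path: sorted(by_blob.get(blob_id(content), []))
        for path, content in new_files.items()
    }

candidates = content_candidates(tag_files, before_tag)

if __name__ == "__main__":
    print("SVN copy source:", svn_copyfrom)
    for path in sorted(candidates):
        print(path, "<-", candidates[path])
```

In this model bar under the tag has two equally good content matches (the trunk and the branch), while SVN recorded only the trunk. Content alone cannot tell them apart, which is why I think the copy source has to come from the SVN side.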

> > I found a 'db-svn-filter-root' branch, but it was not entirely
> > obvious to me what code I should be looking at...
> 
> Um, there's just one commit that deviates from the branch it's based
> on (but you don't know that, and I should have been clearer): look at
> contrib/svn-fe/svn-filter-root.py
> 
> It's just a minimalistic mapper, but it's fast and done nicely. You
> can use ideas from it when you're building yours.

Okay, David pointed me to that earlier, but I haven't dug into it yet.  I'll take a look.

> > I'm glad it's stimulating conversation. I'm beginning to wonder if
> > there might be competing design goals for one-way vs. two-way
> > compatibility... Performance is one place where opinions probably
> > greatly differ (I didn't mind taking an extra 30 minutes to mirror
> > my SVN repo because it probably saved more than that in
> > communication overhead later in the process, but that mirror
> > operation is very taxing on your timeline); my exhaustive search of
> > all SVN copies is another (I wanted to be *extremely* certain I knew
> > about all the misplaced branches/tags, but it's inefficient for a
> > casual developer who just wants to interact with an SVN server).
> > It's all just food for thought, and I'm happy to carry on the
> > conversation from my different point-of-view :)
> 
> Ok, I still don't get this part- why mirror at all? Can't all the
> information be mined out of the in-memory tree that svn-fe builds
> while parsing the dumpfile? From the SVN-side, all that's required is
> a streaming dumpfile like the one that `svnrdump dump` produces.

Oh, from that point of view the svn mirror is a bystander.  I was developing these tools at the same time as svnrdump (or at least prior to a stable version of svnrdump).  So when I found that running "svnadmin dump | svn-fe | git fast-import" on the server was taxing the system, I decided it was better to create a dump file, copy it to my local machine, and run svn-fe and fast-import locally.  Once I had the dump file, the local mirror sped up the SVN::Ra calls in buildSVNTree, and made any "did that really happen in svn?!" questions a little easier to answer.
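
For reference, the local half of that workflow (dump file in, svn-fe, then fast-import) could be sketched in Python roughly as follows. This is an illustration, not the script I actually used: the function name `import_dump` and its parameters are made up, the two commands are simply the ones from the pipeline above with no extra options, and a given svn-fe version may need additional plumbing that is not shown here.

```python
import subprocess

def import_dump(dump_file, git_dir,
                fe_cmd=("svn-fe",),
                import_cmd=("git", "fast-import")):
    """Pipe an existing SVN dump file through fe_cmd into import_cmd.

    Equivalent in spirit to:  svn-fe < dump_file | git fast-import
    with the importer run inside git_dir. Returns both exit codes.
    """
    with open(dump_file, "rb") as dump:
        # First stage reads the dump file on stdin and writes a stream.
        fe = subprocess.Popen(list(fe_cmd), stdin=dump,
                              stdout=subprocess.PIPE)
        # Second stage consumes that stream inside the target repository.
        importer = subprocess.Popen(list(import_cmd), stdin=fe.stdout,
                                    cwd=git_dir)
        # Close our copy of the pipe so the first stage sees a broken
        # pipe if the importer exits early.
        fe.stdout.close()
        import_rc = importer.wait()
        fe_rc = fe.wait()
    return fe_rc, import_rc
```

Usage would be something like `import_dump("repo.dump", "/path/to/git/repo")`, where the dump file was created on the server with svnadmin dump and copied over; both paths here are placeholders.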

Thanks,
Stephen