On Wed, 2010-06-30 at 14:45 +0200, Ramkumar Ramachandra wrote:
> > I wrote at length about this near the beginning of the project;
> > essentially, figuring out whether particular paths are roots or not is
> > not defined, as SVN does not distinguish between them (a misfeature
> > cargo culted from Perforce). It becomes a data mining problem, you have
> > this scattered data, and you have to find a history inside.
>
> Right. Implementing git-svn on top of git-remote-svn might not be a
> bad idea.

That's a good way to look at it, yes. Probably git-svn has more
svn-specific code than import rules, so just the interface like
--stdlayout etc is worth keeping, as well as checking that the svn
import data miner could do all the same things as git-svn.

> > I consider it very important to separate the data import and tracking
> > stage from the data mining stage.
>
> We're following this approach. At the moment, we're just focusing on
> getting all the data directly from SVN into the Git store. Instead of
> building trees for each SVN revision, we've found a way to do it
> inside the Git object store: we're currently ironing out the details,
> and I'll post an update about this shortly.

Of course no working copy need exist with these contents within; that
would hardly be 'cheap copy' would it? :)  But it's probably worth
sticking to the standard tree/blob/commit object convention for ease
of debugging etc.

> > Once the data mining stage is well solved, then it makes sense to look
> > at ways that a tracking branch which only tracks a part of the
> > Subversion repository can be achieved. In the simple case, where no
> > repository re-organisation or cross-project renames have occurred it is
> > relatively simple. But in general I think this is a harder problem,
> > which cannot always be solved without intervention - and so not
> > necessary to be solved in short-term milestones. As you are
> > discovering, it is a can of worms which you avoid if you know you always
> > have the complete SVN repository available.
>
> Right. I'm not convinced that it necessarily requires user
> intervention though: can you systematically prove that enough
> information is not available without user intervention using an
> example? Or is it possible, but simply too difficult (and not worth
> the effort) to mine out the data?

Sure, well all you really need to do is try it with a few real-world
repositories. But I can give you a few examples of where all attempts
at heuristics will fail.

The first is where someone puts a file somewhere in the repository,
perhaps a README.txt or something, somewhere outside the regular
location:

  r1: add /README.txt

Then, someone comes along and starts making their project:

  r2: add /trunk/README.txt

How do you know that the first commit is not part of any project, but
some out-of-band notes to people working with the repository?

The way I approached all this in my perforce converter (remember,
Perforce is like SVN in almost every way) is to progressively scan the
history and build up two tables which trace the "mined" history. You
can see the table definitions at

  http://utsl.gen.nz/gitweb/?p=git-p4raw;a=blob;f=tables.sql;h=259c243;hb=7e4fc4a#l205

The first, change_branches, records that a logical branch exists at a
revision and path:

  (branchpath, change)

(you might want another 'column' in your conceptual data model: the
project name; I was dealing with a single project).
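To make that concrete, here is a minimal sketch of the kind of table I
mean - not the actual definition from tables.sql above, and the column
names are only illustrative:

  CREATE TABLE change_branches (
      branchpath  text     not null,  -- eg '/branches/ProjectA'
      change      integer  not null,  -- Perforce change (or SVN revision) number
      -- project  text,               -- the extra conceptual column mentioned above
      PRIMARY KEY (branchpath, change)
  );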
There are also cases where someone does something dumb, and then it is
repaired on the next commit. eg:

  /trunk/ProjectA
  /trunk/ProjectB
  /branches/ProjectA/foo
  /branches/ProjectB/bar

Someone comes along and does something like:

  rm /trunk
  mv /branches/ProjectA/foo /trunk

Whoops! The /trunk path just got wiped. How do we fix it? In a hurry,
the system administrator checks out the old revision, tars them all
up, then uses 'svn add' to put them back:

  rm -r /trunk/*
  add /trunk/ProjectA
  add /trunk/ProjectB

After this, people working on it realise the mistake: the disconnected
history won't merge, etc. But the change is permanent, and they work
around this error in the history. They don't want to do the more
correct thing, which is to restore the history from the broken commit:

  rm -r /trunk/*
  cp /trunk@42 /trunk

They don't want to do this because SVN has taught them that version
control is a fragile thing, and you don't want to monkey with it.
Because it can break, and then your whole world is changed as the
precious black box which all your work is going into doesn't work
quite as before. Because there is no "undo". Because it has all these
opaque files inside it that no-one can understand. What happened
before with the rename upset and embarrassed you, and you don't want
to risk making it worse. This sort of thing does actually happen.

The lesson is that you can't trust heuristics, or the revision control
breadcrumbs (copied-from etc.), to be perfect. They are invisible -
impossible to inspect directly using the SVN command-line API, and
impossible to revise once they are there. By contrast, with git we
have grafts, refs/replace/OBJID, filter-branch, rebase, etc. We have
visualization tools, git-add -p, git gui. We have an object store
which is robust, simple and widely understood. We have a simple data
model, so that the actual information can be understood by people and
not just by buggy software.

Of course with SVN you have the fact that for the entirety of its life
as a relevant version control system, it did not support merge
tracking. So, most history being imported will not have any reliable
merge information. If you read early versions of the SVN manual, they
actually advocate recording, in natural language, a human-readable
description of the work done, in the commit message. I've seen people
working around this lack of functionality by developing their own
systems, sometimes not even being able to reconstruct what was merged
where (eg, in Parrot SVN).

Yet another situation is partial merging; unlike SVN, Perforce had
detailed merge tracking from the very beginning. With Perforce it
worked on a per-file level only, so it is slightly different in that
respect. But what you find is that sometimes, people will merge only
a part of another tree into their "trunk" at a time:

  r45: merged /branches/ProjectA/src -> /trunk/ProjectA/dest
  r46: merged /branches/ProjectB/doc -> /trunk/ProjectB/doc

What I normally find in this case is that there is no useful history
recorded in those intermediate commits; they were just committing to
save their intermediate work from being lost. This doesn't happen
quite so much in Perforce, because it has a concept of "index" missing
entirely from (the user API of) Subversion. In that case, it makes
more sense to simply record a single merge and leave out the
intermediate commits.

To work around parenting mistakes - both those caused by misuse and
those caused by a lack of SVN functionality - you need to be able to
revise the parent information readily and easily. To do this, the
second important table in git-p4raw recorded parents:

  (branchpath, change, parent_branchpath, parent_change)

or, if I was stitching on a pre-Perforce or otherwise manually
converted history:

  (branchpath, change, parent_sha1)
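Again as a rough, illustrative sketch rather than the real schema - a
single table where either the (parent_branchpath, parent_change) pair
or parent_sha1 is filled in for a given row:

  CREATE TABLE parents (
      branchpath         text     not null,
      change             integer  not null,
      parent_branchpath  text,       -- NULL when the parent is a grafted commit
      parent_change      integer,
      parent_sha1        char(40),   -- set instead when grafting converted history
      FOREIGN KEY (branchpath, change)
          REFERENCES change_branches (branchpath, change)
  );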
So, in the Perl code, I wrote a command called "find_branches", which
correlates the information already in the database with the changes
for that revision, and progressively looks for new revisions. It also
creates provisional parent information based on the integration
breadcrumbs.

What I would then do is look at the result in 'gitk', and if there
were problems, they could usually be fixed by fiddling with the parent
information, rewinding the export (see 'unexport_commits') and
re-running it. Sometimes this meant adding a missing merge parent;
sometimes my fuzzy logic for guessing the merge parents guessed badly.
Obviously, I was also developing the importer along the way, so as
well as data errors there were bugfixes to make, etc. This was not
arduous; the speed of Postgres' query evaluation (with some tuning)
and git fast-import meant I was typically exporting at several
*hundred* commits per second.

As I had a facility to graft manually converted history using
git-p4raw (above, that's where the "parent" is a commit SHA1, not a
Perforce revision number and path), I even went back and found various
changes in Perforce that looked more like incremental Perl releases,
and ran the script I had for pre-Perforce history over the diff and
changelog contained within (by then, it had a personality; it was
called the Timinator: http://perl5.git.perl.org/perl.git/tag/timinator).

Anyway, with that information in place, you then have everything you
need to do a test export. The exporter already has all of the blobs in
the git repository; all it has to do is refer to these in a
fast-export stream. It emits marks as it goes along; once it has
finished an export batch, it waits for the fast-import process to
finish successfully, reads all of the SHA1s corresponding to the marks
it already emitted, and then updates the database tables with the
SHA1s accordingly (there is a rough sketch of this bookkeeping just
before the summary below). Due to extensive use of deferred check
constraints, only then will Postgres let it commit :-). That way, when
I hit "ctrl+c" along the way, I knew everything was safe.
Restartability and robustness in the face of crashes is very useful
for this sort of tool.

Another strange case, which affects some of the largest repositories
in the world, is one I don't have an answer for, but I suspect it can
be represented by either subtree merging or by submodules:

  mv /trunk/ProjectA /trunk/ProjectB/lib/ProjectA

"ProjectA" is now included in "ProjectB" - what is the intent of this?
The first possibility is a subtree merge, the second is that a
submodule is desired. How to represent it in git will depend on what
happens later. If the directory is moved or copied elsewhere, then it
is probably going to be better to represent it as a submodule. And
here, the lesson is: people use SVN in ways which defy a single
mapping into git.

This one in particular affects the KDE project heavily, as directories
are copied around extensively. SVN can remember the history and
produce logs, but it requires the entire repository to be available to
do so. Thiago wrote a tool called svn-fast-export-all, which aimed to
parse the svnadmin dump file and split the data into separate
repositories as it went, but as it is a very long batch job it is
difficult to produce a high quality conversion.
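To make the mark/SHA1 write-back mentioned above a bit more concrete,
here is a rough guess at the shape of that bookkeeping - not the
actual git-p4raw schema, and the statements use DBI-style ?
placeholders:

  -- a column for the exported commit, filled in once fast-import
  -- reports its marks
  ALTER TABLE change_branches ADD COLUMN commit_sha1 char(40);

  -- deferrable constraints are checked only at COMMIT, so a whole
  -- batch of rows can go in before the cross-references have to hold
  ALTER TABLE parents
      ADD CONSTRAINT parents_reference_known_changes
      FOREIGN KEY (parent_branchpath, parent_change)
      REFERENCES change_branches (branchpath, change)
      DEFERRABLE INITIALLY DEFERRED;

  -- after fast-import finishes a batch, read back its marks file and
  -- record the resulting SHA1s before committing the transaction
  UPDATE change_branches
     SET commit_sha1 = ?
   WHERE branchpath = ? AND change = ?;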
Important points to take from this:

 * model the source data cleanly, completely and robustly.
 * start with heuristics; hopefully they will work for people
   following the SVN guide, but allow for human input for when they
   don't.
 * aim for quick export/rewind, and robust operation.
 * this will make it very easy for revisionists to clean up the
   mistakes of the past.

Keep up the good work!

Sam
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html