Update on SoC proposal: git-remote-svn

Hi,

Sam Vilain commented on my SoC proposal using Google's SoC interface,
and asked me to CC my response to the mailing list. His original
comment is quoted below.

----------------------->8----------------------<8-------------------
Hi Sam,

> Hi Ramkumar, I've looked at this proposal and seen that it differs a
> bit from the version on the list, and I can't see the relevant
> discussion, so I'll just throw my bit in here - though note this is a
> technical comment and not a critique of your proposal.

There's been a lot of discussion, some on the list, and some more off
the list.

> First, consider not using the SVN API at all while prototyping the
> import part of the chain.  Instead, parse the 'svnadmin dump' stream
> from a local mirror.  This will allow you to tackle the actual
> problems involved and importing the data effectively, without
> suffering from the brain-damage that is the SVN API.  After all, the
> SVN API should be returning you all of the same information that the
> dump stream does, so you can treat making it work using the remote
> access API (eg svn_ra_replay, which is faster for mirroring AIUI) as a
> separate task.  You will also more easily spot information which you
> should be extracting from the API, but aren't - it's definitely all in
> the dump format; it has to be.  I received similar advice to this
> before building a perforce importer and let's just say it was
> invaluable.

Yes. I've studied the SVN API, and I agree with you: it's quite
horrible. Instead of providing an API that's transparently
backward-compatible, they've provided different methods for different
versions. There are also several variants of certain methods, which
is quite confusing.

`svnadmin dump` is exactly how I plan to start out. I've already
discussed this with my to-be mentor, Sverre Rabbelier, and with David
Michael-Barr, who's building a new SVN exporter in his own time.
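
To make the dump-stream idea concrete, here is a minimal sketch of a
reader for `svnadmin dump` output. It is not the code I plan to ship:
it assumes a non-delta (version 2) dump, and the function names
(read_headers, parse_props, iter_records) are made up for this
example. Each record is a block of "Key: value" headers, optionally
followed by Content-length bytes of body, where the first
Prop-content-length bytes hold the properties.

```python
import io

def read_headers(stream):
    """Read one 'Key: value' header block; return None at end of stream."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # EOF
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                    # a blank line ends the block
                return headers
            continue                       # skip blank separators between records
        key, _, value = line.partition(b": ")
        headers[key.decode()] = value.decode()

def parse_props(data):
    """Parse a 'K len / key / V len / value' section ending in PROPS-END."""
    props = {}
    buf = io.BytesIO(data)
    key = None
    while True:
        line = buf.readline().rstrip(b"\n")
        if not line or line == b"PROPS-END":
            return props
        tag, length = line.split(b" ")
        payload = buf.read(int(length))
        buf.read(1)                        # newline that follows every payload
        if tag == b"K":
            key = payload.decode()
        elif tag == b"V":
            props[key] = payload           # values stay as raw bytes

def iter_records(stream):
    """Yield (headers, props, text) for every record in the dump stream."""
    while True:
        headers = read_headers(stream)
        if headers is None:
            return
        body = stream.read(int(headers.get("Content-length", 0)))
        plen = int(headers.get("Prop-content-length", 0))
        props = parse_props(body[:plen]) if plen else {}
        yield headers, props, body[plen:]
```

A record with a Revision-number header starts a new revision, and the
records with a Node-path header that follow it describe the changes
made in that revision. The stream passed in would be the standard
output of `svnadmin dump` run against a local mirror.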

> Second, consider making the mirror phase emit directly to a tracking
> branch via git-fast-export, that is not intended to be checked out.
> Instead, it contains trees which correspond to revisions in the
> mirrored SVN repository.  Directory and file properties can be saved
> in the tree using specially named dotfiles, and revision properties
> can be saved in the commit log.  Perhaps I misread your intentions
> with the "stripped down svnsync" part, but syncing to a local SVN
> repository seems to me like a waste of time; people can just do that
> themselves if they choose anyway.  An SVN repository can easily be 10
> times the size of the corresponding git store, and it just seems like
> double-handling of the data and will make the whole process slower and
> more cumbersome than it needs to be.

> With all the blobs already in the git store, and the information
> needed to perform the data mining operation which is the extraction of
> git-style branch histories from the svn data, you will be working with
> data which is all in git-land, and exporting referencing blobs which
> are already in the store.  This will save you a LOT of time, as it
> means in this stage you are not handling the actual file images; just
> constructing branch histories in the git-fast-import stream.  Your
> branch miner will potentially be able to process thousands of
> revisions per second this way, even from python.

Agreed. Sverre, David and I discussed exactly this: the final version
of mirror-client will dump all the SVN information to a Git store
first, so we can do the mapping painlessly in Git. There are some
concerns about information loss though, which we'll have to deal with
as we go on.
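
As a rough illustration of the first stage, the sketch below writes
one SVN revision as a commit in a git-fast-import stream. The ref
name, the "svn-revision:" trailer, the "@svn.invalid" email domain and
the function names are all assumptions made for this example, not
decisions we have taken. It also assumes paths that need no quoting
and treats every file as a regular, non-executable file.

```python
import io

def data_cmd(payload):
    """Format a fast-import 'data' command with an exact byte count."""
    return b"data %d\n%s\n" % (len(payload), payload)

def emit_revision(out, ref, revnum, author, timestamp, log, changes):
    """Write one SVN revision as a fast-import commit on `ref`.

    `changes` maps a repository path (bytes) to the new file contents
    (bytes), or to None when the path was deleted in this revision.
    """
    # Keep the revision number in the commit log, as suggested above.
    message = log + b"\n\nsvn-revision: %d\n" % revnum
    out.write(b"commit " + ref + b"\n")
    out.write(b"committer %s <%s@svn.invalid> %d +0000\n"
              % (author, author, timestamp))
    out.write(data_cmd(message))
    for path, content in sorted(changes.items()):
        if content is None:
            out.write(b"D " + path + b"\n")
        else:
            out.write(b"M 100644 inline " + path + b"\n")
            out.write(data_cmd(content))
    out.write(b"\n")                       # optional blank line ending the commit
```

Since no `from` line is written, fast-import takes the current tip of
the ref as the parent, so calling emit_revision once per revision in
order builds a linear history on the tracking branch. The second stage
can then refer to blobs that are already in the store instead of
handling file contents again.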

> Also bear in mind
> that people might use SVN in a way that violates the expectations of
> this branch miner.  An example is putting a README file in the
> top-level projects directory, a heuristic approach might consider that
> the start of a new project and then mess up later stages.  Another
> example is people accidentally deleting trunk and re-adding it; the
> nice thing about this two-stage approach is that it allows advanced
> users to muck with the "raw" data (ie, this whole repository tracking
> branch) using git to do things like graft away the bad revisions, and
> then the second stage will use the corrected data.  Of course
> eventually, this detail will be hidden by the remote helper.

Excellent suggestion! I'll attempt to build the plumbing for the
mapping in a manner that exposes a sane interface.
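
To show the kind of interface I have in mind, here is a deliberately
naive sketch of the path-to-branch mapping. It assumes a single
project using the conventional trunk/branches/tags layout, and the
function name and ref names are hypothetical. Anything outside that
layout, such as a README at the top level, is reported as unmapped
rather than guessed at, so that it cannot confuse later stages.

```python
def classify_path(path):
    """Map an SVN path to (git ref, path inside that ref), or None.

    Assumes the conventional single-project layout; real repositories
    may violate it, which is why unmapped paths return None.
    """
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "refs/heads/master", "/".join(parts[1:])
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        prefix = "refs/heads/" if parts[0] == "branches" else "refs/tags/"
        return prefix + parts[1], "/".join(parts[2:])
    return None                            # outside the expected layout
```

A real branch miner would also have to follow copies between
directories and cope with trunk being deleted and re-added. This only
shows where a user-correctable mapping could plug in.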

> As a general comment - you must be careful in trying to assume that
> what you are attempting is even possible.  Sure, you want 'git clone
> svn://example.com/myrepo' to work, but what does that mean?  A
> repository in SVN is a filesystem, which can contain multiple
> projects.  In git, a repo is a single project.  People might expect to
> be able to clone the trunk URL for instance.  My advice there is to
> not support that use case at all, it's a complete can of worms which
> you will discover as you tackle the conversion algorithms.  Just focus
> on making the case where the complete repository is mirrored work for
> this project.  Mining a single branch out of SVN without all data
> available is the domain of git-svn and really you don't want to go
> there.

Hm, this is something that I hadn't thought about earlier. Thanks for
the suggestion. I will not attempt the complicated cases, at least
during my summer term.

> Anyway like I say, please follow-up on the mailing list, and this
> advice can receive wider scrutiny.

Thank you for your valuable comment!

-- Ram