Re: GSoC proposal for svn remote helper

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 8 Apr 2011 00:21:26 -0500

(+cc: Eric who brought us git-svn)
Hi Dmitry,

Dmitry Ivankov wrote:

> This is the second iteration of my GSoC proposal

Great; let's iron this out.

> I would like to work on "Remote helper for Subversion and git-svn".
> My major motivation is to make git-svn repository easy to clone, and to make
> git-svn (fetch) faster on huge repositories.

So, my new first impression is that this goal might make things hard[1].

I think replacing git-svn with an imperfect emulation would not leave
people happy.  Existing configurations need to continue to work.

> Project Goals:
> + * Design and create fully functional prototype of new git-svn which is
> cloneable and quite fast.

*If* one does not have this goal ("new git-svn") then there is a
chance to move past some of git-svn's limitations[2].

All that said, these tools could be used to speed up git-svn.  

> By fully functional I mean that it'll be
> able to fetch, push, etc. but probably won't have automatic tags and
> branches discovery and like, but will allow it to be implemented on
> top. Oh, it just hit me that given a path (read trunk) to track and a
> svndump it looks trivial to discover all it's branches - just seek for
> copies.

As mentioned before, this sounds very ambitious.  Once we have a
timeline showing how this breaks down into small steps, the way
forward should hopefully be clearer.
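
To make the "just seek for copies" idea concrete, here is a rough
sketch, not a proposal for the real implementation.  It assumes an
uncompressed "svnadmin dump" stream and relies on the Node-path,
Node-kind, Node-copyfrom-path and Node-copyfrom-rev headers of the
dump format.  The function names and the sample data are made up for
illustration, and it only catches direct copies of the tracked path
itself (no copies of subdirectories, no renames of trunk).

```python
import io


def read_records(stream):
    """Yield the header dict of each dump record, skipping record bodies."""
    while True:
        line = stream.readline()
        while line == b"\n":          # blank lines separate records
            line = stream.readline()
        if not line:                  # end of stream
            return
        headers = {}
        while line not in (b"\n", b""):
            key, _, value = line.decode("utf-8").rstrip("\n").partition(": ")
            headers[key] = value
            line = stream.readline()
        # Bodies (properties and file text) are skipped by length so that
        # file contents are never mistaken for headers.
        stream.read(int(headers.get("Content-length", 0)))
        yield headers


def find_branches(stream, tracked="trunk"):
    """Return (branch path, copied-from rev, created-in rev) tuples."""
    rev = None
    branches = []
    for h in read_records(stream):
        if "Revision-number" in h:
            rev = int(h["Revision-number"])
        elif h.get("Node-kind") == "dir" and h.get("Node-copyfrom-path") == tracked:
            branches.append((h["Node-path"], int(h["Node-copyfrom-rev"]), rev))
    return branches


# Hand-written two-revision dump, for illustration only.
SAMPLE = (
    b"SVN-fs-dump-format-version: 2\n\n"
    b"UUID: 00000000-0000-0000-0000-000000000000\n\n"
    b"Revision-number: 1\nProp-content-length: 10\nContent-length: 10\n\n"
    b"PROPS-END\n\n"
    b"Node-path: trunk\nNode-kind: dir\nNode-action: add\n\n\n"
    b"Revision-number: 2\nProp-content-length: 10\nContent-length: 10\n\n"
    b"PROPS-END\n\n"
    b"Node-path: branches/topic\nNode-kind: dir\nNode-action: add\n"
    b"Node-copyfrom-rev: 1\nNode-copyfrom-path: trunk\n\n\n"
)

if __name__ == "__main__":
    print(find_branches(io.BytesIO(SAMPLE)))
```

Even this toy version shows where the hard cases hide: a branch made
by copying part of trunk, or by copying another branch, needs more
than a single string comparison.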

> + * Get all the needed core git changes merged.

The following is probably controversial.  It's my opinion only.

Since you can't control what other people do, I don't think it's right
to judge your project's success or failure based on whether it gets
merged.  Put another way, the product of your work that can be judged
is not whatever fraction gets accepted in git.git by the end of the
summer[3].

So I think the goal is whatever it is (a working and suitable "git
clone svn://foo" command, say) and getting feedback by pushing changes
upstream and responding to it is a part of how that happens.

Somewhere along the way there will probably be a point of no return --- "if the
design of this patch is not right, I would have to rewrite everything
on top of a redesign of it".  I'd encourage getting input on such
patches _very_ early and working hard to get them merged at least to
"next" (i.e., to have a rough consensus that they are suitable modulo
small tweaks).  I would love it if the proposal included a timeline
pointing out some examples of this.

> Some of these exist already and
> only need help with polishing, reviewing and merging.

Do you mean support for parsing "svnadmin dump --deltas" output?  It
is already polished and reviewed; it's only sitting out-of-tree for
now because it makes the commandline usage awkward and it would be
nice to merge some improvements to that at the same time.

> + * Make the prototype as close to being merged as possible.

That's kind of vague, you know. :)

> Milestones for prototype functionality:
[list of features snipped]

Could you say something about how you would go about implementing
these?

Sorry for the ramble, and thanks for working on this.

Ciao,
Jonathan

[1] git-svn.perl is a work of art and a wonder to behold, and if your aim
is to make a compatible replacement for it, the first step will be to
understand its design deeply.  And the thing is, that much, while
valuable anyway, is pretty hard already.

You see, "git svn" has heuristics for

 - matching up git history to svn history by reading commit messages;
 - pushing mergy history as linear history by rebasing internally
   (dcommit);
 - finding the branches, merges, branch renames, and so on in an
   imperfectly structured history (find_parent etc);
 - deciding which particular paths are relevant (--ignore-paths)

and maintains some of its own data in the repository:

 - a configuration scheme and wide variety of supported configurations;
 - a log for unhandled pieces of history;
 - a cache mapping svn revision numbers to git commits

and people rely a lot on an odd coincidence:

 - using "git svn clone" twice with the same configuration on the same
   repository will, at least most of the time, give the same commit
   names.
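
To illustrate the first heuristic (matching git history to svn history
by reading commit messages), here is a minimal sketch.  It assumes the
usual trailer layout "git-svn-id: <url>@<rev> <uuid>"; the function name
and the example URL and UUID are made up, and git-svn's real logic
handles many more cases than this.

```python
import re

# One trailer per line: "git-svn-id: <url>@<rev> <uuid>".
# The greedy ".+" lets the URL itself contain "@".
GIT_SVN_ID = re.compile(
    r"^git-svn-id: (?P<url>.+)@(?P<rev>\d+) (?P<uuid>\S+)$", re.MULTILINE
)


def svn_origin(message):
    """Return (url, revision, uuid) from the last git-svn-id line, or None."""
    matches = list(GIT_SVN_ID.finditer(message))
    if not matches:
        return None
    last = matches[-1]
    return last.group("url"), int(last.group("rev")), last.group("uuid")


if __name__ == "__main__":
    # Hypothetical commit message, for illustration only.
    example = (
        "Fix a typo\n"
        "\n"
        "git-svn-id: http://svn.example.org/repo/trunk@42 "
        "1234abcd-0000-0000-0000-000000000000\n"
    )
    print(svn_origin(example))
```

Anything that depends on this mapping breaks as soon as a commit
message is rewritten, which is part of why a compatible replacement is
hard.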

[2] Well, it mostly comes down to one limitation.  To give a quick
sketch:

If I clone a repository with "git svn", then I am in a way a
second-class citizen.  The history shown with "git log" is filled
with "git-svn-id:" lines that are not very interesting to me (the
revision number is still interesting, of course).  I cannot use
"git push" to push my work, and in fact I cannot push my work as a
branch reflecting the real development history at all --- I have to
rebase it at the same time as pushing.  Whenever I push, the commit
names for my work change, so other branches based on my work don't
show up in "gitk" as based on my work any more.

Wouldn't it be nicer to be able to do

 alice$ git clone svn::http://svn.apache.org/repos/asf/subversion
 alice$ cd subversion
 alice$ ... hack hack hack ...

 bob$ git clone 'alice:~/src/subversion'
 bob$ cd subversion
 bob$ ... hack hack hack ...; # make some changes on top of alice's work

 alice$ git fetch origin; # anything new upstream?
 alice$ git push origin; # push my changes upstream

 bob$ git remote add upstream svn::http://svn.apache.org/repos/asf/subversion
 bob$ git fetch upstream
 bob$ # push my changes on top of alice's (which were already pushed):
 bob$ git push upstream

That is the dream.  Because there is not a clearly appropriate
one-to-one mapping between possible svn histories and possible git
histories, there are going to have to be limitations[1], but that is
an ideal to strive for.

Sounds hard, maybe?  Yeah, it is, but getting at least fetch support
using the tools David and Ram made sounds easier to me than a fully
compatible replacement for git-svn.
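
For a sense of scale, here is a minimal sketch of the command loop such
a helper would need for the fetch side.  It assumes the remote-helper
convention that a "svn::<address>" URL makes git run a "git-remote-svn"
program and talk to it over stdin/stdout with the "capabilities",
"list" and "import" commands.  The function name, the single advertised
ref, and stopping at the first blank line are simplifications of mine,
and the interesting part (producing the fast-import stream) is left
out entirely.

```python
import io


def serve(commands, out, refs=("refs/heads/master",)):
    """Answer remote-helper commands read from `commands`, writing to `out`."""
    for line in commands:
        cmd = line.rstrip("\n")
        if cmd == "capabilities":
            # One capability per line, terminated by a blank line.
            out.write("import\n\n")
        elif cmd == "list":
            # "?" means the helper does not know the ref's value yet.
            for ref in refs:
                out.write("? %s\n" % ref)
            out.write("\n")
        elif cmd.startswith("import "):
            # A real helper would write a fast-import stream here.
            raise NotImplementedError("fast-import stream generation omitted")
        elif cmd == "":
            break
        else:
            raise ValueError("unsupported command: %r" % cmd)


if __name__ == "__main__":
    reply = io.StringIO()
    serve(io.StringIO("capabilities\nlist\n\n"), reply)
    print(reply.getvalue(), end="")
```

The loop itself is the easy part; all the real work lives behind the
"import" command.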

[3] Meanwhile, just writing and publishing code is not enough, since
the code might have a fatal flaw that means no one will use it ("ivory
tower syndrome").  So what do I mean by the above?

As students work, I hope they will keep the mailing list posted on
their progress and find small pieces to review and merge early.  In
response they might get some questions and suggestions for
improvement; the response to these is just as important as the code.

On one hand this feedback is an important sanity check on the broad
features of your work and a means to get the details right for
inclusion in git (i.e., get it merged).  On the other hand, one should
not let interesting side tracks get in the way of finishing the actual
project; you have to be able to say "no, I will not be working on
that".  Out of these conversations emerge better code and
documentation of the design in the form of list archives.

See [4] for a better explanation of this workflow.

[4] http://thread.gmane.org/gmane.comp.version-control.git/142623/focus=142877