On 03/04/12 01:09, Jonathan Nieder wrote:
> Andrew Sayers wrote:
>
>> Sorry, that wasn't clear. I meant commands that just expose a single
>> primitive bit of functionality (like git-commit-tree) instead of those
>> that present an abstract interface to the whole git machinery (like
>> git-fast-import).
>
> Ok. I think you are misunderstanding the purpose of fast-import[1] but
> it doesn't take away from what you're saying.

I had certainly missed the "ls" command - having seen that, I agree fast-import is the best solution to this problem.

I'm still a bit concerned about fast-import as a learning tool, although this is a bit of a meta-conversation as far as GSoC is concerned. Personally, I like to learn things by understanding the basic building blocks, then seeing how to construct things from them. I found git easy to learn because I could start with the basic data structures and algorithms, then layer an approximation of a patches-and-tarballs workflow on top of them. I would expect a discussion of the problem in terms of primitive commands like git-commit-tree to suit that learning style, although I am committing a logical fallacy by assuming that everyone thinks like me until proven otherwise :)

I think a lot of learners want to play a bit, make some informative mistakes, then flesh out their understanding with something a bit more technical. People who want to "look under the hood" are well served by git, because they can use the ordinary interface (status/commit/branch/etc.) and then read the source when they're ready. People who want to "peek behind the curtain" at a communication stream would be well served by fast-import - if only there were a curtain for them to peek behind.

I'd be interested to know what git learners think, but I'd feel more comfortable pointing students at fast-import if there were a FUSE module, or a shell, or some other interface on top of it whose failure mode was a puzzling mess instead of a safely inert repository.
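For what it's worth, the "curtain" is just the text stream fast-import reads on stdin, and it's small enough to build by hand. Here's a minimal sketch of a one-commit stream; the file name, author, and timestamp are invented for illustration, and this is meant as a learning aid rather than anything svn-fe would emit:

```python
# Sketch of a minimal git fast-import stream, built by hand so the
# framing ("data <byte count>" followed by raw bytes) is visible.
# The file name, author, and timestamp below are made up.

def data(payload: bytes) -> bytes:
    """Frame a payload the way the fast-import 'data' command expects."""
    return b"data %d\n" % len(payload) + payload + b"\n"

blob = b"hello, world\n"
message = b"initial import\n"

stream = b"".join([
    b"blob\n",                  # define a blob...
    b"mark :1\n",               # ...and give it a mark to refer to later
    data(blob),
    b"commit refs/heads/master\n",
    b"mark :2\n",
    b"committer A U Thor <author@example.com> 1333497600 +0000\n",
    data(message),
    b"M 100644 :1 greeting.txt\n",  # add the marked blob as a file
    b"\n",
])

# Piping this into `git fast-import` inside a fresh repository would
# create a single commit containing greeting.txt.
print(stream.decode())
```

A learner can tweak one line at a time and re-run it against a throwaway repository, which is roughly the "informative mistakes" workflow described above.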
Incidentally Florian, some of the above probably spoke to you, other bits probably less so. It took me several years after leaving university to recognise my own learning style, so if you find it hard to learn git one way, try some different approaches before assuming it's a personal problem :)

>> I agree it's possible to use fast-import for this problem, but it seems
>> like it's redundant after svn-fe has already loaded everything into git.
>
> Right, I missed your point here before. The fundamental question is
> not about what commands to use but about the order of operations.
>
> 1. In one scheme, first you import the whole tree without splitting it
>    into branches, with a tool like svn-fe. Afterwards, you
>    postprocess the resulting repository with tools like "git
>    filter-branch --subdirectory-filter". The result of the import can
>    depend on all revisions --- you can say, in rev 1, "I'm not sure
>    whether this new directory is a branch; let me see how it develops
>    by rev 1000 to decide how to process it".
>
> 2. In another scheme, you only import the subset of the repository
>    you are interested in. This is what git-svn does, for example.
>    This requires the branch discovery to happen at the same time as
>    the import, because otherwise there is no way to tell what subset
>    of the repository you are actually interested in.
>
> 3. Lastly, in yet another scheme, you import the whole tree and it is
>    split into branches on the fly.
> The advantages relative to (1) are:
>
>  - impatient people can peek at the partial result of the import as
>    it happens
>
>  - the result of importing rev n is guaranteed to depend only on
>    revs <= n, so different people importing at different times will
>    get the same commits (assuming nobody is rewriting early history
>    behind the scenes), and it is obvious how to support incremental
>    imports to expand a repository with all revs <= n to a
>    repository with all revs <= 2n
>
> However, if splitting branches can only happen during the initial
> import, that makes it harder to tweak the configuration and try
> again to see what changes.

That's a good way of putting the question, but for SVN it's useful to distinguish between trunk and non-trunk branches. I previously[1] suggested this algorithm for deciding whether a directory is a branch:

A directory is a branch if...

1. it is not a subdirectory of an existing branch; and
2. either:
   2a. it is in a list of branches specified by the user, or
   2b. it is copied from a (subdirectory of a) branch

This is a pretty solid heuristic for detecting branches copied from an existing branch, even in scheme (2) or (3), but it does absolutely nothing for trunk detection. Although trunk detection is trivial in the sane case (the "trunk" directory is the one and only trunk, end of story), here's a contrived example of why it's hard in the general case:

Our SVN newbie creates "scratchpad/libfoo/foo.c" in revision 1. He spends the next 1,000 revisions working in scratchpad/libfoo, creating the fooiest foo that ever did foo. After that, he creates "scratchpad/libbar/bar.c" and continues for another thousand revisions. This cycle repeats until he's finally ready to tie all his libraries together.
It's only now that he finally decides whether to create "scratchpad/main.c" (if he thinks "scratchpad" is the trunk), "trunk/main.c" (if he thinks all the subdirectories of scratchpad were trunks), or "scratchpad/main/main.c" (if he wants to give me an aneurysm worrying how to cope when he does `svn cp scratchpad/main scratchpad`).

I paused after writing the paragraph above, because the last part got me thinking. Copying a subdirectory to its parent directory isn't actually possible in SVN, but the concept of "branch absorption" is an interesting one. In theory, we could say that "scratchpad/libfoo" and "scratchpad/libbar" were trunk branches at first, but were deleted when the "scratchpad" branch was created. I'll have to check whether this leads to undesirable results in the real world, but it might make it possible to do on-the-fly trunk detection as described in scheme (3).

> The relevant technical difference is that in the naive implementation
> of scheme (2) you can make use of arbitrary information available over
> the svn protocol, in naive scheme (3) you can only use information that
> makes it into the fast-import stream, and in naive scheme (1) you can
> only use information that makes it into the actual git repository. So
> to use scheme (1) you need to make sure svn-fe stores all interesting
> data in a visible way, including copyfrom info (which is not a bad
> idea anyway).

The approach I'm looking at is to extract information from an SVN dump at an early stage, then use the extracted information when the user tidies up the SBL file. This was originally a simple optimisation (reading a small gzipped JSON file is much faster than reading an SVN dump that's 99% file bodies you don't care about), but it wouldn't be too hard to teach svn-fe to produce the file if you were so inclined.
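To make the earlier branch rule concrete, here is an illustrative sketch of applying it to records extracted from a dump. None of this is code from svn-fe or the SBL tooling: the record shape (revision, path, optional copyfrom path) and every name below are invented, standing in for whatever the extracted-JSON file would actually contain.

```python
# Sketch of the branch rule from earlier in the thread: a directory is
# a branch if (1) it is not a subdirectory of an existing branch, and
# (2) it is either listed by the user or copied from (a subdirectory
# of) a known branch.  Record format and names are hypothetical.

def inside(path, branch):
    """True if path is the branch itself or a subdirectory of it."""
    return path == branch or path.startswith(branch + "/")

def detect_branches(records, configured):
    """records: (revision, path, copyfrom-path-or-None) tuples in
    revision order.  Returns the set of detected branch paths."""
    branches = set()
    for rev, path, copyfrom in records:
        # rule 1: never a branch if nested inside an existing branch
        if any(path != b and inside(path, b) for b in branches):
            continue
        # rule 2a: explicitly listed by the user
        if path in configured:
            branches.add(path)
        # rule 2b: copied from (a subdirectory of) a known branch
        elif copyfrom is not None and any(inside(copyfrom, b)
                                          for b in branches):
            branches.add(path)
    return branches

history = [
    (1, "trunk", None),            # 2a: in the user's list
    (2, "trunk/src", None),        # rule 1: inside an existing branch
    (3, "branches/foo", "trunk"),  # 2b: copied from a branch
    (4, "scratchpad", None),       # neither rule fires
]
print(sorted(detect_branches(history, configured={"trunk"})))
# -> ['branches/foo', 'trunk']
```

As the example shows, "scratchpad" is never detected, which is exactly the trunk-detection gap described above.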
	- Andrew

[1] http://article.gmane.org/gmane.comp.version-control.git/192286