On 12/04/12 16:28, Florian Achleitner wrote:
>
> I'm not sure if storing this in a separate directory tree makes sense,
> mostly looking at performance. All these files will only contain some
> bytes, I guess. Andrew, why did you choose JSON?
>

JSON has become my default storage format in recent years, so it seemed
like the natural thing to use for a format I wanted to chuck in so I could
get on with my work :)

JSON is my default because it's reasonably space-efficient, human-readable,
widely supported, and can represent everything I care about except
recursive data structures (which I didn't need for this job). You can do
cleverer things if you don't mind being language-specific (e.g. Perl's
"Storable" module supports recursive data structures but can't be used from
other languages) or needing special tools (e.g. git's index is highly
efficient but can't be debugged with `less`). I've found you won't go far
wrong if you start with JSON and pick something else once the requirements
become more obvious.

I gzipped the file because JSON isn't *that* space-efficient, and because
very large repositories are likely to produce enough JSON that people will
notice. Gzipping the file significantly reduced its size without having
much effect on run time. I've attached a sample file representing the first
few commits from the GNU R repository.

The problem I referred to obliquely before isn't with JSON but with gzip:
how would you add more revisions to the end of the file without gunzipping
it, appending one line, then gzipping it again? One very nice feature of a
directory structure is that you could store it in git and get all of that
for free.

To be clear, I'm not pushing any particular solution to this problem, just
offering some anecdotal evidence. I'm pretty sure that exporting SVN
branches is an I/O-bound problem - David Barr has said much the same about
svn-fe, and I was surprised to see I/O remain the bottleneck even for a job
that stripped almost all the data out of the dump and pushed it through
not-particularly-optimised Perl. Having said that, the initial import
problem (potentially hundreds of thousands of revisions needing manual
attention) doesn't necessarily want the same solution as the update problem
(tens of revisions that can almost always be read automatically).

>> . tracing history past branch creation events, using the now-saved
>> copyfrom information.
>>
>> . tracing second-parent history using svn:mergeinfo properties.
>
> This is about detecting when to create a git merge commit, right?

Yes - SVN has always stored metadata about where a directory was copied
from (unlike git, which prefers to detect copies automatically), and since
version 1.5 SVN has attached "svn:mergeinfo" metadata to files and
directories, specifying which revisions of which other files or directories
have been cherry-picked into them.

If you know a directory is a branch, "copyfrom" metadata is a very useful
signal for detecting branches created from it. Unfortunately,
"svn:mergeinfo" is not as useful - aside from anything else, older
repositories often exhibit a period with no merge metadata at all, then a
gradual migration through SVN's early experiments with merge tracking (like
svnmerge.py), before everyone eventually standardises on svn:mergeinfo and
leaves the other tools behind. Oh, and the interface doesn't tell you about
unmerged revisions, so if anybody ever forgets to merge a revision then
you'll probably never notice.
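
In case it helps to see the shape of the data, here's a rough sketch (in
the same not-particularly-optimised Perl, and not code from any patch) of
pulling an svn:mergeinfo value apart. The property value is one source path
per line, each followed by a comma-separated list of revision ranges, with
a trailing "*" marking a non-inheritable merge:

    use strict;
    use warnings;

    # Turn an svn:mergeinfo value into { source_path => [ [from, to], ... ] }.
    sub parse_mergeinfo {
        my ($value) = @_;
        my %merged;
        for my $line (split /\n/, $value) {
            my ($path, $ranges) = split /:/, $line, 2;
            next unless defined $ranges;
            for my $range (split /,/, $ranges) {
                $range =~ s/\*$//;                # "*" = non-inheritable merge
                my ($from, $to) = split /-/, $range;
                $to = $from unless defined $to;   # single revision, e.g. "1025"
                push @{ $merged{$path} }, [ $from, $to ];
            }
        }
        return \%merged;
    }

    # parse_mergeinfo("/branches/feature:1000-1020,1025\n/trunk:1-999")
    # => { '/branches/feature' => [[1000,1020],[1025,1025]],
    #      '/trunk'            => [[1,999]] }

Parsing it is the easy part, of course - deciding what the ranges mean for
merge parents is where the fun starts.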
I'm planning to tackle this stuff in the work I'm doing, but I expect
people will be reporting edge cases until the day the last SVN repository
shuts down. You shouldn't need to worry about it much on the git side of
SBL, which is probably best for your sanity ;)

- Andrew
Attachment:
repo.json.gz
Description: GNU Zip compressed data