On 12/04/12 16:28, Florian Achleitner wrote:
>
> I'm not sure if storing this in a separate directory tree makes sense,
> mostly looking at performance. All these files will only contain some
> bytes, I guess. Andrew, why did you choose JSON?
>

JSON has become my default storage format in recent years, so it seemed
like the natural thing to use for a format I wanted to chuck in so I could
get on with my work :)

JSON is my default because it's reasonably space-efficient, human-readable,
widely supported, and can represent everything I care about except
recursive data structures (which I didn't need for this job). You can do
cleverer things if you don't mind being language-specific (e.g. Perl's
"Storable" module supports recursive data structures but can't be used from
other languages) or needing special tools (e.g. git's index is highly
efficient but can't be debugged with `less`). I've found you won't go far
wrong if you start with JSON and pick something else once the requirements
become more obvious.

I gzipped the file because JSON isn't *that* space-efficient, and because
very large repositories are likely to produce enough JSON that people will
notice. Gzipping the file significantly reduced its size without having
much effect on run time. I've attached a sample file representing the first
few commits from the GNU R repository.

The problem I referred to obliquely before isn't with JSON but with gzip:
how would you add more revisions to the end of the file without gunzipping
it, appending one line, then gzipping it again? One very nice feature of a
directory structure is that you could store it in git and get all of that
for free.

To be clear, I'm not pushing any particular solution to this problem, just
offering some anecdotal evidence. I'm pretty sure that exporting SVN
branches is an I/O-bound problem - David Barr has said much the same about
svn-fe, and I was surprised to see I/O remain the bottleneck even for a job
that stripped almost all the data out of the dump and pushed it through
not-particularly-optimised Perl. Having said that, the initial import
problem (potentially hundreds of thousands of revisions needing manual
attention) doesn't necessarily want the same solution as the update problem
(tens of revisions that can almost always be read automatically).

>> . tracing history past branch creation events, using the now-saved
>> copyfrom information.
>>
>> . tracing second-parent history using svn:mergeinfo properties.
>
> This is about detecting when to create a git merge commit, right?

Yes - SVN has always stored metadata about where a directory was copied
from (unlike git, which prefers to detect copies automatically), and since
version 1.5 SVN has attached "svn:mergeinfo" metadata to files and
directories, specifying which revisions of which other files or directories
have been cherry-picked into them.

If you know a directory is a branch, "copyfrom" metadata is a very useful
signal for detecting branches created from it. Unfortunately,
"svn:mergeinfo" is not as useful - aside from anything else, older
repositories often exhibit a period with no merge metadata at all, then a
gradual migration through SVN's early experiments with merge tracking (like
svnmerge.py), before everyone eventually standardises on svn:mergeinfo and
leaves the other tools behind. Oh, and the interface doesn't tell you about
unmerged revisions, so if anybody ever forgets to merge a revision then
you'll probably never notice.
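
In case it helps to see the shape of the data, here's a rough sketch (in
the same not-particularly-optimised Perl, and not code from any patch) of
pulling an svn:mergeinfo value apart. The property value is one source path
per line, each followed by a comma-separated list of revision ranges, with
a trailing "*" marking a non-inheritable merge:

    use strict;
    use warnings;

    # Turn an svn:mergeinfo value into { source_path => [ [from, to], ... ] }.
    sub parse_mergeinfo {
        my ($value) = @_;
        my %merged;
        for my $line (split /\n/, $value) {
            my ($path, $ranges) = split /:/, $line, 2;
            next unless defined $ranges;
            for my $range (split /,/, $ranges) {
                $range =~ s/\*$//;                # "*" = non-inheritable merge
                my ($from, $to) = split /-/, $range;
                $to = $from unless defined $to;   # single revision, e.g. "1025"
                push @{ $merged{$path} }, [ $from, $to ];
            }
        }
        return \%merged;
    }

    # parse_mergeinfo("/branches/feature:1000-1020,1025\n/trunk:1-999")
    # => { '/branches/feature' => [[1000,1020],[1025,1025]],
    #      '/trunk'            => [[1,999]] }

Parsing it is the easy part, of course - deciding what the ranges mean for
merge parents is where the fun starts.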
I'm planning to tackle this stuff in the work I'm doing, but I expect
people will be reporting edge cases until the day the last SVN repository
shuts down. You shouldn't need to worry about it much on the git side of
SBL, which is probably best for your sanity ;)

- Andrew
Attachment:
repo.json.gz
Description: GNU Zip compressed data