[Announce] Availability of a Java based Subversion to Git conversion program with automatic (and pluggable) branch detection for whole repository conversions


 



Hello, 

For the past 8 months I've been working on the process and tooling to convert the Kuali Student Subversion repository to Git, and to support CI on pull requests with automatic merging to trunk once the builds are green and the appropriate sign-off has been provided.

The Kuali Student project is being restructured, so my work on CI was halted, but I was able to get the repository converted and placed on GitHub: https://github.com/kuali-student/archived-from-svn (contains revisions r1 through r77740 in 20,631 branches and 95,297 Git commits).

This work will not be officially supported by the Kuali Foundation in the future, but I'm personally interested in seeing other projects use it to convert from Subversion to Git (and in finding any edge cases it might not be handling correctly right now).

Since our conversion was successful I wanted to alert users on both the Git and JGit mailing lists about the Java based conversion program that I wrote to perform the conversion.

The code and some cursory instructions on how to use it are located here: https://github.com/kuali-student/git-repository-tools. The current version can be downloaded from Maven Central: http://search.maven.org/#artifactdetails|org.kuali.student.repository|git-importer|0.0.4|jar

It is intended for larger repositories like Kuali Student, with around 100,000 or more revisions, many product streams, and branch naming strategies that are questionable at times. Instead of making you enumerate which branches you want, this conversion program extracts everything.

You can specify a per-repository branch detection strategy to handle non-standard cases so that the history makes more sense when you look back through it, but this shouldn't be needed to get an accurate repository.

Key features:

. Full repository conversion by parsing Subversion version 2 dump streams and writing into a bare Git repository.

. Automatic branch detection 
o We figure out how to split the full path to a blob into a branch part and a file part.  The branch part becomes the name of the branch and the file part is placed into the Tree object of the commit.
o We track copy-from information, similar (I think) to how git-svn does it, so that for each Subversion revision we have a list of all of the Git branches and their heads at that point in time.
o We convert everything, but if the branch naming is non-standard you can end up with branches containing subdirectories that really should have been separate branches themselves.
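
To make the path split concrete, here is a minimal sketch. It assumes the conventional trunk/branches/tags layout; the class name, the method name and the example path are hypothetical, and this is not the git-importer's actual detection code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the git-importer's actual detection code. */
final class BranchPathSplitter {

    /** Result holder: the branch part plus the file path inside that branch. */
    static final class Split {
        final String branch;
        final String file;

        Split(String branch, String file) {
            this.branch = branch;
            this.file = file;
        }
    }

    /**
     * Assumed convention: the branch ends at "trunk", or at the first path
     * segment after "branches" or "tags". Returns null when no branch is found.
     */
    static Split split(String svnPath) {
        List<String> parts = Arrays.asList(svnPath.split("/"));
        for (int i = 0; i < parts.size(); i++) {
            String p = parts.get(i);
            int end = -1; // index of the last segment that belongs to the branch
            if (p.equals("trunk")) {
                end = i;
            } else if ((p.equals("branches") || p.equals("tags")) && i + 1 < parts.size()) {
                end = i + 1;
            }
            if (end >= 0) {
                String branch = String.join("/", parts.subList(0, end + 1));
                String file = String.join("/", parts.subList(end + 1, parts.size()));
                return new Split(branch, file);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The path below is made up for illustration.
        Split s = split("enrollment/ks-api/branches/2.0.x/pom.xml");
        System.out.println(s.branch + " | " + s.file);
    }
}
```

A path that matches none of these names returns null here, which is roughly where a per-repository strategy would have to take over.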

. Plugin Mechanism to define custom per repository branch detection (before falling back on the standard mechanisms)
o See how the student-plugin (https://github.com/kuali-student/git-repository-tools/tree/master/git-importer-student-plugin) was set up with its own custom branch detection logic (it took lots of conversion iterations to arrive at that scheme).
o Also look at how unit testing can be done on the repository specific branch scenarios.

. Fusion instead of submodules for svn:externals
o The fusion-maven-plugin was created and its fuse mojo does what is essentially a multi-subtree merge: it turns the aggregate branch (the one where the svn:externals property was set) into a commit whose tree contains actual materialized subdirectories, each holding the tree of the corresponding module branch.
o The git-importer leaves fusion-maven-plugin.dat files in the root of the commit trees wherever svn:externals had been set in Subversion, so that this fusion process can be applied at a later point in time.

. Fairly fast
o Creating the subversion mirror and dump files can take some time
o The KS Subversion repository mirror was 8.8 GB, but that turned into about 20 GB of bzip2-compressed Subversion version 2 dump streams.
o Running against the existing dump files, the importer took about 12 hours to perform the full conversion on a low-end Core 2 Duo (3 GHz) writing to a RAID-0 array of 7200 RPM disks.

. Accurate
o I compared our key release tags and development branches against Subversion. For each one I did a Subversion export of the equivalent of the Git branch (based on the path and revision in the commit comment), added the result into Git, and then ran a git diff to make sure there were no differences. When I found differences I tracked them down, added unit tests to reproduce them, and fixed them.


Additional Cleanup Programs:

The git-repository-tools repository also includes the cleanup programs we used: https://github.com/kuali-student/git-repository-tools/tree/master/git-repo-cleaner

Our initial converted repo was 2.8 GB (the final one was 1.3 GB), so we looked at splitting the commit graph based on date and then using Git grafts to give developers the full history.  The splitting program would find the split point and write a grafts file for later use.
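
For readers who have not used grafts: a grafts file (.git/info/grafts) holds one line per grafted commit, made of the commit id followed by the ids of the parents it should appear to have, separated by spaces. The sketch below only formats such a line. The class name and the commit ids are hypothetical placeholders, and this is not the splitting program's own code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of formatting one grafts-file line. */
final class GraftLine {

    /** Returns "<commit> <parent1> <parent2> ..." for a single graft entry. */
    static String format(String commit, List<String> parents) {
        StringBuilder sb = new StringBuilder(commit);
        for (String parent : parents) {
            sb.append(' ').append(parent);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder ids: the first commit of the recent half of the history
        // is given the last commit of the older half as its parent.
        String firstRecent = "1111111111111111111111111111111111111111";
        String lastOld = "2222222222222222222222222222222222222222";
        System.out.println(format(firstRecent, Arrays.asList(lastOld)));
    }
}
```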

But we found out that a certain kind of database file was taking up a lot of space (over 50% of the converted repository), so instead of splitting we did three cleanup operations:
1. Remove the content of all .mpx database files (SQL files built a database, which was then dumped out as CSV data stored in files ending in .mpx; those are what we removed).
2. Remove two big files, one of them > 100 MB, which was blocking the GitHub upload.
3. Rewrite all of the commits rewritten in steps 1 and 2 to update the fusion-maven-plugin.dat files generated by the exporter to use the latest commit ids (so that fusion would work).

Step 3 was interesting because the fusion-maven-plugin.dat files essentially contain extra parentage information, so we needed to sort the list of 95,297 commits in such a way that each commit's real and fusion parents were emitted before the commit itself.   I was able to use the EWAH compressed bitmap for this purpose (I used the EWAH bitmaps directly but was inspired by the JGit packfile bitmap implementation).
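
As a rough illustration of that ordering constraint, the sketch below emits a commit only once all of its parents, real and fusion alike, have been emitted. It uses java.util.BitSet as a stand-in for the EWAH bitmap and plain integer indexes instead of commit ids; the class and method names are hypothetical and this is not the git-repo-cleaner implementation. It makes repeated passes over the commits, which is simple rather than efficient.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of a parents-first ordering; not the git-repo-cleaner code. */
final class ParentsFirstOrder {

    /**
     * parents[i] lists the indexes of every parent of commit i, real and
     * fusion alike. Returns the commit indexes ordered so that each commit
     * appears after all of its parents. Throws if the links contain a cycle.
     */
    static List<Integer> order(int[][] parents) {
        int n = parents.length;
        BitSet emitted = new BitSet(n); // stand-in for the EWAH bitmap
        List<Integer> out = new ArrayList<>(n);
        boolean progress = true;
        while (out.size() < n && progress) {
            progress = false;
            for (int c = 0; c < n; c++) {
                if (emitted.get(c)) {
                    continue;
                }
                boolean ready = true;
                for (int p : parents[c]) {
                    if (!emitted.get(p)) {
                        ready = false;
                        break;
                    }
                }
                if (ready) {
                    emitted.set(c);
                    out.add(c);
                    progress = true;
                }
            }
        }
        if (out.size() < n) {
            throw new IllegalStateException("cycle in parent links");
        }
        return out;
    }

    public static void main(String[] args) {
        // Commit 0 has a real parent 2 and a fusion parent 1; 1 has parent 2; 2 is a root.
        int[][] parents = { {2, 1}, {2}, {} };
        System.out.println(order(parents));
    }
}
```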

The https://github.com/kuali-student/git-repository-tools/blob/master/git-repo-cleaner/src/main/java/org/kuali/student/git/cleaner/AbstractRepositoryCleaner.java class can be extended to support other use cases.

It loads all of the commits in the repository in a parents-first ordering and then provides hooks for doing different things with them.  It takes care of updating the branch and tag references to point at the rewritten commits.

All of this code is licensed under the Educational Community License, Version 2.0 (an add-on to the Apache License, Version 2.0).

Hopefully it will be useful to others, 

Regards, 

Michael

--
Michael O'Cleirigh
Java Developer 
Enterprise Applications and Solutions Integration (EASI)
University of Toronto




