Hi,

Sam Vilain commented on my SoC proposal using Google's SoC interface, and
asked me to CC my response to the mailing list. His original comment is
also quoted below.

----------------------->8----------------------<8-------------------

Hi Sam,

> Hi Ramkumar, I've looked at this proposal and seen that it differs a
> bit from the version on the list, and I can't see the relevant
> discussion, so I'll just throw my bit in here - though note this is a
> technical comment and not a critique of your proposal.

There's been a lot of discussion, some on the list and some more off the
list.

> First, consider not using the SVN API at all while prototyping the
> import part of the chain. Instead, parse the 'svnadmin dump' stream
> from a local mirror. This will allow you to tackle the actual
> problems involved in importing the data effectively, without
> suffering from the brain-damage that is the SVN API. After all, the
> SVN API should be returning you all of the same information that the
> dump stream does, so you can treat making it work using the remote
> access API (eg svn_ra_replay, which is faster for mirroring AIUI) as a
> separate task. You will also more easily spot information which you
> should be extracting from the API, but aren't - it's definitely all in
> the dump format; it has to be. I received similar advice to this
> before building a perforce importer and let's just say it was
> invaluable.

Yes. I've studied the SVN API, and I agree with you: it's quite horrible.
Instead of providing an API that's transparently backward-compatible,
they've provided different methods for different versions. There are also
several variations of certain methods, which is quite confusing.

`svnadmin dump` is exactly how I plan to start out. I've already discussed
this with my to-be mentor, Sverre Rabbelier, and with David Michael-Barr,
who's building a new SVN exporter in his own time.

> Second, consider making the mirror phase emit directly to a tracking
> branch via git-fast-export, that is not intended to be checked out.
> Instead, it contains trees which correspond to revisions in the
> mirrored SVN repository. Directory and file properties can be saved
> in the tree using specially named dotfiles, and revision properties
> can be saved in the commit log. Perhaps I misread your intentions
> with the "stripped down svnsync" part, but syncing to a local SVN
> repository seems to me like a waste of time; people can just do that
> themselves if they choose anyway. An SVN repository can easily be 10
> times the size of the corresponding git store, and it just seems like
> double-handling of the data and will make the whole process slower and
> more cumbersome than it needs to be.
>
> With all the blobs already in the git store, and the information
> needed to perform the data mining operation which is the extraction of
> git-style branch histories from the svn data, you will be working with
> data which is all in git-land, and exporting referencing blobs which
> are already in the store. This will save you a LOT of time, as it
> means in this stage you are not handling the actual file images; just
> constructing branch histories in the git-fast-import stream. Your
> branch miner will potentially be able to process thousands of
> revisions per second this way, even from python.

Agreed. Sverre, David and I discussed exactly this: the final version of
mirror-client will dump all the SVN information to a Git store first, so
we can do the mapping painlessly in Git.
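To make that concrete, here is a rough, illustrative sketch of the kind of
first cut this implies: read the records of an `svnadmin dump` stream and
replay each revision as a commit on a raw tracking ref via git fast-import,
keeping the revision properties verbatim in the commit message. The ref name
(refs/svn/mirror), the fixed committer identity, the `svn-revision` trailer,
and the restriction to plain (non --deltas) dumps are simplifications made
up for this example, not design decisions:

    #!/usr/bin/env python
    # Rough sketch, for illustration only: turn a plain (non --deltas)
    # 'svnadmin dump' stream into a git-fast-import stream that builds a
    # raw tracking ref, one commit per SVN revision, with the revision
    # properties preserved verbatim in the commit message.

    import sys

    def read_record(stream):
        """Return (headers, props, text) for the next dump record, or None at EOF."""
        line = stream.readline()
        while line == b"\n":                       # skip blank separator lines
            line = stream.readline()
        if not line:
            return None
        headers = {}
        while line and line != b"\n":              # 'Key: value' header block
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode()] = value.decode()
            line = stream.readline()
        props = stream.read(int(headers.get("Prop-content-length", 0)))
        text = stream.read(int(headers.get("Text-content-length", 0)))
        return headers, props, text

    def parse_props(blob):
        """Decode a 'K <len> / V <len>' property block ending in PROPS-END."""
        def field(pos):
            nl = blob.index(b"\n", pos)
            length = int(blob[pos:nl].split()[1])  # length on the 'K n'/'V n' line
            start = nl + 1
            return blob[start:start + length], start + length + 1
        props, pos = {}, 0
        while pos < len(blob) and not blob.startswith(b"PROPS-END", pos):
            key, pos = field(pos)
            value, pos = field(pos)
            props[key.decode()] = value.decode("utf-8", "replace")
        return props

    def emit_commit(out, mark, revnum, revprops, changes):
        """Write one SVN revision as a commit on the raw tracking ref."""
        author = revprops.get("svn:author", "nobody").encode()
        # svn:log as the message body, every other revprop as a trailer,
        # so no revision-level information is lost.
        msg = (revprops.get("svn:log", "").rstrip("\n") + "\n\n" +
               "svn-revision: %d\n" % revnum +
               "".join("%s: %s\n" % (k, v)
                       for k, v in sorted(revprops.items()) if k != "svn:log"))
        msg = msg.encode("utf-8")
        out.write(b"commit refs/svn/mirror\n")
        out.write(b"mark :%d\n" % mark)
        out.write(b"committer %s <%s@mirror> 0 +0000\n" % (author, author))
        out.write(b"data %d\n%s\n" % (len(msg), msg))
        if mark > 1:                               # chain onto the previous revision
            out.write(b"from :%d\n" % (mark - 1))
        for headers, text in changes:
            path = headers["Node-path"].encode()
            action = headers["Node-action"]
            if action in ("add", "change") and "Text-content-length" in headers:
                out.write(b"M 100644 inline %s\n" % path)
                out.write(b"data %d\n%s\n" % (len(text), text))
            elif action == "delete":
                out.write(b"D %s\n" % path)
            # directory nodes, copies and svn properties are glossed over here
        out.write(b"\n")

    def main():
        dump, out = sys.stdin.buffer, sys.stdout.buffer
        mark, revnum, revprops, changes = 0, None, {}, []
        while True:
            record = read_record(dump)
            if record is None:
                break
            headers, props, text = record
            if "Revision-number" in headers:
                if revnum is not None:             # flush the previous revision
                    mark += 1
                    emit_commit(out, mark, revnum, revprops, changes)
                revnum = int(headers["Revision-number"])
                revprops, changes = parse_props(props), []
            elif "Node-path" in headers:
                changes.append((headers, text))
            # other records are the dump-format-version and UUID preamble
        if revnum is not None:
            emit_commit(out, mark + 1, revnum, revprops, changes)

    if __name__ == "__main__":
        main()

Run inside an empty repository with something like `svnadmin dump
/path/to/repo | python svn-raw-import.py | git fast-import` (the script name
is just a placeholder), the whole history ends up under refs/svn/mirror,
where the second-stage branch miner can walk it without touching SVN again.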
There are some concerns about information loss though, which we'll have to
deal with as we go on.

> Also bear in mind
> that people might use SVN in a way that violates the expectations of
> this branch miner. An example is putting a README file in the
> top-level projects directory; a heuristic approach might consider that
> the start of a new project and then mess up later stages. Another
> example is people accidentally deleting trunk and re-adding it; the
> nice thing about this two-stage approach is that it allows advanced
> users to muck with the "raw" data (ie, this whole repository tracking
> branch) using git to do things like graft away the bad revisions, and
> then the second stage will use the corrected data. Of course
> eventually, this detail will be hidden by the remote helper.

Excellent suggestion! I'll attempt to build the plumbing for the mapping in
a manner that exposes a sane interface.

> As a general comment - you must be careful in trying to assume that
> what you are attempting is even possible. Sure, you want 'git clone
> svn://example.com/myrepo' to work, but what does that mean? A
> repository in SVN is a filesystem, which can contain multiple
> projects. In git, a repo is a single project. People might expect to
> be able to clone the trunk URL for instance. My advice there is to
> not support that use case at all, it's a complete can of worms which
> you will discover as you tackle the conversion algorithms. Just focus
> on making the case where the complete repository is mirrored work for
> this project. Mining a single branch out of SVN without all data
> available is the domain of git-svn and really you don't want to go
> there.

Hm, this is something that I hadn't thought about earlier. Thanks for the
suggestion. I will not attempt to go into the complicated cases, at least
not during the summer term.

> Anyway like I say, please follow-up on the mailing list, and this
> advice can receive wider scrutiny.

Thank you for your valuable comment!

--
Ram