Re: git-svnimport failed and now git-repack hates me

On Thu, 4 Jan 2007, Chris Lee wrote:
>
> Unfortunately, that's how the KDE repo is organized. (I tried arguing
> against this when they were going to do the original import, but I
> lost the argument.) And git-svnimport doesn't appear to have any sort
> of method for splitting a gigantic svn repo into several smaller git
> repos.

Well, the good news is, I think we could probably split it up from within 
git. It's not fundamentally hard, although it is pretty damn expensive 
(and it would require the subproject support to do really well).

So ignore that issue for now. I'd love to see the end result, if only 
because it sounds like you have a test-case for git that is four times 
bigger than the mozilla archive - even if it's just because of some really 
really stupid design decisions from the KDE SVN maintainers ;)

(But I would actually expect that KDE SVN uses SVN subprojects, so 
hopefully it's not _really_ one big repository. Of course, I don't know if 
SVN really does subprojects or how well it does them, so that's just a 
total guess).

The real problem with a SVN import is that I think SVN doesn't do merges 
right, so you can't import merge history properly (well, you can, if you 
decide that "properly" really means "SVN can't merge, so we can't really 
show it as merges in git either").

I think both git-svn and git-svnimport can _guess_ about merges, but it's 
just a heuristic, afaik. Whether it's a good one, I don't know.

> Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
> having it automatically repack every thousand revisions or so would
> probably be a pretty big win.

That, or making it use the same "fastimport" that the hacked-up CVS 
importer was made to use. Either way, somebody who understands SVN 
intimately (and probably perl) would need to work on it. 

That would not be me, so I can't really help ;)

> By default, if I had, say, one pack with the first 1000 revisions, and
> I imported another 1000, running 'git-repack' on its own would leave
> the first pack alone and create a new pack with just the second 1000
> revisions, right?

Yes. It's _probably_ better to do a full re-pack every once in a while 
(because if you have a lot of pack-files, eventually that ends up being 
problematic too), but as a first approximation, it's probably fine to just 
do a plain "git repack" every thousand commits, and then do a full big 
repack at the end.

The big repack will still be pretty expensive, but it should be less 
painful than having everything unpacked. And at least the import won't 
have run with millions and millions of loose objects.

So doing a "git repack -a -d" at the end is a good idea, and _maybe_ it 
could be done in the middle too for really big packs.
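The strategy above can be sketched roughly as follows. This is a hypothetical stand-alone demo, not anything git-svnimport does today; the batch size and file names are made up, and three commits stand in for each "thousand revisions":

```shell
#!/bin/sh
# Demo of the repack strategy: a plain "git repack" leaves existing
# packs alone and only packs the loose objects into a new pack, while
# "git repack -a -d" collapses everything into one pack and deletes
# the now-redundant old ones.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email demo@example.com
git config user.name demo

for batch in 1 2; do
    for n in 1 2 3; do                  # stand-in for ~1000 svn revisions
        echo "$batch.$n" > file
        git add file
        git commit -qm "rev $batch.$n"
    done
    git repack -q                       # incremental: adds one new pack,
done                                    # existing packs are untouched
ls .git/objects/pack/*.pack | wc -l     # one pack per batch, i.e. two

git repack -q -a -d                     # final full repack
ls .git/objects/pack/*.pack | wc -l     # everything in a single pack
```

The point of the final `-a -d` is exactly the one made above: the intermediate packs keep the loose-object count down during the import, and the one big repack at the end gives you a single well-deltified pack.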

Again, doing what fastimport does avoids most of the whole issue, since it 
just generates a pack up-front instead. But that requires the importer to 
specifically understand about that kind of setup.

> This is on a dual-CPU dual-core Opteron, running the AMD64 variant of
> Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The
> pack-file was around 2.3GB.

Ok, that should all be fine. A 31-bit thing in OpenSSL would explain it, 
and doesn't sound unlikely. Just somebody using "int" somewhere, and it 
would never have been triggered by any sane user of SHA1_Update(). The git 
pack-check.c usage really _is_ very odd, even if it happens to make sense 
in that particular scenario.

		Linus