Accidentally sent this to just Linus instead of the list...

On 1/3/07, Linus Torvalds <torvalds@xxxxxxxx> wrote:
> > So I'm using git 1.4.1, and I have been experimenting with importing
> > the KDE sources from Subversion using git-svnimport.
>
> As one single _huge_ import? All the sub-projects together? I have to
> say, that sounds pretty horrid.
Unfortunately, that's how the KDE repo is organized. (I tried arguing against this when they were going to do the original import, but I lost the argument.) And git-svnimport doesn't appear to have any sort of method for splitting a gigantic svn repo into several smaller git repos.
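(One workaround I've been mulling over, completely untested: if git-svnimport's -T/-b/-t options can be pointed at a sub-project's own trunk/branches/tags directories, a per-module loop might approximate a split. The module names and repository URL below are just placeholder guesses on my part.)

	# Untested sketch: one git repo per KDE module, assuming each
	# module keeps the usual trunk/branches/tags layout and that
	# the module names and URL here are even right.
	for module in kdelibs kdebase kdepim
	do
		mkdir $module && cd $module
		git-init-db
		git-svnimport -T trunk/KDE/$module \
			      -b branches/KDE -t tags/KDE \
			      svn://anonsvn.kde.org/home/kde
		cd ..
	done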
> > First issue I ran into: On a machine with 4GB of RAM, when I tried to
> > do a full import, git-svnimport died after 309906 revisions, saying
> > that it couldn't fork.
> >
> > Checking `top` and `ps` revealed that there were no git-svnimport
> > processes doing anything, but all of my 4G of RAM was still marked as
> > used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> > to free all the RAM that the svn import had used up.
>
> I think that was just all cached, and all ok. The reason you didn't
> see any git-svnimport was that it had died off already, and all your
> memory was just caches. You could just have left it alone, and the
> kernel would have started re-using the memory for other things even
> without any "drop_caches".
>
> But what you did there didn't make anything worse; it just likely had
> no real impact.
I got the tip about drop_caches from davej. Normally, when a process taking up a huge amount of memory exits, `top` and friends show a bunch of free memory; I was a little surprised when that didn't happen this time.
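In hindsight the memory was visible all along, just not where I was looking - something like this shows it, and the sysctl is only needed if you want the numbers back immediately:

	# Page-cache pages are counted under "cached" rather than "free",
	# and the kernel reclaims them on demand, so no action is needed:
	free -m
	# To force them out anyway (what I did, on davej's suggestion):
	sync
	sysctl -w vm.drop_caches=3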
> However, it does sound like git-svnimport probably acts like
> git-cvsimport used to, and just keeps too much in memory - so it's
> never going to act really nicely.. It also looks like git-svnimport
> never repacks the repo, which is absolutely horrible for performance
> on all levels. The CVS importer repacks every one thousand commits or
> something like that.
Yeah. I haven't bothered hacking git-svnimport yet - but it looks like having it automatically repack every thousand revisions or so would probably be a pretty big win.
> > Now, after that, I tried doing `git-repack -a` because I wanted to see
> > how small the packed archive would be (before trying to continue
> > importing the rest of the revisions. There are at least another 100k
> > revisions that I should be able to import, eventually.)
>
> I suspect you'd have been better off just re-starting, and using
> something like
>
>	while :
>	do
>		git svnimport -l 1000 <...>
>		.. figure out some way to decide if it's all done ..
>		git repack -d
>	done
>
> which would make svnimport act a bit more sanely, and repack
> incrementally. That should make both the import much faster, _and_
> avoid any insane big repack at the end (well, you'd still want to do
> a "git repack -a -d" at the end to turn the many smaller packs into a
> bigger one, but it would be nicer).
>
> However, I don't know what the proper magic is for svnimport to do
> that sane "do it in chunks and tell when you're all done". Or even
> better - to just make it repack properly and not keep everything in
> memory.
You can pass limits to svnimport to give it a revision to start at and another one to end at, so that wouldn't be too bad - I was thinking about working around it like that (so that I don't have to go poking around in the Perl code behind the svn importer).

By default, if I had, say, one pack with the first 1000 revisions, and I imported another 1000, running 'git-repack' on its own would leave the first pack alone and create a new pack with just the second 1000 revisions, right?
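Concretely, I was picturing something like the loop below (untested - it assumes -l sets the maximum revision to pull, that a re-run resumes after the last imported revision, and the revision count and URL are placeholders):

	# Untested sketch of the chunked import.
	rev=1000
	while [ $rev -le 310000 ]
	do
		git-svnimport -l $rev svn://anonsvn.kde.org/home/kde || break
		git-repack -d	# pack only the new loose objects
		rev=$(($rev + 1000))
	done
	git-repack -a -d	# final pass: collapse the small packs into one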
> > The repack finished after about nine hours, but when I try to do a
> > git-verify-pack on it, it dies with this error message:
> >
> > error: Packfile
> > .git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
> > SHA1 mismatch with itself
>
> That sounds suspiciously like the bug we had in our POWER sha1
> implementation that would generate the wrong SHA1 for any pack-file
> that was over 512MB in size, due to an overflow in 32 bits (SHA1 does
> some counting in _bits_, so 512MB is 4G _bits_).
>
> Now, I assume you're not on POWER (and we fixed that bug anyway - and
> I think long before 1.4.1 too), but I could easily imagine the same
> bug in some other SHA1 implementation (or perhaps _another_ overflow
> at the 1GB or 2GB mark..). I assume that the pack-file you had was
> something horrid..
>
> I hope this is with a 64-bit kernel and a 64-bit user space? That
> should limit _some_ of the issues. But I would still not be surprised
> if your SHA1 libraries had some 32-bit ("unsigned int") or 31-bit
> ("int") limits in them somewhere - very few people do SHA1's over
> huge areas, and even when you do SHA1 on something like a DVD image
> (which is easily over any 4GB limit), that tends to be done as many
> smaller calls to the SHA1 library routines.
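(Checking the arithmetic on that 512MB boundary for my own benefit: SHA1 counts the message length in bits, and 512MB in bits is exactly where a 32-bit counter wraps.)

	# 512MB expressed in bits is exactly 2^32, so a 32-bit bit
	# counter wraps to zero right there (shell arithmetic is 64-bit):
	echo $(( 512 * 1024 * 1024 * 8 ))			# 4294967296, i.e. 2^32
	echo $(( (512 * 1024 * 1024 * 8) & 0xffffffff ))	# 0 - wrapped around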
This is on a dual-CPU dual-core Opteron, running the AMD64 variant of Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The pack-file was around 2.3GB.