On 5/26/06, Linus Torvalds <torvalds@xxxxxxxx> wrote:
I'm doing it too, just for fun.
Well, it's good to not be so alone in our definition of fun ;-)
Of course, since I'm doing this on a machine that basically has a laptop disk, the "just for fun" part is a bit sad. It's waiting for disk about 25% of the time ;/
Ouch.
And it's slow as hell. I really wish we could do better on the CVS import front.
Me too. However, I don't think the perl part is so costly any more; it's down to waiting on IO. git-write-tree is also prominently there. It takes a lot of memory on some writes -- I had thought it'd be cheaper since it handles one tree object at a time... I also have a trivial patch, which I haven't posted yet, that runs cvsps to a tempfile and then reads the file back. Serialising the tasks means that we don't carry cvsps' memory footprint around during the import itself. ...
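The serialising idea can be sketched with a stand-in producer (seq standing in for cvsps here, since cvsps itself isn't assumed to be installed, and the import step is just a read of the file):

```shell
# Producer-to-tempfile instead of a pipe: the producer runs, writes
# its output, and exits -- freeing its memory -- before the consumer
# starts. With "producer | consumer" both are live at once.
tmp=$(mktemp)
seq 1 5 > "$tmp"           # stand-in for "cvsps ... > tmpfile"
lines=$(wc -l < "$tmp")    # stand-in for the import reading the file
rm -f "$tmp"
echo "$lines"
```

The trade-off is one extra pass over the disk in exchange for never holding both processes' working sets in memory at the same time.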
It's "git-rev-list --objects" that is the memory sucker for me, the packing itself doesn't seem to be too bad.
No, you're right, it's git-rev-list that gets called during the repack. But it was pushing everything it could to swap. Once it didn't fit in memory, it hit a brick wall :(
The biggest cost seems to be git-write-tree, which is about 0.225 seconds for me on that tree on that machine. Which _should_ mean that we could do 4 commits a second, but that sure as hell ain't how it works out. It seems to do about 1.71 commits a second for me on that tree, which is pretty damn pitiful. Some cvs overhead, and probably some other git overhead too.
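Working those numbers through: 0.225 s per git-write-tree caps throughput at 1/0.225 ≈ 4.4 commits a second, against 1.71 observed, so write-tree accounts for well under half the per-commit wall time:

```shell
# Ceiling from write-tree cost alone vs. the observed commit rate.
awk 'BEGIN { printf "%.2f ceiling vs %.2f observed\n", 1/0.225, 1.71 }'
```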
Well, we _have_ to fetch the file. I guess you are thinking of extracting it directly from the RCS ,v file? One thing that I found seemed to speed things up a bit was to declare TMPDIR to be a directory on the same partition.
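A minimal sketch of the TMPDIR trick -- the repo-tmp path is made up for illustration. Keeping temporary files on the repository's partition means they can be rename()d into place rather than copied across devices:

```shell
# Point TMPDIR at a directory on the same filesystem as the repo;
# mktemp (and most tools) honour it.
mkdir -p "$PWD/repo-tmp"        # hypothetical path next to the repo
export TMPDIR="$PWD/repo-tmp"
t=$(mktemp)                     # now lands on the repo's partition
case "$t" in
  "$TMPDIR"/*) echo "tmp on repo partition" ;;
esac
rm -rf "$TMPDIR"
```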
(That's a 2GHz Merom, so the fact that you get ~6k commits per hour on your 2GHz Opteron means we're at about the same speed. I suspect you're also at least partly limited by disk; our numbers seem to match pretty well.)
Yup. This is _very_ diskbound.
200k commits at 6k commits per hour is about a day and a half (plus the occasional packing load). Taking that long to import a CVS archive is horrible. But I guess it _is_ several years of work, and you really only have to do it once. Still.
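The back-of-the-envelope behind "about a day and a half":

```shell
# 200,000 commits at ~6,000 commits/hour.
awk 'BEGIN { h = 200000/6000; printf "%.0f hours, ~%.1f days\n", h, h/24 }'
```

And that's before the periodic repacks, which add their own IO load on top.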
And it's a huge CVS archive too.

martin