Sebastian Bober <sbober@xxxxxxxxxxxxx> wrote:
> The question would be how the commits and the trees are laid out.
> If every wiki revision is to be a git commit, then we'd need to handle
> 300M commits. And we have 19M wiki pages (which would be files). The
> tree objects would be very large and git-fast-import would crawl.
>
> Some tests with the German Wikipedia have shown that importing the
> blobs is doable on normal hardware. Getting the trees and commits into
> git has not been possible up to now, as fast-import was just too slow
> (and got slower after 1M commits).

Well, to be fair to fast-import, its tree handling code is based on
linear scans, because that is how every other part of Git handles
trees. If you just toss all 19M wiki pages into a single top-level
tree, it is going to take a very long time to locate the wiki page
talking about Zoos.

> I had the idea of writing an importer that would handle just this
> special case (one file change per commit), but I haven't gotten
> around to trying it yet.

Really, fast-import should be able to handle this well, assuming you
aren't just tossing all 19M files into a single massive directory and
hoping for the best, because *any* program working on that sort of
layout will need to spit out the 19M-entry tree object on each and
every commit, just so it can compute the SHA-1 checksum to get the
tree name for the commit.

--
Shawn.
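To put rough numbers on that: a tree entry is the mode, a space, the
file name, a NUL, and a 20-byte binary SHA-1, so with page titles
averaging a couple dozen bytes a single 19M-entry tree serializes to
roughly a gigabyte that would have to be regenerated and hashed for
every one of the 300M commits. Fanning the pages out over a couple of
levels of hash-named subdirectories keeps every individual tree small.
The sketch below shows one way to emit such a stream for fast-import,
one file change per commit; the revision reader, the committer
address, and the two-level fan-out depth are illustrative placeholders,
not anything from the actual Wikipedia dump tooling:

#!/usr/bin/env python3
import hashlib
import sys

def fanout_path(title):
    # Two levels of 256-way fan-out keyed on a hash of the page title:
    # ~19M pages / 65536 buckets is only a few hundred entries per
    # leaf tree, so no single tree object ever gets huge.
    h = hashlib.sha1(title.encode("utf-8")).hexdigest()
    return "%s/%s/%s" % (h[0:2], h[2:4], title.replace("/", "_"))

def data(payload):
    # A fast-import 'data' command: byte count, newline, raw bytes.
    return b"data %d\n" % len(payload) + payload

def emit_commit(out, title, timestamp, author, text):
    # One commit touching exactly one file, matching the
    # one-change-per-revision model discussed above.
    path = fanout_path(title).encode("utf-8")
    msg = ("update %s\n" % title).encode("utf-8")
    blob = text.encode("utf-8")
    out.write(b"commit refs/heads/master\n")
    out.write(b"committer %s <wiki@example.invalid> %d +0000\n"
              % (author.encode("utf-8"), timestamp))
    out.write(data(msg))
    out.write(b"M 100644 inline %s\n" % path)
    out.write(data(blob))
    out.write(b"\n")

def read_revisions():
    # Placeholder for a real dump reader; yields
    # (title, unix_timestamp, author, wikitext) tuples in history order.
    yield ("Zoo", 1234567890, "Example Editor", "A zoo is a place...\n")

def main():
    out = sys.stdout.buffer
    for title, ts, author, text in read_revisions():
        emit_commit(out, title, ts, author, text)
    out.flush()

if __name__ == "__main__":
    main()

Piped into "git fast-import" inside a fresh repository, that layout
keeps every directory at a few hundred entries, so the linear scans
stay cheap no matter how many commits go through.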