Re: Git import of the recent full enwiki dump

On Fri, Apr 16, 2010 at 05:53:42PM -0700, Shawn O. Pearce wrote:
> Sebastian Bober <sbober@xxxxxxxxxxxxx> wrote:
> > The question would be, how the commits and the trees are laid out.
> > If every wiki revision shall be a git commit, then we'd need to handle
> > 300M commits. And we have 19M wiki pages (that would be files). The tree
> > objects would be very large and git-fast-import would crawl.
> > 
> > Some tests with the german wikipedia have shown that importing the blobs
> > is doable on normal hardware. Getting the trees and commits into git
> > was not possible up to now, as fast-import was just too slow (and getting
> > slower after 1M commits).
> 
> Well, to be fair to fast-import, its tree handling code is linear
> scan based, because that's how any other part of Git handles trees.
> 
> If you just toss all 19M wiki pages into a single top level tree,
> that's going to take a very long time to locate the wiki page
> talking about Zoos.
> 

I'm not dissing fast-import, it's fantastic. We tried trees 2 to 10 levels
deep (depth 3 worked best), but after a few million commits it just got
unbearably slow, with the ETA constantly rising.

The time went into creating the tree objects and computing their SHA-1
checksums.

> > I had the idea of having an importer that would just handle this special
> > case (1 file change per commit), but didn't get around to trying that yet.
> 
> Really, fast-import should be able to handle this well, assuming you
> aren't just tossing all 19M files into a single massive directory
> and hoping for the best.  Because *any* program working on that
> sort of layout will need to spit out the 19M entry tree object on
> each and every commit, just so it can compute the SHA-1 checksum
> to get the tree name for the commit.
> 
> -- 
> Shawn.
> 
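
For reference, one possible way to avoid the single massive directory is a
hash-based fan-out, similar in spirit to the 3-level trees mentioned above.
This is only a sketch under assumptions: sharding by the SHA-1 of the page
title, two hex characters per level, and the name sharded_path are all
illustrative choices, not the layout actually used in the tests.

```python
import hashlib

def sharded_path(title: str, depth: int = 3, width: int = 2) -> str:
    """Map a wiki page title to a path nested `depth` directories deep.

    Directory names are taken from the hex SHA-1 of the title, so pages
    spread evenly and each tree stays small (256 subtrees per level when
    width is 2).
    """
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    # Assumption: "/" in titles is escaped so it cannot add path levels.
    safe_name = title.replace("/", "%2F")
    return "/".join(parts + [safe_name])

print(sharded_path("Zoo"))
```

With depth 3 and width 2 there are about 16.7M leaf directories, so 19M
pages average roughly one page per leaf; a smaller depth or width trades
fewer tree levels for larger trees.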