On Sat, Apr 17, 2010 at 03:10:56AM +0200, Richard Hartmann wrote:
> On Sat, Apr 17, 2010 at 02:19, Sverre Rabbelier <srabbelier@xxxxxxxxx> wrote:
>
> > Assuming you do the import incrementally
> > using something like git-fast-import (feeding it with a custom
> > exporter that uses the dump as its input) you shouldn't even need an
> > extraordinary machine to do it (although you'd need a lot of storage).
>
> I am using a Python script [1] to import the XML dump.

There is also a version available at (plug):

  git://github.com/sbober/levitation-perl.git

It is a bit faster and consumes less memory (and is written in Perl).
But it, too, cannot handle enwiki at the moment.

> > Speaking of which, it might make sense to separate the
> > worktree by prefix, so articles starting with "aa" go under the "aa"
> > directory, etc?
>
> Very good idea. What command would I need to send to
> git-fast-import to do that?

levitation does that already.

> > Hope that helps, and if you do convert it (and it turns out to be
> > usable, and you decide to keep it up to date somehow), put it up
> > somewhere! :)
>
> It did.
> I will make it available if it turns out to be useful. Keeping it up to
> date might be harder unless they keep on releasing new
> (incremental) snapshots.

If desired, I could produce input files for git-fast-import for a
larger wiki (like the German or Japanese Wikipedia), so that other
people can have a look at the performance.

bye,
Sebastian
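
P.S.: To give an idea of what such an exporter feeds to git-fast-import,
here is a minimal sketch (not levitation's actual code) of the stream
format, including the two-character prefix split discussed above.
iter_revisions(), the sample tuple, and the users.invalid committer
addresses are placeholders I made up; a real exporter would stream
(title, text, author, timestamp) tuples out of the XML dump and would
have to sanitize titles and author names properly.

#!/usr/bin/env python3
import sys

def shard(title):
    """Map "Aardvark" to "aa/Aardvark.mediawiki" so the worktree is
    split into two-character prefix directories."""
    safe = title.replace('/', '%2F')   # '/' would act as a path separator
    return '%s/%s.mediawiki' % (safe[:2].lower(), safe)

def data(blob):
    """A fast-import 'data' command: exact byte count, then raw bytes."""
    raw = blob.encode('utf-8')
    return b'data %d\n' % len(raw) + raw + b'\n'

def export(revisions, out=sys.stdout.buffer):
    # One commit per revision on a single branch; fast-import stacks
    # successive commits to the same branch on top of each other.
    for n, (title, text, author, ts) in enumerate(revisions, 1):
        out.write(b'commit refs/heads/master\n')
        out.write(b'mark :%d\n' % n)
        out.write(('committer %s <%s@users.invalid> %d +0000\n'
                   % (author, author, ts)).encode('utf-8'))
        out.write(data('update %s' % title))           # commit message
        out.write(('M 100644 inline %s\n' % shard(title)).encode('utf-8'))
        out.write(data(text))                          # revision text

if __name__ == '__main__':
    # Stand-in for a real dump parser.
    sample = [('Aardvark', 'The aardvark is a mammal.\n',
               'example', 1271462400)]
    export(sample)

Piped into a fresh repository ("git init wiki && cd wiki &&
export.py | git fast-import") this creates one commit per revision.
The 'inline' form embeds each revision's text directly in the commit
that introduces it, so the whole stream can be generated in a single
pass over the dump.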