On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > Without good data set partitioning I don't think I see the above
> > workflow being possible. I was approaching the problem by first
> > trying to back a SQL RDBMS to git, e.g. MySQL or SQLite (postgres
> > would be nice, but probably much harder) - so I first set out by
> > designing a table store. But the representation of the data is not
> > important, just the distributed versioning of it.
>
> Yep, we had many ideas about how to partition the data. None of them
> have been tried so far, because we had hoped to get it done the
> "straight" way. But that may not be possible.

I just don't think that's a practical aim, or even a useful one. Who
really wants the complete history of all Wikipedia pages? Only a very
few - libraries, national archives, and some collectors.

> We have tried checkpointing (even stopping/starting fast-import) every
> 10,000 - 100,000 commits. That does mitigate some of fast-import's
> speed and memory issues, but in the end fast-import lost time at every
> restart / checkpoint.

One more thought - fast-import really does work better if you send it
all the versions of a blob in sequence, so that it can write out deltas
on the first pass (a rough sketch of the stream layout I mean is at the
end of this mail).

Another advantage of per-page partitioning is that each partition can
checkpoint/gc independently, allowing more of the job to run in
parallel.

> > Actually this raises the question - what is it that you are trying
> > to achieve with this Wikipedia import?
>
> Ultimately, having a distributed Wikipedia. Having the possibility to
> fork or branch Wikipedia, to have an inclusionist and an exclusionist
> Wikipedia all in one.

This sounds like far too much fun to miss out on; I'm now downloading
enwiki-20100312-pages-meta-history.xml.7z :-) and will give it a crack!

Sam
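
For what it's worth, a minimal sketch of the stream layout I mean: feed
fast-import every revision of one page back to back (one commit per
revision) and issue a checkpoint once the page is done. The page name,
sample revisions, committer identity and helper names below are made up
for illustration; a real importer would of course read them out of the
XML dump.

#!/usr/bin/env python3
# Sketch: emit a git fast-import stream that presents all revisions of
# one page consecutively, so fast-import can deltify neighbouring blobs
# on the first pass. All names and data here are illustrative only.
import sys

def emit_data(out, payload: bytes):
    # fast-import's exact-byte-count "data" command
    out.write(b"data %d\n" % len(payload))
    out.write(payload)
    out.write(b"\n")

def emit_page_history(out, ref: str, path: str, revisions):
    """revisions: iterable of (timestamp, author, text), oldest first."""
    for ts, author, text in revisions:
        # One commit per revision; fast-import chains commits on the
        # same ref automatically, so no explicit "from" is needed.
        out.write(b"commit %s\n" % ref.encode())
        out.write(b"committer %s <import@example.invalid> %d +0000\n"
                  % (author.encode(), ts))
        emit_data(out, b"import revision of %s" % path.encode())
        out.write(b"M 100644 inline %s\n" % path.encode())
        emit_data(out, text.encode())
    # Flush packs and update refs once per page rather than on a timer.
    out.write(b"checkpoint\n\n")

if __name__ == "__main__":
    out = sys.stdout.buffer
    emit_page_history(out, "refs/heads/master", "Example_page.mediawiki",
                      [(1271462400, "Alice", "first draft"),
                       (1271466000, "Bob", "first draft, expanded")])

Assuming the sketch is saved as gen_stream.py, it can be piped straight
into fast-import from inside a fresh repository:

    git init page.git && cd page.git
    python3 ../gen_stream.py | git fast-import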