Re: Git import of the recent full enwiki dump

On Sat, Apr 17, 2010 at 03:34:52PM +1200, Sam Vilain wrote:
> On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > > Without good data set partitioning I don't think I see the above
> > > workflow being as possible.  I was approaching the problem by first
> > > trying to back a SQL RDBMS to git, eg MySQL or SQLite (postgres would be
> > > nice, but probably much harder) - so I first set out by designing a
> > > table store.  But the representation of the data is not important, just
> > > the distributed version of it.
> > 
> > Yep, we had many ideas about how to partition the data. None of that has
> > been tried so far, because we hoped to get it done the "straight" way.
> > But that may not be possible.
> 
> I just don't think it's a practical aim or even useful.  Who really
> wants the complete history of all wikipedia pages?  Only a very few -
> libraries, national archives, and some collectors.

Heh, exactly. And I just want to see if it can be done.

> > We have tried checkpointing (even stopping/starting fast-import) every
> > 10,000 - 100,000 commits. That does mitigate some speed and memory
> > issues of fast-import. But in the end fast-import lost time at every
> > restart / checkpoint.
> 
> One more thought - fast-import really does work better if you send it
> all the versions of a blob in sequence so that it can write out deltas
> the first time around.

This is already done that way.
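
To illustrate the ordering (this is only a rough Python sketch, not the
levitation-perl code; the page/revision layout, file naming, marks and the
checkpoint interval are all made up for the example):

  #!/usr/bin/env python3
  # Sketch: emit a git fast-import stream in which every version of a page's
  # blob is written back to back, before the commits that reference it, so
  # fast-import can deltify consecutive versions on the first pass.  A
  # "checkpoint" command is emitted every N commits.
  import sys

  def emit(out, s):
      out.write(s.encode("utf-8") if isinstance(s, str) else s)

  def data(out, payload):
      # fast-import "data" command: exact byte count, then the raw bytes
      emit(out, f"data {len(payload)}\n")
      emit(out, payload)
      emit(out, "\n")

  def stream_pages(pages, out=sys.stdout.buffer, checkpoint_every=10000):
      """pages: iterable of (title, [(epoch_ts, author, text), ...]),
      revisions sorted oldest first."""
      mark = 0
      last_commit = None
      commits = 0
      for title, revisions in pages:
          # 1. all blob versions of this page, consecutively
          blob_marks = []
          for _ts, _who, text in revisions:
              mark += 1
              emit(out, f"blob\nmark :{mark}\n")
              data(out, text.encode("utf-8"))
              blob_marks.append(mark)
          # 2. one commit per revision, referencing the marks above
          path = title.replace(" ", "_") + ".mediawiki"   # naive naming
          for (ts, author, _), bm in zip(revisions, blob_marks):
              mark += 1
              emit(out, f"commit refs/heads/master\nmark :{mark}\n")
              # ts must be epoch seconds for the default raw date format
              emit(out, f"committer {author} <{author}@example.invalid> {ts} +0000\n")
              data(out, f"{title}: revision at {ts}".encode("utf-8"))
              if last_commit is not None:
                  emit(out, f"from :{last_commit}\n")
              emit(out, f"M 100644 :{bm} {path}\n\n")
              last_commit = mark
              commits += 1
              if commits % checkpoint_every == 0:
                  emit(out, "checkpoint\n\n")

  if __name__ == "__main__":
      demo = [("Example page", [("1000000000", "alice", "first text\n"),
                                ("1000000600", "bob", "second text\n")])]
      stream_pages(demo)

Something like "git init repo && cd repo && python3 sketch.py | git fast-import"
is the rough setup; the real page/revision tuples of course come from parsing
the dump.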

> Another advantage of the per-page partitioning is that they can
> checkpoint/gc independently, allowing for more parallelization of the
> job.
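
For illustration, one possible shape of such per-page partitioning (again
only a sketch with an invented layout, not how levitation-perl does it):
pages hashed into a fixed number of shard repositories, one fast-import
process per shard and one branch per page, so the shards can be packed and
gc'd independently and imported in parallel:

  #!/usr/bin/env python3
  # Sketch: hash pages into N shard repositories, each fed by its own
  # "git fast-import" process; inside a shard every page lives on its own
  # branch.  Shard count, naming and file layout are invented here.
  import hashlib
  import subprocess
  from pathlib import Path

  NUM_SHARDS = 16

  def shard_of(title):
      return int(hashlib.sha1(title.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

  def page_stream(title, revisions):
      """fast-import commands for one page: one commit per revision on the
      page's own branch, blob contents given inline (no marks needed)."""
      ref = "refs/heads/pages/" + title.replace(" ", "_")
      out = []
      for ts, author, text in revisions:          # oldest revision first
          body = text.encode("utf-8")
          msg = f"{title}: revision at {ts}".encode("utf-8")
          out.append(
              (f"commit {ref}\n"
               f"committer {author} <{author}@example.invalid> {ts} +0000\n"
               f"data {len(msg)}\n").encode("utf-8") + msg + b"\n"
              + f"M 100644 inline page.mediawiki\ndata {len(body)}\n".encode("utf-8")
              + body + b"\n\n")
      return b"".join(out)

  def import_sharded(pages, base=Path("shards")):
      base.mkdir(exist_ok=True)
      procs = {}
      for title, revisions in pages:
          n = shard_of(title)
          if n not in procs:
              repo = base / f"shard-{n:02d}.git"
              subprocess.run(["git", "init", "-q", "--bare", str(repo)], check=True)
              procs[n] = subprocess.Popen(
                  ["git", "--git-dir", str(repo), "fast-import", "--quiet"],
                  stdin=subprocess.PIPE)
          procs[n].stdin.write(page_stream(title, revisions))
      for p in procs.values():
          p.stdin.close()
          p.wait()

  if __name__ == "__main__":
      import_sharded([("Example page",
                       [("1000000000", "alice", "first text\n"),
                        ("1000000600", "bob", "second text\n")])])
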
> 
> > > Actually this raises the question - what is it that you are trying to
> > > achieve with this wikipedia import?
> > 
> > Ultimately, having a distributed Wikipedia. Having the possibility to
> > fork or branch Wikipedia, to have an inclusionist and exclusionist
> > Wikipedia all in one.
> 
> This sounds like far too much fun for me to miss out on, now downloading
> enwiki-20100312-pages-meta-history.xml.7z :-) and I will give this a
> crack!


Please have a look at a smaller wiki for testing. The project at

  git://github.com/sbober/levitation-perl.git

provides, in its branches, several ways to parse the XML and to generate
the fast-import input.
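
If you want to see the shape of the data before diving into the branches,
something like this (a rough Python sketch, not one of the project's
parsers; element handling is simplified and the dump name is just an
example) streams a pages-meta-history dump into per-page revision lists:

  #!/usr/bin/env python3
  # Sketch: stream a MediaWiki pages-meta-history XML dump into
  # (title, revisions) tuples without loading the whole file into memory.
  import xml.etree.ElementTree as ET

  def _local(tag):            # drop the export-schema namespace
      return tag.rsplit("}", 1)[-1]

  def pages(path):
      """Yield (title, [(timestamp, contributor, text), ...]) per <page>."""
      title, revisions = None, []
      for _event, elem in ET.iterparse(path, events=("end",)):
          name = _local(elem.tag)
          if name == "title":
              title = elem.text or ""
          elif name == "revision":
              ts = text = ""
              who = "unknown"
              for child in elem:
                  cname = _local(child.tag)
                  if cname == "timestamp":
                      ts = child.text or ""
                  elif cname == "text":
                      text = child.text or ""
                  elif cname == "contributor":
                      for c in child:
                          if _local(c.tag) in ("username", "ip"):
                              who = c.text or who
              revisions.append((ts, who, text))
              elem.clear()                     # free the revision text early
          elif name == "page":
              yield title, revisions
              title, revisions = None, []
              elem.clear()

  if __name__ == "__main__":
      for t, revs in pages("simplewiki-pages-meta-history.xml"):
          print(t, len(revs))

It assumes an already decompressed dump (the files ship as .7z/.bz2), and
the ISO timestamps would still need converting to epoch seconds before
feeding a fast-import stream.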


bye,
  Sebastian

