Re: Git import of the recent full enwiki dump

On Sat, 2010-04-17 at 03:58 +0200, Sebastian Bober wrote:
> > Without good data set partitioning I don't think I see the above
> > workflow being possible.  I was approaching the problem by first
> > trying to back a SQL RDBMS with git, e.g. MySQL or SQLite (Postgres
> > would be nice, but probably much harder) - so I first set out by
> > designing a table store.  But the representation of the data is not
> > important, just the distributed versioning of it.
> 
> Yep, we had many ideas about how to partition the data. None of them
> have been tried so far, because we hoped to get it done the "straight"
> way. But that may not be possible.

I just don't think it's a practical aim, or even a useful one.  Who
really wants the complete history of all Wikipedia pages?  Only a very
few: libraries, national archives, and some collectors.

> We have tried checkpointing (even stopping/starting fast-import) every
> 10,000 - 100,000 commits. That does mitigate some speed and memory
> issues of fast-import. But in the end fast-import lost time at every
> restart / checkpoint.

One more thought - fast-import really does work better if you send it
all the versions of a blob in sequence so that it can write out deltas
the first time around.
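
To make that concrete, here is a rough, untested sketch of the kind of
stream I have in mind: one commit per revision, all revisions of a page
emitted back to back, with a periodic checkpoint.  iter_pages() is a
stand-in for whatever ends up parsing the XML dump - I'm assuming it
yields (title, revisions) pairs with revisions oldest-first, timestamps
already converted to epoch seconds, and titles already sanitized into
valid paths:

    import sys

    def emit_stream(pages, out=sys.stdout.buffer, checkpoint_every=10000):
        count = 0
        for title, revisions in pages:
            # All revisions of one page in a row, so fast-import can
            # deltify each blob against the previous one while writing
            # the pack, instead of storing full texts.
            for ts, author, text in revisions:
                blob = text.encode("utf-8")
                msg = f"import {title} @ {ts}".encode("utf-8")
                who = f"committer {author} <{author}@example.invalid> {ts} +0000\n"
                out.write(b"commit refs/heads/master\n")
                out.write(who.encode("utf-8"))
                out.write(b"data %d\n" % len(msg) + msg + b"\n")
                out.write(f"M 644 inline {title}.mediawiki\n".encode("utf-8"))
                out.write(b"data %d\n" % len(blob) + blob + b"\n")
                count += 1
                if count % checkpoint_every == 0:
                    out.write(b"checkpoint\n\n")  # flush packs and refs

That would be piped straight into a fresh repository:

    python stream.py | git fast-import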

Another advantage of per-page partitioning is that the partitions can
checkpoint and gc independently, allowing more of the job to run in
parallel.
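
For instance - sketch only, with the shard count, repository naming and
per-shard stream files all made up - one could hash page titles into N
bare repositories and let a worker pool import and gc them
independently:

    import hashlib
    import subprocess
    from multiprocessing import Pool

    SHARDS = 16

    def shard_of(title):
        # stable page -> shard mapping
        return int(hashlib.sha1(title.encode("utf-8")).hexdigest(), 16) % SHARDS

    def import_shard(i):
        repo = f"shard-{i:02d}.git"
        subprocess.run(["git", "init", "--bare", repo], check=True)
        # stream-<i>.fi is assumed to hold the fast-import stream for
        # the pages that hashed into this shard
        with open(f"stream-{i}.fi", "rb") as stream:
            subprocess.run(["git", "--git-dir", repo, "fast-import"],
                           stdin=stream, check=True)
        subprocess.run(["git", "--git-dir", repo, "gc", "--aggressive"],
                       check=True)

    if __name__ == "__main__":
        with Pool(4) as pool:  # four shards importing/gc'ing at once
            pool.map(import_shard, range(SHARDS))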

> > Actually this raises the question - what is it that you are trying to
> > achieve with this wikipedia import?
> 
> Ultimately, having a distributed Wikipedia. Having the ability to
> fork or branch Wikipedia, to have an inclusionist and an exclusionist
> Wikipedia all in one.

This sounds like far too much fun for me to miss out on; I'm now
downloading enwiki-20100312-pages-meta-history.xml.7z :-) and will
give it a crack!
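
For the parsing side I'll probably start with something like this
(untested, and the export namespace version is a guess - check the
dump's root element): stream the archive through 7z's stdout so the
XML never has to be unpacked to disk, and clear consumed elements so
memory stays flat:

    7z e -so enwiki-20100312-pages-meta-history.xml.7z | python parse.py

with parse.py roughly:

    import sys
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # assumed; verify

    def iter_pages(stream):
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # grab the <mediawiki> root element
        for event, elem in context:
            if event == "end" and elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                revs = [(rev.findtext(NS + "timestamp"),
                         rev.findtext(NS + "contributor/" + NS + "username"),
                         rev.findtext(NS + "text") or "")
                        for rev in elem.iter(NS + "revision")]
                yield title, revs
                root.clear()  # free everything consumed so far

    if __name__ == "__main__":
        for title, revs in iter_pages(sys.stdin.buffer):
            print(title, len(revs))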

Sam

