Re: Git import of the recent full enwiki dump

Heya,

[-wikitech-l; if they should be kept on the cc, please re-add. I assume
that the discussion of the git aspects is not relevant to that list]

On Sat, Apr 17, 2010 at 01:47, Richard Hartmann
<richih.mailinglist@xxxxxxxxx> wrote:
> This data set is probably the largest set of changes on earth, so
> it's highly interesting to see what git will make of it.

I think that git might actually be able to handle it. Git is known
not to handle _large files_ very well, but a long history or a large
number of files is a different matter. Assuming you do the import
incrementally using something like git-fast-import (feeding it with a
custom exporter that uses the dump as its input), you shouldn't even
need an extraordinary machine to do it (although you'd need a lot of
storage).
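
To make the exporter idea concrete, here is a minimal sketch of what
such a script could write for each article revision. The function
name, its parameters and the sample values are made up for
illustration; parsing the dump itself is left out. It assumes that
paths contain no newline or leading double quote (git-fast-import
would want those quoted) and that author names contain no '<', '>' or
newline.

```python
import io
import sys


def emit_revision(out, path, text, author, email, timestamp, message,
                  ref="refs/heads/master"):
    """Write one fast-import 'commit' command for a single revision.

    `out` is a binary stream. `timestamp` is seconds since the epoch,
    written in fast-import's default "raw" date format with a +0000
    offset. No 'from' line is written, so fast-import chains each
    commit onto the current tip of `ref` within the same run.
    """
    body = text.encode("utf-8")
    msg = message.encode("utf-8")
    out.write(f"commit {ref}\n".encode("utf-8"))
    out.write(f"committer {author} <{email}> {timestamp} +0000\n".encode("utf-8"))
    # 'data' lengths are byte counts, not character counts.
    out.write(b"data %d\n" % len(msg))
    out.write(msg + b"\n")
    # 'inline' means the file content follows directly as a data block.
    out.write(f"M 100644 inline {path}\n".encode("utf-8"))
    out.write(b"data %d\n" % len(body))
    out.write(body + b"\n")


if __name__ == "__main__":
    # Hypothetical sample revisions; a real exporter would stream
    # these out of the dump instead.
    revisions = [
        ("aa/Aardvark", "First text\n", "Example", "ex@example.org",
         1271461620, "Aardvark: revision 1"),
        ("aa/Aardvark", "Second text\n", "Example", "ex@example.org",
         1271461680, "Aardvark: revision 2"),
    ]
    for rev in revisions:
        emit_revision(sys.stdout.buffer, *rev)
```

Saved as, say, exporter.py, the output can be piped into
'git fast-import' inside an empty repository. Since the revisions are
streamed one at a time, neither the script nor git needs a worktree.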

> As of right now, I am trying to import on my local machine, but
> my first, rough, projections tell me my machine will melt down at
> some point ;)

How are you importing? Did you script a process that does something
like 'move next revision of file in place && git add . && git commit'?
I don't know how well that would work, since I reckon the worktree will
be huge. Speaking of which, it might make sense to split the worktree
by prefix, so that articles starting with "aa" go under the "aa"
directory, etc.?
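
A rough sketch of that prefix split, in case it helps. The function
name and the details are assumptions on my part: a two-character
lowercase prefix, '_' standing in for anything that is not an ASCII
letter or digit, and percent-encoding so that titles containing '/'
or spaces still map to a single file name.

```python
from urllib.parse import quote


def sharded_path(title, width=2):
    """Map an article title to '<prefix>/<encoded title>'."""
    # Percent-encode everything unsafe in a file name, including '/'.
    name = quote(title, safe="")
    # Build the directory prefix from the first `width` characters.
    prefix = "".join(
        c if c.isascii() and c.isalnum() else "_"
        for c in title[:width].lower()
    )
    # Pad short titles so every prefix has the same length.
    prefix = prefix.ljust(width, "_")
    return f"{prefix}/{name}"


if __name__ == "__main__":
    for t in ["Aardvark", "AC/DC", "C"]:
        print(sharded_path(t))
```

This keeps any single directory from holding millions of entries; a
larger width or a second level of directories would spread the
articles out further.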

Anyway, other gits might have more interesting things to say. Cc-ed is
Avery, who has been working on a tool to back up entire hard drives in
git. Also cc-ed are Nico and Shawn, who both have a lot of experience
with the object backend and the pack implementation, and Sam, who has
worked on importing the entire Perl history into git. I'm not sure how
big that is, but I think it has a lot of changesets too. A bunch of
people have also worked on importing projects like KDE into git and
might have interesting things to add, but I don't know who they are.

Hope that helps, and if you do convert it (and it turns out to be
usable, and you decide to keep it up to date somehow), put it up
somewhere! :)

-- 
Cheers,

Sverre Rabbelier