On Sat, 2010-04-17 at 03:01 +0200, Sebastian Bober wrote:
> I'm not dissing fast-import, it's fantastic. We tried with 2-10 level
> deep trees (the best depth being 3), but after some million commits it
> just got unbearably slow, with the ETA constantly rising.

How often are you checkpointing? Like any data import IME, you can't leave transactions running indefinitely and expect good performance!

Would it be at all possible to consider using a submodule for each page, with a super-project commit that is updated for every day of updates or so? This would create a natural partitioning of the data set in a way which is likely to be more useful and efficient to work with.

Hand-held devices could be shipped with a "shallow" clone of the main repository, with shallow clones of the sub-repositories too (in such a setup, the device would of course not keep a checkout, to save space). Then, history for individual pages could be extended as required. The device could "update" the master history, so it would know in summary form which pages have changed. It would then go on to fetch updates for the individual pages the user is watching, or potentially even get them all.

There's an interesting next idea here: device-to-device update bundles.

And another one: distributed update. Instead of writing to a "master" version, the action of editing a wiki page becomes creating a fork, and the editorial process promotes these forks to be the master version in the superproject. Users who have pulled the full repository for a page would be able to see other people's forks, either to get "latest" versions or for editing purposes. This adds not only a distributed update action, but also the ability to have a decent peer review/editorial process without it being arduous.

Without good data set partitioning, I don't see the above workflow being possible.

I was approaching the problem by first trying to back a SQL RDBMS with git, e.g. MySQL or SQLite (Postgres would be nice, but probably much harder), so I first set out by designing a table store. But the representation of the data is not important, just the distributed versioning of it.

Actually this raises the question - what is it that you are trying to achieve with this Wikipedia import?

Sam
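
P.S. For concreteness, a rough sketch of the submodule-per-page layout and
the device-side shallow fetches I'm describing above. The page names, URLs,
depths and batching are made up for illustration, and this is only a sketch
of the idea, not a tested recipe:

  # (fast-import aside: the import stream accepts an explicit "checkpoint"
  #  command, so the importer can flush packs and refs every N commits
  #  instead of holding everything open until the end)

  # server side: a superproject with one submodule per page, batched into
  # (say) one superproject commit per day of updates
  git init wikipedia-super
  cd wikipedia-super
  git submodule add git://example.org/pages/Main_Page.git pages/Main_Page
  git submodule add git://example.org/pages/Git.git pages/Git
  git commit -m "page updates for 2010-04-17"

  # device side: shallow clone of the superproject (its tree is tiny --
  # just gitlinks and .gitmodules), then shallow, checkout-less clones of
  # only the pages the user is watching, to save space
  git clone --depth 1 git://example.org/wikipedia-super.git
  cd wikipedia-super
  git clone --depth 1 --no-checkout git://example.org/pages/Git.git pages/Git

  # later: "update" the master history to learn, in summary form, which
  # pages changed, and fetch or deepen history only for the watched pages
  git fetch --depth 1 origin
  (cd pages/Git && git fetch --depth 20 origin)

  # device-to-device updates could travel as bundles
  (cd pages/Git && git bundle create /tmp/Git-page.bundle master)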