On Fri, Jun 03, 2011 at 12:37:04AM +0200, Matthieu Moy wrote:

> The idea is that we ultimately want to be able to import a subset of a
> large wiki. In Wikipedia, for example, "show me revisions since N" will
> be very large after a few minutes. OTOH, "show me revisions touching the
> few pages I'm following" should be fast. And at least, it's O(imported
> wiki size), not O(complete wiki size)

Yeah, I think what you want to do depends on the wiki size. For a small wiki, it doesn't matter; all pages is not much data. For a large wiki, you want a subset of the pages, and you _never_ want to do any operations on the whole page space. In the middle are medium-sized wikis, where you would like to look at the whole page space, but ideally not in O(number of pages) time.

But the point is somewhat moot. Having just read through the MediaWiki API, I've come to the conclusion (which seems familiar from the last time I looked at this problem) that there is no way to ask for what I want in a single query. That is, to say "show me all revisions of all pages matching some subset X that have been modified since revision N". Or even "show me all pages matching some subset X that have been modified since revision N", so that we could at least cull the pages that haven't been touched. AFAICT, neither of those is possible.

I think we are stuck asking for each page's information individually. You can query multiple pages' revision information simultaneously, but in that case you can get only a single revision from each page; there's not even a way to say "get me the latest revision number for each of these pages".

One thing we could do to reduce the total run-time is to issue several queries in parallel, so that the query latency isn't so prevalent. I don't know what a good level of parallelism is for a server like Wikipedia's, though. I'm sure they don't appreciate users hammering the servers too hard.
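For concreteness, here is a rough sketch (in Python, not the project's actual Perl code) of the per-page querying I mean. The wiki endpoint, the `rvprop` field list, and the worker count are all placeholders I picked for illustration; only the general shape of the `prop=revisions` query is from the API docs:

```python
# Sketch only: endpoint, field list, and worker count are assumptions
# for illustration, not anything the project has settled on.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
import json
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # assumed target wiki

def revision_query(title, since_revid, limit=50):
    """Parameters for one page's revisions, oldest first, from a known revid."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvdir": "newer",           # enumerate oldest-to-newest
        "rvstartid": since_revid,   # resume where the last import stopped
        "rvprop": "ids|timestamp|user|content",
        "rvlimit": limit,
    }

def query_url(params):
    return API + "?" + urlencode(params)

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fetch_all(titles, since_revid, workers=3):
    """One query per tracked page; 'workers' is a guess at a polite level
    of parallelism, since we don't know what the server tolerates."""
    urls = [query_url(revision_query(t, since_revid)) for t in titles]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Note that this still issues one round-trip per page, which is exactly the O(imported wiki size) cost discussed above; the thread pool only hides some of the latency.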
Ideally you want just enough queries outstanding that the remote server is always working on _one_, and the rest are doing something else (traveling across the network, local processing and storage, etc.). But I'm not sure of a good way to measure that.

> but let's not be too ambitious for now: it's a student's project,
> completing one week from now, and the goal is to have something clean
> and extensible. Bells and whistles will come later ;-).

Yes. I think all of this is outside the scope of a student project. I just like to dream. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html