On Fri, Jun 03, 2011 at 12:37:04AM +0200, Matthieu Moy wrote:

> The idea is that we ultimately want to be able to import a subset of a
> large wiki. In Wikipedia, for example, "show me revisions since N" will
> be very large after a few minutes. OTOH, "show me revisions touching the
> few pages I'm following" should be fast. And at least, it's O(imported
> wiki size), not O(complete wiki size)

Yeah, I think what you want to do depends on the wiki size. For a small wiki, it doesn't matter; all pages is not much data. For a large wiki, you want a subset of the pages, and you _never_ want to do any operations on the whole page space. In the middle are medium-sized wikis, where you would like to look at the whole page space, but ideally not in O(number of pages) time.

But the point is somewhat moot. Having just read through the MediaWiki API, I've come to the conclusion (which seems familiar from the last time I looked at this problem) that there is no way to ask for what I want in a single query. That is, to say "show me all revisions of all pages matching some subset X that have been modified since revision N". Or even "show me all pages matching some subset X that have been modified since revision N", so that we could at least cull the pages that haven't been touched. AFAICT, neither of those is possible.

I think we are stuck asking for each page's information individually. You can query multiple pages' revision information simultaneously, but in that case you can get only a single revision from each page; there's not even a way to say "get me the latest revision number for each of these pages".

One thing we could do to reduce the total run-time is to issue several queries in parallel, so that the query latency isn't so prevalent. I don't know what a good level of parallelism is for a server like Wikipedia's, though. I'm sure they don't appreciate users hammering the servers too hard.
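For concreteness, here is a rough sketch (in Python, not the project's actual Perl code) of the per-page querying I mean. The wiki endpoint, the `rvprop` field list, and the worker count are all placeholders I picked for illustration; only the general shape of the `prop=revisions` query is from the API docs:

```python
# Sketch only: endpoint, field list, and worker count are assumptions
# for illustration, not anything the project has settled on.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
import json
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # assumed target wiki

def revision_query(title, since_revid, limit=50):
    """Parameters for one page's revisions, oldest first, from a known revid."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvdir": "newer",           # enumerate oldest-to-newest
        "rvstartid": since_revid,   # resume where the last import stopped
        "rvprop": "ids|timestamp|user|content",
        "rvlimit": limit,
    }

def query_url(params):
    return API + "?" + urlencode(params)

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fetch_all(titles, since_revid, workers=3):
    """One query per tracked page; 'workers' is a guess at a polite level
    of parallelism, since we don't know what the server tolerates."""
    urls = [query_url(revision_query(t, since_revid)) for t in titles]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Note that this still issues one round-trip per page, which is exactly the O(imported wiki size) cost discussed above; the thread pool only hides some of the latency.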
Ideally you want just enough queries outstanding that the remote server is always working on _one_, and the rest are doing something else (traveling across the network, local processing and storage, etc.). But I'm not sure of a good way to measure that.

> but let's not be too ambitious for now: it's a student's project,
> completing one week from now, and the goal is to have something clean
> and extensible. Bells and whistles will come later ;-).

Yes. I think all of this is outside the scope of a student project. I just like to dream. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html