On Thu, Jun 09, 2011 at 03:15:59PM +0200, Jeremie Nikaes wrote:

> For now, the whole wiki is cloned, but it will be possible to clone only
> some pages: the clone is based on a list of pages which is now all
> pages.

This is not really true anymore, is it? Later you say:

> Partial cloning is supported using the following syntax :
> "git clone mediawiki::http://wikiurl##A_Page##Another_Page"
> As always, this url is kept in .git/config, helping to always keep
> track of these specific pages

so I think it is not "will be" possible anymore.

> +sub get_pages{
> +    my $mediawiki = MediaWiki::API->new;
> +    $mediawiki->{config}->{api_url} = "$url/api.php";
> [...]
> +    } else {
> +        #the list of titles should follow the pattern 'page1|page2|...'
> +        my $titles = "";
> +        foreach my $title (@pages_titles){
> +            $titles.="$title|";
> +        }
> +        #supress the last | that is add in the foreach
> +        chop($titles);

This is usually spelled:

  my $titles = join('|', @pages_titles);

> +        $pages = $mediawiki->api({
> +            action => 'query',
> +            titles => $titles,
> +        });
> +        if (!defined($pages)) {
> +            print STDERR "fatal: None of the pages exist \n";
> +            exit 1;
> +        }

That's not an accurate error message. If the pages don't exist, we will
actually get back a set of pages with negative ids (so you can tell
which ones exist and which ones don't). If $pages is undefined, it's
actually not a valid mediawiki repo.

Also, according to the mediawiki API, we can send only 51 titles at a
time. So we need to break this into pieces.

However, I wonder if this code path is needed at all. We are mapping
titles to page ids, so that we can later ask mediawiki for revisions by
page id. But why not just ask for revisions by title, and skip this
extra round trip to the server?

Speaking of round trips, I did have an idea for reducing round trips in
the "mostly up to date" case. We can ask for the revisions for multiple
titles at once (apparently up to 51, or 501 if you have special bot
privileges), but you will only get the latest revision for each, which
isn't sufficient for us to do anything except tell whether or not there
are any revisions to fetch.

So without the optimization, with N pages we will make N requests for
new revisions. But with it, we will make N/51 requests for the latest
revisions, and then M (where M <= N) requests, one for every page that
actually has new content.

In other words, it is a good optimization as long as fewer than 50/51
of the pages have changed, on average. So it's bad for "clone", but very
good for a subsequent "fetch". The best case is a fetch where nothing
has changed, which should need only about 1/51 as many round trips to
determine that this is the case.

-Peff
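
To make the batched "latest revision per title" check concrete, here is a
rough, untested Perl sketch along the lines described above. It is only
illustrative: the pages_needing_fetch helper, the %$last_imported_rev map
of previously imported revision ids, and the batch size of 50 are
assumptions for the sketch, not part of the patch under review.

  use strict;
  use warnings;
  use MediaWiki::API;

  # Ask for the latest revision of many titles in one request, and return
  # only the titles that have something newer than what we already have.
  # $url, $last_imported_rev (hashref: title => last imported revid) and
  # @titles are assumed to be supplied by the caller.
  sub pages_needing_fetch {
      my ($url, $last_imported_rev, @titles) = @_;
      my $mediawiki = MediaWiki::API->new;
      $mediawiki->{config}->{api_url} = "$url/api.php";

      my $batch_size = 50;   # stay under the per-request title limit
      my @need_fetch;
      while (my @batch = splice(@titles, 0, $batch_size)) {
          my $result = $mediawiki->api({
              action => 'query',
              prop   => 'revisions',
              rvprop => 'ids',
              titles => join('|', @batch),
          }) or die "fatal: could not query the mediawiki API\n";

          foreach my $page (values %{$result->{query}->{pages}}) {
              # Missing pages come back with a negative id and no revisions.
              next unless defined $page->{revisions};
              my $latest = $page->{revisions}->[0]->{revid};
              my $known  = $last_imported_rev->{$page->{title}} || 0;
              push @need_fetch, $page->{title} if $latest > $known;
          }
      }
      return @need_fetch;
  }

A caller could then run the existing one-page-at-a-time revision import
only for the titles this returns, which corresponds to the N/51 + M
request count estimated above.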