Arnaud Lacurie <arnaud.lacurie@xxxxxxxxxxxxxxx> writes:

>  contrib/mw-to-git/git-remote-mediawiki     |  252 ++++++++++++++++++++++++++++
>  contrib/mw-to-git/git-remote-mediawiki.txt |    7 +
>  2 files changed, 259 insertions(+), 0 deletions(-)

It is pleasing to see that half of a custom backend can be done in just
250 lines of code.  I understand that this is a work in progress with
many unnecessary lines spitting debugging output to STDERR, whose
removal will further shrink the code?

> +# commands parser
> +my $loop = 1;
> +my $entry;
> +my @cmd;
> +while ($loop) {

This is somewhat unusual-looking loop control.  Wouldn't
"while (1) { ...; last if (...); if (...) { last; } }" do?

> +	$| = 1; #flush STDOUT
> +	$entry = <STDIN>;
> +	print STDERR $entry;
> +	chomp($entry);
> +	@cmd = undef;
> +	@cmd = split(/ /,$entry);
> +	switch ($cmd[0]) {
> +		case "capabilities" {
> +			if ($cmd[1] eq "") {
> +				mw_capabilities();
> +			} else {
> +				$loop = 0;

I presume that this is "we were expecting to read the capabilities
command but found something unexpected; let's abort".  Don't you want
to say something to the user here, perhaps on STDERR?

> +			}
> +		}
> ...
> +		case "option" {
> +			mw_option($cmd[1],$cmd[2]);
> +		}

No error checking only for this one?

> +		case "push" {
> +			#check the pattern +<src>:<dist>

The latter one is usually spelled <dst>, standing for "destination".

> +			my @pushargs = split(/:/,$cmd[1]);
> +			if ($pushargs[1] ne "" && $pushargs[2] eq ""
> +			&& (substr($pushargs[0],0,1) eq "+")) {
> +				mw_push(substr($pushargs[0],1),$pushargs[1]);
> +			} else {
> +				$loop = 0;
> +			}

Is "push" always forcing?

> +sub mw_import {
> +	my @wiki_name = split(/:\/\//,$url);
> +	my $wiki_name = $wiki_name[1];
> +
> +	my $mediawiki = MediaWiki::API->new;
> +	$mediawiki->{config}->{api_url} = "$url/api.php";
> +
> +	my $pages = $mediawiki->list({
> +		action => 'query',
> +		list => 'allpages',
> +		aplimit => 500,
> +	});
> +	if ($pages == undef) {
> +		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +		print STDERR "fatal: make sure '$url/api.php' is a valid page\n";
> +		exit;
> +	}
> +
> +	my @revisions;
> +	print STDERR "Searching revisions...\n";
> +	my $fetch_from = get_last_local_revision() + 1;
> +	my $n = 1;
> +	foreach my $page (@$pages) {
> +		my $id = $page->{pageid};
> +
> +		print STDERR "$n/", scalar(@$pages), ": $page->{title}\n";
> +		$n++;
> +
> +		my $query = {
> +			action => 'query',
> +			prop => 'revisions',
> +			rvprop => 'ids',
> +			rvdir => 'newer',
> +			rvstartid => $fetch_from,
> +			rvlimit => 500,
> +			pageids => $page->{pageid},
> +		};
> +
> +		my $revnum = 0;
> +		# Get 500 revisions at a time due to the mediawiki api limit

It's nice that you can dig deeper with rvlimit increments.  I wonder if
'allpages' also lets you retrieve more than 500 pages in total by
somehow iterating over the set of pages.

> +	# Creation of the fast-import stream
> +	print STDERR "Fetching & writing export data...\n";
> +	binmode STDOUT, ':binary';
> +	$n = 0;
> +
> +	foreach my $pagerevids (sort {$a->{revid} <=> $b->{revid}} @revisions) {
> +		#fetch the content of the pages
> +		my $query = {
> +			action => 'query',
> +			prop => 'revisions',
> +			rvprop => 'content|timestamp|comment|user|ids',
> +			revids => $pagerevids->{revid},
> +		};
> +
> +		my $result = $mediawiki->api($query);
> +
> +		my $rev = pop(@{$result->{query}->{pages}->{$pagerevids->{pageid}}->{revisions}});

Is the list of per-page revisions guaranteed to be sorted (not a
rhetorical question; just asking)?
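If it is not, it may be safer not to depend on pop() happening to hand
back the revision you care about.  A rough sketch (untested, and
assuming the response keeps the shape you use above) that picks the
revision by its id instead:

	my @page_revs = @{$result->{query}->{pages}->{$pagerevids->{pageid}}->{revisions}};
	# Select the revision we actually asked for rather than relying
	# on whatever order the API happens to return them in.
	my ($rev) = grep { $_->{revid} == $pagerevids->{revid} } @page_revs;

That would keep working even if the API ever returned more than one
revision for the page or changed its ordering.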
> + print "commit refs/mediawiki/$remotename/master\n"; > + print "mark :$n\n"; > + print "committer $user <$user\@$wiki_name> ", $dt->epoch, " +0000\n"; > + print "data ", bytes::length(encode_utf8($comment)), "\n", encode_utf8($comment); Calling encode_utf8() twice on the same data? How big is this $comment typically? Or does encode_utf8() somehow memoize? > + # If it's not a clone, needs to know where to start from > + if ($fetch_from != 1 && $n == 1) { > + print "from refs/mediawiki/$remotename/master^0\n"; > + } > + print "M 644 inline $title.wiki\n"; > + print "data ", bytes::length(encode_utf8($content)), "\n", encode_utf8($content); Same for $content, which presumably is larger than $comment... Perhaps a small helper sub literal_data { my ($content) = @_; print "data ", bytes::length($content), "\n", $content; } would help here, above, and below where you create a "note" on this commit? > + # mediawiki revision number in the git note > + my $note_comment = encode_utf8("note added by git-mediawiki"); > + my $note_comment_length = bytes::length($note_comment); > + my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n"); > + my $note_content_length = bytes::length($note_content); > + > + if ($fetch_from == 1 && $n == 1) { > + print "reset refs/notes/commits\n"; > + } > + print "commit refs/notes/commits\n"; > + print "committer $user <user\@example.com> ", $dt->epoch, " +0000\n"; > + print "data ", $note_comment_length, "\n", $note_comment; With that, this will become literal_data(encode_utf8("note added by git-mediawiki")); and you don't need two extra variables. Same for $note_content*. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html