Re: [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments

On Tue, 12 Jun 2012 14:05:21 -0700, Junio C Hamano wrote:
Pavel Volek <Pavel.Volek@xxxxxxxxxxxxxxx> writes:

From: Pavel VOlek <Pavel.Volek@xxxxxxxxxxxxxxx>

Did you really mean this?  It does not match your S-o-b: line below.

You're right, I will change it.


The current version of git-remote-mediawiki supports only import and export of pages; it does not support import and export of file attachments, which are also exposed by the MediaWiki API. This patch adds the functionality to import file attachments and the description pages for these files.

Changes version2 -> version3:
Fixes in comments.
Renamed variable '$file' to '$file_content' to be clearer.

These three lines do not belong here above the three-dash line, I think.

OK. But should I mention somewhere the changes I made between the previous
version and the current version of the patch?
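Would putting them below the three-dash line be the right place, so that
they do not become part of the commit message? Something like:

	Signed-off-by: Pavel Volek <Pavel.Volek@xxxxxxxxxxxxxxx>
	---
	Changes version2 -> version3:
	- fixes in comments
	- renamed variable '$file' to '$file_content'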

@@ -71,6 +68,9 @@ chomp(@tracked_pages);
my @tracked_categories = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".categories"));
 chomp(@tracked_categories);

+# Import media files too.
+my $import_media = run_git("config --get --bool remote.". $remotename .".mediaimport");
+
my $wiki_login = run_git("config --get remote.". $remotename .".mwLogin");
 # TODO: ideally, this should be able to read from keyboard, but we're
 # inside a remote helper, so our stdin is connect to git, not to a
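
(A note on the new option: media import is enabled per remote with
something like

	$ git config remote.origin.mediaimport true

where "origin" stands for the remote's name; when the option is unset,
run_git() returns an empty string, so media import stays off by default.)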
@@ -225,6 +225,10 @@ sub get_mw_pages {
 			get_mw_first_pages(\@slice, \%pages);
 			@some_pages = @some_pages[51..$#some_pages];
 		}
+
+		if ($import_media) {
+			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
+		}

I am guessing that the loop above is to avoid fetching and
processing too many pages at once.  Doesn't the call to
get_mw_pages_for_linked_mediafiles() need a similar consideration,
or what the function does is significantly different from what
get_mw_first_pages() does and there is no need to worry?

By the way, does it really have to be that overly long name?

Yes, it will be better to split the processing in a similar way. As the patch stands now, there is a possibility that, because of the MW limits, we won't get all the media files.

For the name of the method I wanted to be precise and clear, and I didn't
find a more fitting name. Any suggestions are welcome.

@@ -244,6 +248,11 @@ sub get_mw_pages {
 			foreach my $page (@{$mw_pages}) {
 				$pages{$page->{title}} = $page;
 			}
+
+			if ($import_media) {
+				my @titles = map $_->{title}, @{$mw_pages};
+				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
+			}
 		}
 	}
 	if (!$user_defined) {
@@ -263,10 +272,186 @@ sub get_mw_pages {
 		foreach my $page (@{$mw_pages}) {
 			$pages{$page->{title}} = $page;
 		}
+
+		if ($import_media) {
+			# Attach list of all pages for media files from the API,
+			# they are in a different namespace, only one namespace
+			# can be queried at the same moment
+			my $mw_pages = $mediawiki->list({
+				action => 'query',
+				list => 'allpages',
+				apnamespace => get_mw_namespace_id("File"),
+				aplimit => 500
+			});
+			if (!defined($mw_pages)) {
+				print STDERR "fatal: could not get the list of pages for media files.\n";
+				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
+				exit 1;
+			}
+			foreach my $page (@{$mw_pages}) {
+				$pages{$page->{title}} = $page;
+			}
+		}

For categories you need to call pages-for-mediafiles with the titles
you learned (the hunk starting at l.224), but there is no need to
call pages-for-mediafiles in this hunk?

Actually, it is not necessary. Here the user doesn't specify the pages or
categories to be imported, so the whole wiki is imported. That's why we can
skip searching for relations between pages and files and directly ask for
all the files.

+sub get_mw_pages_for_linked_mediafiles {
+	my $titles = shift;
+	my @titles = @{$titles};

Do you really need to make a copy of this array?  Wouldn't it
suffice to say

	my $mw_titles = join('|', @$titles);

at the only location in this function that uses this parameter?

You are right, I will change it.

+	my $pages = shift;
+
+	# pattern 'page1|page2|...' required by the API
+	my $mw_titles = join('|', @titles);

Nobody seems to be making sure there won't be more than 500 (I am
assuming that this script is considered a 'bot) pages in the $mw_titles
variable.  Shouldn't the API call be split into such batches?

Right, it would be better.
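Maybe something along these lines (an untested sketch; splice() consumes
the title list in slices, and the 500-titles-per-request limit assumes the
script acts as a 'bot):

	my $batch = 500;	# MW 'titles' limit: 500 for bots, 50 otherwise
	while (@$titles) {
		my @slice = splice(@$titles, 0, $batch);
		my $mw_titles = join('|', @slice);
		# ... run the 'links|images' query below on this batch ...
	}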

+	# Media files could be included or linked from
+	# a page, get all related
+	my $query = {
+		action => 'query',
+		prop => 'links|images',
+		titles => $mw_titles,
+		plnamespace => get_mw_namespace_id("File"),
+		pllimit => 500
+	};
+	my $result = $mediawiki->api($query);
+


+
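(The rest of the function is not quoted above. As a simplified
illustration only, not the exact patch code: with prop=links|images the
API returns, for each queried page, 'links' and 'images' arrays of
{ ns, title } entries, which can be merged into the pages hash roughly
like this:

	foreach my $id (keys %{$result->{query}->{pages}}) {
		my $page = $result->{query}->{pages}->{$id};
		# 'links' is already restricted to the File namespace by
		# plnamespace; 'images' entries are always files.
		foreach my $entry (@{$page->{links} || []}, @{$page->{images} || []}) {
			$pages->{$entry->{title}} = $entry;
		}
	}

)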
+# Return MediaWiki id for a canonical namespace name.
+# Ex.: "File", "Project".
+# Looks for the namespace id in the local configuration
+# variables, if it is not found asks MW API.
+sub get_mw_namespace_id {
+	mw_connect_maybe();
+
+	my $name = shift;
+
+	# Look at configuration file, if the record
+	# for that namespace is already stored.
+	# Namespaces are stored in form: "Name_of_namespace:Id_namespace",
+	# Ex.: "File:6".
+	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
+	chomp(@tracked_namespaces);
+	if (@tracked_namespaces) {
+		foreach my $ns (@tracked_namespaces) {
+			my @ns_split = split(/:/, $ns);
+			if ($ns_split[0] eq $name) {
+				return $ns_split[1];
+			}
+		}
+	}

Does it make sense to call out to run_git() every time this function
is called?  Shouldn't this part be caching the result in a hash,
something like

	if (!exists $namespace_id{$name}) {
		@temp = ... run_git() ...;
		foreach my $ns (@temp) {
			my ($n, $s) = split(/:/, $ns);
			$namespace_id{$n} = $s;
		}
	}

	if (!exists $namespace_id{$name}) {
		... similarly, ask MW API and store in %namespace_id{}
	}

	if (exists $namespace_id{$name}) {
		return $namespace_id{$name};
	}
	die "No such namespace $name";

Good idea, I will look at it more closely!
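A first rough attempt at what you describe (an untested sketch, caching
both the configuration and the siteinfo answer in a file-scope hash):

	my %namespace_id;

	sub get_mw_namespace_id {
		mw_connect_maybe();
		my $name = shift;

		if (!exists $namespace_id{$name}) {
			# Fill the cache from the configuration, where namespaces
			# are stored as "Name_of_namespace:Id_namespace" pairs.
			my @temp = split(/[ \n]/, run_git("config --get-all remote."
				. $remotename . ".namespaces"));
			chomp(@temp);
			foreach my $ns (@temp) {
				my ($n, $id) = split(/:/, $ns);
				$namespace_id{$n} = $id;
			}
		}

		if (!exists $namespace_id{$name}) {
			# Not in the cache nor in the configuration: ask the wiki.
			my $result = $mediawiki->api({
				action => 'query',
				meta => 'siteinfo',
				siprop => 'namespaces'
			});
			foreach my $ns (values %{$result->{query}->{namespaces}}) {
				if (defined($ns->{canonical})) {
					$namespace_id{$ns->{canonical}} = $ns->{id};
				}
			}
		}

		if (exists $namespace_id{$name}) {
			return $namespace_id{$name};
		}
		die "No such namespace $name on the wiki.\n";
	}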

Thanks for your review!

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

