Pavel Volek <Pavel.Volek@xxxxxxxxxxxxxxx> writes:

> From: Pavel VOlek <Pavel.Volek@xxxxxxxxxxxxxxx>

Did you really mean this?  It does not match your S-o-b: line below.

>
> The current version of the git-remote-mediawiki supports only import and export
> of the pages, doesn't support import and export of file attachments which are
> also exposed by MediaWiki API. This patch adds the functionality to import file
> attachments and description pages for these files.
>
> Chages version2 -> version3:
> Fixes in comments.
> Variable '$file' -> '$file_content' refactoring to be clearer.

These three lines do not belong here above the three-dash lines, I
think.

> Signed-off-by: Pavel Volek <Pavel.Volek@xxxxxxxxxxxxxxx>
> Signed-off-by: NGUYEN Kim Thuat <Kim-Thuat.Nguyen@xxxxxxxxxxxxxxx>
> Signed-off-by: ROUCHER IGLESIAS Javier <roucherj@xxxxxxxxxxxxxxx>
> Signed-off-by: Matthieu Moy <Matthieu.Moy@xxxxxxx>
> ---
>  contrib/mw-to-git/git-remote-mediawiki | 223 ++++++++++++++++++++++++++++++++-
>  1 file changed, 218 insertions(+), 5 deletions(-)
>
> diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
> index c18bfa1..04d3959 100755
> --- a/contrib/mw-to-git/git-remote-mediawiki
> +++ b/contrib/mw-to-git/git-remote-mediawiki
> @@ -13,9 +13,6 @@
>  #
>  # Known limitations:
>  #
> -# - Only wiki pages are managed, no support for [[File:...]]
> -#   attachments.
> -#

Nice.

> @@ -71,6 +68,9 @@ chomp(@tracked_pages);
>  my @tracked_categories = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".categories"));
>  chomp(@tracked_categories);
>
> +# Import media files too.
> +my $import_media = run_git("config --get --bool remote.". $remotename .".mediaimport");
> +
>  my $wiki_login = run_git("config --get remote.". $remotename .".mwLogin");
>  # TODO: ideally, this should be able to read from keyboard, but we're
>  # inside a remote helper, so our stdin is connect to git, not to a
> @@ -225,6 +225,10 @@ sub get_mw_pages {
>  			get_mw_first_pages(\@slice, \%pages);
>  			@some_pages = @some_pages[51..$#some_pages];
>  		}
> +
> +		if ($import_media) {
> +			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
> +		}

I am guessing that the loop above is there to avoid fetching and
processing too many pages at once.  Doesn't the call to
get_mw_pages_for_linked_mediafiles() need a similar consideration, or
is what the function does significantly different from what
get_mw_first_pages() does, so that there is no need to worry?

By the way, does it really have to be that overly long name?
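If it does need the same treatment, I would expect the call site to
reuse the slicing idiom from the loop just above it.  A rough,
untested sketch of what I mean (keeping the function name and
signature as posted, and reusing the same 50-title batch size):

	if ($import_media) {
		# Same consideration as for get_mw_first_pages() above:
		# do not hand the API thousands of titles in one call.
		my @media_pages = @tracked_pages;
		while (@media_pages) {
			my @slice = splice(@media_pages, 0, 50);
			get_mw_pages_for_linked_mediafiles(\@slice, \%pages);
		}
	}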
> @@ -244,6 +248,11 @@ sub get_mw_pages {
>  			foreach my $page (@{$mw_pages}) {
>  				$pages{$page->{title}} = $page;
>  			}
> +
> +			if ($import_media) {
> +				my @titles = map $_->{title}, @{$mw_pages};
> +				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
> +			}
>  		}
>  	}
>  	if (!$user_defined) {
> @@ -263,10 +272,186 @@ sub get_mw_pages {
>  		foreach my $page (@{$mw_pages}) {
>  			$pages{$page->{title}} = $page;
>  		}
> +
> +		if ($import_media) {
> +			# Attach list of all pages for media files from the API,
> +			# they are in a different namespace, only one namespace
> +			# can be queried at the same moment
> +			my $mw_pages = $mediawiki->list({
> +				action => 'query',
> +				list => 'allpages',
> +				apnamespace => get_mw_namespace_id("File"),
> +				aplimit => 500
> +			});
> +			if (!defined($mw_pages)) {
> +				print STDERR "fatal: could not get the list of pages for media files.\n";
> +				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
> +				exit 1;
> +			}
> +			foreach my $page (@{$mw_pages}) {
> +				$pages{$page->{title}} = $page;
> +			}
> +		}

For categories you need to call pages-for-mediafiles with the titles
you learned (the hunk starting at l.224), but there is no need to
call pages-for-mediafiles in this hunk?  Not a rhetorical question to
suggest that you should; just wondering.

> +sub get_mw_pages_for_linked_mediafiles {
> +	my $titles = shift;
> +	my @titles = @{$titles};

Do you really need to make a copy of this array?  Wouldn't it suffice
to say

	my $mw_titles = join('|', @$titles);

at the only location in this function that uses this parameter?

> +	my $pages = shift;
> +
> +	# pattern 'page1|page2|...' required by the API
> +	my $mw_titles = join('|', @titles);

Nobody seems to be making sure there won't be more than 500 (I am
assuming that this script is considered a 'bot) pages in the
$mw_titles variable.  Shouldn't the API call be split into such
batches?  (A rough sketch of what I mean follows further down.)

> +	# Media files could be included or linked from
> +	# a page, get all related
> +	my $query = {
> +		action => 'query',
> +		prop => 'links|images',
> +		titles => $mw_titles,
> +		plnamespace => get_mw_namespace_id("File"),
> +		pllimit => 500
> +	};
> +	my $result = $mediawiki->api($query);
> +
> +	while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
> +		my @titles;
> +		if (defined($page->{links})) {
> +			my @link_titles = map $_->{title}, @{$page->{links}};
> +			push(@titles, @link_titles);
> +		}
> +		if (defined($page->{images})) {
> +			my @image_titles = map $_->{title}, @{$page->{images}};
> +			push(@titles, @image_titles);
> +		}
> +		if (@titles) {
> +			get_mw_first_pages(\@titles, \%{$pages});
> +		}
> +	}
> +}
> +
> +# Return MediaWiki id for a canonical namespace name.
> +# Ex.: "File", "Project".
> +# Looks for the namespace id in the local configuration
> +# variables, if it is not found asks MW API.
> +sub get_mw_namespace_id {
> +	mw_connect_maybe();
> +
> +	my $name = shift;
> +
> +	# Look at configuration file, if the record
> +	# for that namespace is already stored.
> +	# Namespaces are stored in form: "Name_of_namespace:Id_namespace",
> +	# Ex.: "File:6".
> +	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
> +	chomp(@tracked_namespaces);
> +	if (@tracked_namespaces) {
> +		foreach my $ns (@tracked_namespaces) {
> +			my @ns_split = split(/:/, $ns);
> +			if ($ns_split[0] eq $name) {
> +				return $ns_split[1];
> +			}
> +		}
> +	}

Does it make sense to call out to run_git() every time this function
is called?
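Here is the batching sketch I promised above.  The splitting could
also be done inside the function itself, along these lines; this is
untested, the 500 figure assumes the script runs with 'bot limits,
and the trailing comment stands for the result processing the posted
patch already does:

	my @remaining = @{$titles};
	while (@remaining) {
		my @slice = splice(@remaining, 0, 500);
		my $result = $mediawiki->api({
			action => 'query',
			prop => 'links|images',
			titles => join('|', @slice),
			plnamespace => get_mw_namespace_id("File"),
			pllimit => 500
		});
		# ... walk $result->{query}->{pages} and merge the linked
		# and included media titles into %$pages, exactly as the
		# posted patch does for its single batch ...
	}

Whether the slicing lives here or at the call site does not matter
much to me; the point is only that the 'titles' parameter should not
grow without bound.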
Shouldn't this config-lookup part of get_mw_namespace_id() be caching
the result in a hash, something like

	if (!exists $namespace_id{$name}) {
		my @temp = ... run_git() ...;
		foreach my $ns (@temp) {
			my ($n, $s) = split(/:/, $ns);
			$namespace_id{$n} = $s;
		}
	}
	if (!exists $namespace_id{$name}) {
		... similarly, ask MW API and store in %namespace_id ...
	}
	if (exists $namespace_id{$name}) {
		return $namespace_id{$name};
	}
	die "No such namespace $name";
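For the elided "ask MW API" step, I imagine something along these
lines would do.  I am writing the shape of the siteinfo response from
memory, so treat this as an untested sketch and double-check it
against the API before relying on it:

	my $result = $mediawiki->api({
		action => 'query',
		meta => 'siteinfo',
		siprop => 'namespaces'
	});
	foreach my $ns (values(%{$result->{query}->{namespaces}})) {
		# the main namespace (id 0) has no canonical name
		next unless defined($ns->{canonical});
		$namespace_id{$ns->{canonical}} = $ns->{id};
	}

The answers learned this way could presumably also be appended to the
remote.<name>.namespaces configuration with run_git(), so that later
runs do not have to ask the wiki again.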