On 08/06/2012 16:22, Pavel Volek wrote:
From: Volek Pavel <me@xxxxxxxxxxxxx>

The current version of git-remote-mediawiki supports only import and
export of pages; it doesn't support import and export of file
attachments, which are also exposed by the MediaWiki API. This patch adds
the functionality to import the last versions of the files and all
versions of description pages for these files.

Signed-off-by: Pavel Volek <Pavel.Volek@xxxxxxxxxxxxxxx>
Signed-off-by: NGUYEN Kim Thuat <Kim-Thuat.Nguyen@xxxxxxxxxxxxxxx>
Signed-off-by: ROUCHER IGLESIAS Javier <roucherj@xxxxxxxxxxxxxxx>
Signed-off-by: Matthieu Moy <Matthieu.Moy@xxxxxxx>
---
 contrib/mw-to-git/git-remote-mediawiki | 290 +++++++++++++++++++++++++++------
 1 file changed, 244 insertions(+), 46 deletions(-)
I am wondering why you are showing removals for a v1 patch?
diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
index c18bfa1..9f21217 100755
--- a/contrib/mw-to-git/git-remote-mediawiki
+++ b/contrib/mw-to-git/git-remote-mediawiki
@@ -212,59 +212,230 @@ sub get_mw_pages {
 	my $user_defined;
 	if (@tracked_pages) {
 		$user_defined = 1;
-		# The user provided a list of pages titles, but we
-		# still need to query the API to get the page IDs.
-
-		my @some_pages = @tracked_pages;
-		while (@some_pages) {
-			my $last = 50;
-			if ($#some_pages < $last) {
-				$last = $#some_pages;
-			}
-			my @slice = @some_pages[0..$last];
-			get_mw_first_pages(\@slice, \%pages);
-			@some_pages = @some_pages[51..$#some_pages];
-		}
+		get_mw_tracked_pages(\%pages);
 	}
 	if (@tracked_categories) {
 		$user_defined = 1;
-		foreach my $category (@tracked_categories) {
-			if (index($category, ':') < 0) {
-				# Mediawiki requires the Category
-				# prefix, but let's not force the user
-				# to specify it.
-				$category = "Category:" . $category;
-			}
-			my $mw_pages = $mediawiki->list( {
-				action => 'query',
-				list => 'categorymembers',
-				cmtitle => $category,
-				cmlimit => 'max' } )
-				|| die $mediawiki->{error}->{code} . ': ' . $mediawiki->{error}->{details};
-			foreach my $page (@{$mw_pages}) {
-				$pages{$page->{title}} = $page;
-			}
-		}
+		get_mw_tracked_categories(\%pages);
 	}
 	if (!$user_defined) {
-		# No user-provided list, get the list of pages from
-		# the API.
-		my $mw_pages = $mediawiki->list({
-			action => 'query',
-			list => 'allpages',
-			aplimit => 500,
-		});
-		if (!defined($mw_pages)) {
-			print STDERR "fatal: could not get the list of wiki pages.\n";
-			print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
-			print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
-			exit 1;
+		get_mw_all_pages(\%pages);
+	}
+	return values(%pages);
+}
+
+sub get_mw_all_pages {
+	my $pages = shift;
+	# No user-provided list, get the list of pages from the API.
+	my $mw_pages = $mediawiki->list({
+		action => 'query',
+		list => 'allpages',
+		aplimit => 500,
+	});
+	if (!defined($mw_pages)) {
+		print STDERR "fatal: could not get the list of wiki pages.\n";
+		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+		print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
+		exit 1;
+	}
+	foreach my $page (@{$mw_pages}) {
+		$pages->{$page->{title}} = $page;
+	}
+
+	# Attach list of all pages for meadia files from the API,
+	# they are in a different namespace, only one namespace
+	# can be queried at the same moment
+	my $mw_mediapages = $mediawiki->list({
+		action => 'query',
+		list => 'allpages',
+		apnamespace => get_mw_namespace_id("File"),
+		aplimit => 500,
+	});
+	if (!defined($mw_mediapages)) {
+		print STDERR "fatal: could not get the list of media file pages.\n";
+		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+		print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
+		exit 1;
+	}
+	foreach my $page (@{$mw_mediapages}) {
+		$pages->{$page->{title}} = $page;
+	}
+}
+
+sub get_mw_tracked_pages {
+	my $pages = shift;
+	# The user provided a list of pages titles, but we
+	# still need to query the API to get the page IDs.
+	my @some_pages = @tracked_pages;
+	while (@some_pages) {
+		my $last = 50;
+		if ($#some_pages < $last) {
+			$last = $#some_pages;
+		}
+		my @slice = @some_pages[0..$last];
+		get_mw_first_pages(\@slice, \%{$pages});
+		@some_pages = @some_pages[51..$#some_pages];
+	}
+
+	# Get pages of related media files.
+	get_mw_linked_mediapages(\@tracked_pages, \%{$pages});
+}
+
+sub get_mw_tracked_categories {
+	my $pages = shift;
+	foreach my $category (@tracked_categories) {
+		if (index($category, ':') < 0) {
+			# Mediawiki requires the Category
+			# prefix, but let's not force the user
+			# to specify it.
+			$category = "Category:" . $category;
 		}
+		my $mw_pages = $mediawiki->list( {
+			action => 'query',
+			list => 'categorymembers',
+			cmtitle => $category,
+			cmlimit => 'max' } )
+			|| die $mediawiki->{error}->{code} . ': '
+				. $mediawiki->{error}->{details};
 		foreach my $page (@{$mw_pages}) {
-			$pages{$page->{title}} = $page;
+			$pages->{$page->{title}} = $page;
+		}
+
+		my @titles = map $_->{title}, @{$mw_pages};
+		# Get pages of related media files.
+		get_mw_linked_mediapages(\@titles, \%{$pages});
+	}
+}
+
+sub get_mw_linked_mediapages {
+	my $titles = shift;
+	my @titles = @{$titles};
+	my $pages = shift;
+
+	# pattern 'page1|page2|...' required by the API
+	my $mw_titles = join('|', @titles);
+
+	# Media files could be included or linked from
+	# a page, get all related
+	my $query = {
+		action => 'query',
+		prop => 'links|images',
+		titles => $mw_titles,
+		plnamespace => get_mw_namespace_id("File"),
+		pllimit => 500,
+	};
Why is there a comma after 500?
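For reference, as far as I know Perl accepts a trailing comma after the last element of a hash constructor, so this is a style question rather than a syntax error. A minimal sketch, with made-up keys and values that only mimic the query above:

```perl
use strict;
use warnings;

# Two hypothetical query hashes, identical except for the trailing comma.
my $with_comma = {
	action => 'query',
	pllimit => 500,
};
my $without_comma = {
	action => 'query',
	pllimit => 500
};

# Compare the sorted key lists of both hashes.
my $same = join(',', sort keys %{$with_comma})
	eq join(',', sort keys %{$without_comma});
print $same ? "identical\n" : "different\n";
```

Keeping the trailing comma makes later additions to the list a one-line diff, which may be why it is there.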
+ my $result = $mediawiki->api($query);
What happens if the titles in the query contain special characters that MediaWiki does not allow in filenames, like { or [? Maybe you should write a test for it, and if it doesn't work, try the functions mediawiki_clean/smudge_filename in the file git-remote-mediawiki.
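To make the concern concrete, here is a rough sketch of a check that could run before the titles are joined. The helper has_forbidden_title_chars is hypothetical (it is not in git-remote-mediawiki), the titles are made up, and the character set assumes MediaWiki's default title restrictions (# < > [ ] | { }):

```perl
use strict;
use warnings;

# Hypothetical helper, not part of git-remote-mediawiki: returns 1 if
# the title contains a character assumed to be rejected by MediaWiki
# in page titles (# < > [ ] | { }), and 0 otherwise.
sub has_forbidden_title_chars {
	my $title = shift;
	return $title =~ /[#<>\[\]|{}]/ ? 1 : 0;
}

# Made-up titles, for illustration only.
my @titles = ('Main Page', 'File:Logo.png', 'Bad|Title', 'Weird{name}');
my @safe = grep { !has_forbidden_title_chars($_) } @titles;

# Only the safe titles are joined into the 'page1|page2|...' pattern,
# so a stray '|' cannot be mistaken for a title separator.
my $mw_titles = join('|', @safe);
print "$mw_titles\n";
```

Whether such titles should be skipped, reported, or passed through the existing clean/smudge functions is a separate decision; the sketch only shows where the problem would surface.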
+
+	while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
+		my @titles;
+		if (defined($page->{links})) {
+			my @link_titles = map $_->{title}, @{$page->{links}};
+			push(@titles, @link_titles);
+		}
+		if (defined($page->{images})) {
+			my @image_titles = map $_->{title}, @{$page->{images}};
+			push(@titles, @image_titles);
+		}
+		if (@titles) {
+			get_mw_first_pages(\@titles, \%{$pages});
 		}
 	}
-	return values(%pages);
+}
+
+sub get_mw_medafile_for_mediapage_revision {
+	# Name of the file on Wiki, with the prefix.
+	my $mw_filename = shift;
+	my $timestamp = shift;
+	my %mediafile;
+
+	# Search if on MediaWiki exists a media file with given
+	# timestamp and in that case download the file.
+	my $query = {
+		action => 'query',
+		prop => 'imageinfo',
+		titles => $mw_filename,
+		iistart => $timestamp,
+		iiend => $timestamp,
+		iiprop => 'timestamp|archivename',
+		iilimit => 1,
+	};
Why is there a comma after iilimit? (This is the end of the parameter list, I think.)
+	my $result = $mediawiki->api($query);
+
+	my ($fileid, $file) = each ( %{$result->{query}->{pages}} );
+	if (defined($file->{imageinfo})) {
+		my $fileinfo = pop(@{$file->{imageinfo}});
+		if (defined($fileinfo->{archivename})) {
+			return; # now we are not able to download files from archive
+		}
+
+		my $filename; # real filename without prefix
+		if (index($mw_filename, 'File:') == 0) {
+			$filename = substr $mw_filename, 5;
+		} else {
+			$filename = substr $mw_filename, 6;
+		}
+
+		$mediafile{title} = $filename;
+		$mediafile{content} = download_mw_mediafile($mw_filename);
+	}
+	return %mediafile;
+}
+
+# Returns MediaWiki id for a canonical namespace name.
+# Ex.: "File", "Project".
+# Looks for the namespace id in the local configuration
+# variables, if it is not found asks MW API.
+sub get_mw_namespace_id {
+	mw_connect_maybe();
+
+	my $name = shift;
+
+	# Look at configuration file, if the record
+	# for that namespace is already stored.
+	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
Broken indentation, or is the line too long?
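One possible way to shorten that line, as a sketch only: build the config key in a variable first and wrap the call. Here run_git is a stub standing in for the script's own function, and the remote name and $ns_key variable are made up for the example:

```perl
use strict;
use warnings;

# Stub standing in for the script's run_git(); it only echoes its
# arguments so the sketch runs without a Git repository.
sub run_git {
	my $args = shift;
	return "ran: git $args";
}

my $remotename = 'origin'; # hypothetical remote name

# Build the config key once, then keep each statement on a short line.
my $ns_key = "remote.$remotename.namespaces";
my @tracked_namespaces = split(/[ \n]/,
	run_git("config --get-all $ns_key"));
print scalar(@tracked_namespaces), " fields\n";
```

The same $ns_key could then be reused by the later "config --add" call, which is also quite long.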
+
+	# NS not found => get namespace id from MW and store it in
+	# configuration file.
+	my $query = {
+		action => 'query',
+		meta => 'siteinfo',
+		siprop => 'namespaces',
+	};
Same question here concerning the comma.
+	my $result = $mediawiki->api($query);
+
+	while (my ($id, $ns) = each(%{$result->{query}->{namespaces}})) {
+		if (defined($ns->{canonical}) && ($ns->{canonical} eq $name)) {
+			run_git("config --add remote.". $remotename .".namespaces ". $name ."=". $ns->{id});
+			return $ns->{id};
+		}
+	}
+	die "Namespace $name was not found on MediaWiki.";
+}
+
+sub download_mw_mediafile {
+	my $filename = shift;
+
+	$mediawiki->{config}->{files_url} = $url;
+
+	my $file = $mediawiki->download( { title => $filename } );
Just wondering: what happens if $filename contains a character that is forbidden in wiki filenames, such as '{' or '|'? I am worried about it because I have run into similar issues in my own work on tests for git-remote-mediawiki.
Hope I helped :).

Simon

--
CATHEBRAS Simon
2A-ENSIMAG
Filière Ingéniérie des Systèmes d'Information
Membre Bug-Buster