On Thu, Jun 02, 2011 at 11:28:31AM +0200, Arnaud Lacurie wrote:

> +sub mw_import {
> [...]
> +	# Get 500 revisions at a time due to the mediawiki api limit
> +	while (1) {
> +		my $result = $mediawiki->api($query);
> +
> +		# Parse each of those 500 revisions
> +		foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
> +			my $page_rev_ids;
> +			$page_rev_ids->{pageid} = $page->{pageid};
> +			$page_rev_ids->{revid} = $revision->{revid};
> +			push (@revisions, $page_rev_ids);
> +			$revnum++;
> +		}
> +		last unless $result->{'query-continue'};
> +		$query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
> +		print "\n";
> +	}

What is this newline at the end here for? With it, my import reliably
fails with:

  fatal: Unsupported command: 
  fast-import: dumping crash report to .git/fast_import_crash_6091

Removing it seems to make things work.

> +	my $user = $rev->{user} || 'Anonymous';
> +	my $dt = DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
> +
> +	my $comment = defined $rev->{comment} ? $rev->{comment} : '*Empty MediaWiki Message*';

While importing the git wiki, I ran into an empty timestamp. This throws
an exception which kills the whole import:

  $ git clone mediawiki::https://git.wiki.kernel.org/ git-wiki
  2821/7949: Revision n°4210 of GitSurvey
  Invalid date format:  at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 195
  	main::mw_import('https://git.wiki.kernel.org/') called at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 42

At the very least, we should intercept this and put in some placeholder
timestamp. I'm not sure what the best placeholder would be. Maybe use
the date from the previous revision, plus one second? Or maybe there is
some other bug causing us to have an empty timestamp; I haven't dug
deeper yet.
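Something along these lines could implement the placeholder idea. This is
only a sketch, not code from the patch: the helper name is made up, and it
uses core Time::Piece (returning an epoch) instead of the patch's
DateTime::Format::ISO8601, just to keep the example self-contained:

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper (not in the patch): parse a MediaWiki ISO8601
# timestamp like "2011-06-02T11:28:31Z" into an epoch. When the field
# is empty or unparseable (which currently aborts the whole import),
# fall back to one second after the previous revision's epoch, or to
# the epoch origin if this is the first revision we have seen.
sub parse_timestamp_or_fallback {
	my ($timestamp, $previous_epoch) = @_;
	my $t = eval {
		Time::Piece->strptime($timestamp, '%Y-%m-%dT%H:%M:%SZ')
	};
	return $t->epoch if defined $t;
	return defined $previous_epoch ? $previous_epoch + 1 : 0;
}
```

The eval keeps a single bad revision from killing the import; whether
"previous + 1s" is the right placeholder is still an open question.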
> +		# mediawiki revision number in the git note
> +		my $note_comment = encode_utf8("note added by git-mediawiki");
> +		my $note_comment_length = bytes::length($note_comment);
> +		my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
> +		my $note_content_length = bytes::length($note_content);
> +
> +		if ($fetch_from == 1 && $n == 1) {
> +			print "reset refs/notes/commits\n";
> +		}
> +		print "commit refs/notes/commits\n";

Should these go in refs/notes/commits? I don't think we have "best
practices" for the notes namespaces yet, as notes are still a relatively
new concept. But I always thought "refs/notes/commits" would be for the
user's "regular" notes, and that programmatic things would get their own
namespace, like "refs/notes/mediawiki". That wouldn't show them by
default, but you could do:

  git log --notes=mediawiki

to see them (and maybe that is a feature, because most of the time you
won't care about the mediawiki revision).

> +	} else {
> +		print STDERR "You appear to have cloned an empty mediawiki\n";
> +		# What do we have to do here? If nothing is done, an error is
> +		# thrown saying that HEAD is referring to unknown object
> +		# 0000000000000000000
> +	}

Hmm. We do allow cloning empty git repos. It might be nice for there to
be some way for a remote helper to signal "everything OK, but the result
is empty". But I think that is probably something that needs to be added
to the remote-helper protocol, and so is outside the scope of your
script (maybe it is as simple as interpreting the null sha1 as "empty";
I dunno).

Overall, it's looking pretty good. I like that I can resume a
half-finished import via "git fetch". Though I do have one complaint:
running "git fetch" fetches the metainfo for every revision of every
page, just as it does for an initial clone. Is there something in the
mediawiki API to say "show me revisions since N" (where N would be the
mediawiki revision of the tip of what we imported)?
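The API does appear to support this: prop=revisions takes rvdir=newer
together with rvstartid, so a fetch could ask only for revisions after
the tip we already imported. A rough sketch of the query shape (the
helper name is made up, and I haven't run this against a live wiki):

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): build a prop=revisions query
# that returns only revisions newer than $last_revid, the
# mediawiki_revision recorded in the git notes for our imported tip.
# With this, "git fetch" would not re-download metainfo for every
# revision of every page.
sub build_incremental_query {
	my ($pageid, $last_revid) = @_;
	return {
		action    => 'query',
		pageids   => $pageid,
		prop      => 'revisions',
		rvprop    => 'ids|timestamp|comment|user|content',
		rvdir     => 'newer',           # enumerate oldest-to-newest
		rvstartid => $last_revid + 1,   # rvstartid is inclusive; skip
		                                # the revision we already have
		rvlimit   => 500,               # api limit per request
	};
}
```

The resulting hash would be passed to $mediawiki->api() just like the
existing $query, with the same query-continue paging loop on top.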
-Peff