Wikipedia:Computer help desk/ParseMediaWikiDump
Parse::MediaWikiDump is a Perl module created by Triddle that makes accessing the information in a MediaWiki dump file easy.
Download
The latest version of Parse::MediaWikiDump is available at http://search.cpan.org/perldoc?Parse::MediaWikiDump
Examples
Find uncategorized articles in the main namespace
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';
    # categories is undefined when the page has no categories
    print $page->title, "\n" unless defined($page->categories);
}
Find double redirects in the main namespace
This program does not follow the proper case sensitivity rules for matching article titles; see the documentation that comes with the module for a much more complete version of this program.
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
my %redirs;    # redirect title => title of the redirect target
# first pass: collect every redirect in the main namespace
while (defined($page = $pages->page)) {
    next unless $page->namespace eq '';
    next unless defined($page->redirect);
    my $title = $page->title;
    $redirs{$title} = $page->redirect;
}
# second pass: a redirect whose target is itself a redirect is a double redirect
foreach my $key (keys(%redirs)) {
    my $redirect = $redirs{$key};
    if (defined($redirs{$redirect})) {
        print "$key\n";
    }
}
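The case-sensitivity caveat above can be partly addressed by normalizing titles before they are used as hash keys. The following is only a minimal sketch, not the more complete version shipped with the module: normalize_title is a hypothetical helper that is not part of Parse::MediaWikiDump, and it assumes a wiki configured like Wikipedia, where the first letter of a title is case-insensitive and underscores are equivalent to spaces. It ignores namespace prefixes and non-ASCII first letters. A small in-memory hash stands in for the %redirs hash built from a dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::MediaWikiDump): reduce a title to a
# canonical form, assuming the wiki capitalizes the first letter of titles.
sub normalize_title {
    my ($title) = @_;
    $title =~ tr/_/ /;           # underscores and spaces are equivalent
    $title =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $title =~ s/\s+/ /g;         # collapse runs of whitespace
    return ucfirst($title);      # first letter is case-insensitive
}

# Stand-in for the %redirs hash that the script above fills from a dump.
my %redirs = (
    normalize_title('Foo')  => normalize_title('bar'),
    normalize_title('Bar')  => normalize_title('Baz_qux'),
    normalize_title('Quux') => normalize_title('Baz qux'),
);

# Same double-redirect test as above, now on normalized keys and targets.
foreach my $key (sort keys %redirs) {
    my $target = $redirs{$key};
    print "$key\n" if defined $redirs{$target};
}
```

In the dump-reading script, the same helper would be applied to both $page->title and $page->redirect before they are stored in %redirs, so that a redirect to "bar" is recognized as pointing at the page "Bar".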
Import only a certain category of pages
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;
use DBI;    # DBI loads DBD::mysql itself, based on the DSN
my $server = "localhost";
my $name = "dbname";
my $user = "admin";
my $password = "pass";
my $dsn = "DBI:mysql:database=$name;host=$server;";
my $dbh = DBI->connect($dsn, $user, $password) or die "cannot connect: $DBI::errstr";
my $source = 'pages_articles.xml';
my $pages = Parse::MediaWikiDump::Pages->new($source);
print "Opened $source.\n";    # pages are read one at a time in the loop below
my $page;
while (defined($page = $pages->page)) {
    my $c = $page->categories || [];    # undefined when the page has no categories
    # Matches every category with the string "Mathematics" anywhere in its name.
    # For an exact match, use {$_ eq "Mathematics"}
    if (grep {/Mathematics/} @$c) {
        my $id = $page->id;
        my $title = $page->title;
        my $text = $page->text;    # reference to the page text; dereference with $$text
        #$dbh->do("insert ...");   # details of SQL depend on the database setup
        print "title '$title' id $id was inserted.\n";
    }
}
Extract articles linked to important wikis but not to a specific one
The script checks whether an article contains interwiki links to :de, :es, :it, :ja and :nl but not to :fr. This is useful for finding "popular" articles that still need to be linked to a specific wiki, and it may also hint at which articles should be translated first.
#!/usr/bin/perl -w
# Code : Dake
use strict;
use Parse::MediaWikiDump;
use utf8;
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
binmode STDOUT, ":utf8";
while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';
    my $text = $page->text;    # reference to the page text
    # interwiki links look like [[de:Title]], so match the literal "[[" prefix
    if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
        ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
        ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i))
    {
        print $page->title, "\n";
    }
}
Related software