Unpublished works of presumably academic quality are listed in a dedicated section. For non-academic research, as well as tools that may be useful in researching Wikipedia, see Wikipedia:Researching Wikipedia. For a WikiProject focussed on doing research on Wikipedia, see Wikipedia:WikiProject Wikidemia.
| Authors |
Title |
Conference / published in |
Year |
Online |
Notes |
Abstract |
Keywords
|
| Torsten Zesch, Christof Müller and Iryna Gurevych |
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary |
LREC'08 |
2008 |
[1] |
|
Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.
|
|
| Michael Roth and Sabine Schulte im Walde |
Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information |
LREC'08 |
2008 |
[2] |
|
Distributional, corpus-based descriptions have frequently been applied to model aspects of word meaning. However, distributional models that use corpus data as their basis have one well-known disadvantage: even though the distributional features based on corpus co-occurrence were often successful in capturing meaning aspects of the words to be described, they generally fail to capture those meaning aspects that refer to world knowledge, because coherent texts tend not to provide redundant information that is presumably available knowledge. The question we ask in this paper is whether dictionary and encyclopaedic resources might complement the distributional information in corpus data, and provide world knowledge that is missing in corpora. As test case for meaning aspects, we rely on a collection of semantic associates to German verbs and nouns. Our results indicate that a combination of the knowledge resources should be helpful in work on distributional descriptions.
|
|
| Laura Kassner, Vivi Nastase and Michael Strube |
Acquiring a Taxonomy from the German Wikipedia |
LREC'08 |
2008 |
[3] |
|
This paper presents the process of acquiring a large, domain independent, taxonomy from the German Wikipedia. We build upon a previously implemented platform that extracts a semantic network and taxonomy from the English version of the Wikipedia. We describe two accomplishments of our work: the semantic network for the German language in which isa links are identified and annotated, and an expansion of the platform for easy adaptation for a new language. We identify the platform’s strengths and shortcomings, which stem from the scarcity of free processing resources for languages other than English. We show that the taxonomy induction process is highly reliable - evaluated against the German version of WordNet, GermaNet, the resource obtained shows an accuracy of 83.34%.
|
|
| Jordi Atserias, Hugo Zaragoza, Massimiliano Ciaramita and Giuseppe Attardi |
Semantically Annotated Snapshot of the English Wikipedia |
LREC'08 |
2008 |
[4] |
|
This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and an “entity containment” derived graph, are licensed under the GNU Free Documentation License and available from http://www.yr-bcn.es/semanticWikipedia
|
|
| Adrian Iftene and Alexandra Balahur-Dobrescu |
Named Entity Relation Mining using Wikipedia |
LREC'08 |
2008 |
[5] |
|
Discovering relations among Named Entities (NEs) from large corpora is both a challenging, as well as useful task in the domain of Natural Language Processing, with applications in Information Retrieval (IR), Summarization (SUM), Question Answering (QA) and Textual Entailment (TE). The work we present resulted from the attempt to solve practical issues we were confronted with while building systems for the tasks of Textual Entailment Recognition and Question Answering, respectively. The approach consists in applying grammar induced extraction patterns on a large corpus - Wikipedia - for the extraction of relations between a given Named Entity and other Named Entities. The results obtained are high in precision, determining a reliable and useful application of the built resource.
|
|
| Gaoying Cui, Qin Lu, Wenjie Li and Yirong Chen |
Corpus Exploitation from Wikipedia for Ontology Construction |
LREC'08 |
2008 |
[6] |
|
Ontology construction usually requires a domain-specific corpus for building corresponding concept hierarchy. The domain corpus must have a good coverage of domain knowledge. Wikipedia (Wiki), the world’s largest online encyclopaedic knowledge source, is open-content, collaboratively edited, and free of charge. It covers millions of articles and still keeps on expanding continuously. These characteristics make Wiki a good candidate as domain corpus resource in ontology construction. However, the selected article collection must have considerable quality and quantity. In this paper, a novel approach is proposed to identify articles in Wiki as domain-specific corpus by using available classification information in Wiki pages. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wiki. Only articles strongly linked to this hierarchy are selected as the domain corpus. The proposed approach makes use of linked category information in Wiki pages to produce the hierarchy as a directed graph for obtaining a set of pages in the same connected branch. Ranking and filtering are then done on these pages based on the classification tree generated by the traversal algorithm. The experiment and evaluation results show that Wiki is a good resource for acquiring a relatively high-quality domain-specific corpus for ontology construction.
|
|
| Alexander E. Richman, Patrick Schone |
Mining Wiki Resources for Multilingual Named Entity Recognition |
ACL-08: HLT, pp. 1–9 |
2008 |
[7] |
|
In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's Identifinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores as high as 84.7% on independent, human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.
|
|
| Michael Kaisser |
The QuALiM Question Answering Demo: Supplementing Answers with Paragraphs drawn from Wikipedia |
ACL-08: HLT Demo Session, pp. 32–35 |
2008 |
[8] |
|
This paper describes the online demo of the QuALiM Question Answering system. While the system actually gets answers from the web by querying major search engines, during presentation answers are supplemented with relevant passages from Wikipedia. We believe that this additional information improves a user’s search experience.
|
|
| Elif Yamangil, Rani Nelken |
Mining Wikipedia Revision Histories for Improving Sentence Compression |
ACL-08: HLT, Short Papers, pp. 137–140 |
2008 |
[9] |
|
A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standardly used Ziff-Davis corpus. Using this newfound data, we propose a novel lexicalized noisy channel model for sentence compression, achieving improved results in grammaticality and compression rate criteria with a slight decrease in importance.
|
|
| Fadi Biadsy, Julia Hirschberg, Elena Filatova |
An Unsupervised Approach to Biography Production using Wikipedia |
ACL-08: HLT, pp. 807–815 |
2008 |
[10] |
|
We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges. Overall, our system significantly outperforms all systems that participated in DUC2004, according to the ROUGE-L metric, and is preferred by human subjects.
|
|
| Kai Wang, Chien-Liang Lin, Chun-Der Chen, and Shu-Chen Yang |
The adoption of Wikipedia: a community- and information quality-based view |
12th Pacific Asia Conference on Information Systems (PACIS) |
2008 |
[11] |
Wikipedia-Lab work |
|
TAM, Wikipedia, Critical Mass, Community identification, Information quality |
| Carlo A. Curino, Hyun J. Moon, Letizia Tanca, Carlo Zaniolo |
Schema Evolution in Wikipedia: toward a Web Information System Benchmark |
International Conference on Enterprise Information Systems (ICEIS) |
2008 |
[12] |
Panta Rhei Project |
Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source software MediaWiki. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
|
Schema Evolution, Benchmark, Schema Versioning, Query Rewriting
|
| Carlo A. Curino, Hyun J. Moon, Carlo Zaniolo |
Graceful Database Schema Evolution: the PRISM Workbench |
Very Large DataBases (VLDB) |
2008 |
[] |
Panta Rhei Project |
Supporting graceful schema evolution represents an unsolved problem for traditional information systems that is further exacerbated in web information systems, such as Wikipedia and public scientific databases: in these projects based on multiparty cooperation the frequency of database schema changes has increased while tolerance for downtimes has nearly disappeared. As of today, schema evolution remains an error-prone and time-consuming undertaking, because the DB Administrator (DBA) lacks the methods and tools needed to manage and automate this endeavor by (i) predicting and evaluating the effects of the proposed schema changes, (ii) rewriting queries and applications to operate on the new schema, and (iii) migrating the database. Our PRISM system takes a big first step toward addressing this pressing need by providing: (i) a language of Schema Modification Operators to express concisely complex schema changes, (ii) tools that allow the DBA to evaluate the effects of such changes, (iii) optimized translation of old queries to work on the new schema version, (iv) automatic data migration, and (v) full documentation of intervened changes as needed to support data provenance, database flash back, and historical queries. PRISM solves these problems by integrating recent theoretical advances on mapping composition and invertibility, into a design that also achieves usability and scalability. Wikipedia and its 170+ schema versions provided an invaluable testbed for validating tools and their ability to support legacy queries.
|
Schema Evolution, Graceful Evolution, Schema Versioning, Query Rewriting |
| Hyun J. Moon, Carlo A. Curino, Alin Deutsch, Chien-Yi Hou, Carlo Zaniolo |
Managing and Querying Transaction-time Databases under Schema Evolution |
Very Large DataBases (VLDB) |
2008 |
[] |
Panta Rhei Project |
The old problem of managing the history of database information is now made more urgent and complex by fast-spreading web information systems. Indeed, systems such as Wikipedia are faced with the challenge of managing the history of their databases in the face of intense database schema evolution. Our PRIMA system addresses this difficult problem by introducing two key pieces of new technology. The first is a method for publishing the history of a relational database in XML, whereby the evolution of the schema and its underlying database are given a unified representation. This temporally grouped representation makes it easy to formulate sophisticated historical queries on any given schema version using standard XQuery. The second key piece of technology provided by PRIMA is that schema evolution is transparent to the user: she writes queries against the current schema while retrieving the data from one or more schema versions. The system then performs the labor-intensive and error-prone task of rewriting such queries into equivalent ones for the appropriate versions of the schema. This feature is particularly relevant for historical queries spanning over potentially hundreds of different schema versions. The latter one is realized by (i) introducing Schema Modification Operators (SMOs) to represent the mappings between successive schema versions and (ii) an XML integrity constraint language (XIC) to efficiently rewrite the queries using the constraints established by the SMOs. The scalability of the approach has been tested against both synthetic data and real-world data from the Wikipedia DB schema evolution history.
|
Schema Evolution, Transaction Time DB, Query Rewriting |
| Fogarolli Angela and Ronchetti Marco |
Intelligent Mining and Indexing of Multi-Language e-Learning Material |
Proc. of 1st International Symposium on Intelligent Interactive Multimedia Systems and Services, KES IIMS 2008, 9-11 July 2008 Piraeus, Greece Studies in Computational Intelligence, Springer-Verlag (2008). Note: to appear. |
2008 |
|
|
In this paper we describe a method to automatically discover important concepts and their relationships in e-Lecture material. The discovered knowledge is used to display semantic aware categorizations and query suggestions for facilitating navigation inside an unstructured multimedia repository of e-Lectures. We report about an implemented approach for dealing with learning materials referring to the same event in different languages. The information acquired from the speech is combined with the documents such as presentation slides which are temporally synchronized with the video for creating new knowledge through a mapping with a taxonomy representation such as Wikipedia.
|
Content Retrieval, Content Filtering, Search over semi-structural Web sources, Multimedia, e-Learning
|
| Fogarolli Angela and Ronchetti Marco |
Discovering Semantics in Multimedia Content using Wikipedia |
Proc. of 11th BIS 2008, 5-7 May 2008 Innsbruck, Austria. Lecture Notes in Business Information Processing, pp. 48–57. Springer, Heidelberg (2008) |
2008 |
|
|
Semantic-based information retrieval is an area of ongoing work. In this paper we present a solution for giving semantic support to multimedia content information retrieval in an e-Learning environment where very often a large number of multimedia objects and information sources are used in combination. Semantic support is given through intelligent use of Wikipedia in combination with statistical Information Extraction techniques.
|
Content Retrieval, Content Filtering, Search over semi-structural Web sources, Multimedia, e-Learning |
| Tyers, F. and Pienaar, J. |
Extracting bilingual word pairs from Wikipedia |
SALTMIL workshop at Language Resources and Evaluation Conference (LREC) 2008, (To appear) |
2008 |
|
|
A bilingual dictionary or word list is an important resource for many purposes, among them, machine translation. For many language pairs these are either non-existent, or very often unavailable owing to licensing restrictions. We describe a simple, fast and computationally inexpensive method for extracting bilingual dictionary entries from Wikipedia (using the interwiki link system) and assess the performance of this method with respect to four language pairs. Precision was found to be in the 69–92% region, but open to improvement.
|
Under-resourced languages, Machine translation, Language resources, Bilingual terminology, Interwiki links |
| Fei Wu, Daniel S. Weld |
Automatically Refining the Wikipedia Infobox Ontology |
17th International World Wide Web Conference (www-08) |
2008 |
[13] |
The Intelligence in Wikipedia Project at University of Washington |
The combined efforts of human volunteers have recently extracted numerous facts from Wikipedia, storing them as machine-harvestable object-attribute-value triples in Wikipedia infoboxes. Machine learning systems, such as Kylin, use these infoboxes as training data, accurately extracting even more semantic knowledge from natural language text. But in order to realize the full power of this information, it must be situated in a cleanly-structured ontology. This paper introduces KOG, an autonomous system for refining Wikipedia’s infobox-class ontology towards this end. We cast the problem of ontology refinement as a machine learning problem and solve it using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. We present experiments demonstrating the superiority of the joint-inference approach and evaluating other aspects of our system. Using these techniques, we build a rich ontology, integrating Wikipedia’s infobox-class schemata with WordNet. We demonstrate how the resulting ontology may be used to enhance Wikipedia with improved query processing and other features.
|
Semantic Web, Ontology, Wikipedia, Markov Logic Networks |
| Maike Erdmann, Kotaro Nakayama, Takahiro Hara, Sojiro Nishio |
An Approach for Extracting Bilingual Terminology from Wikipedia |
13th International Conference on Database Systems for Advanced Applications (DASFAA, To appear) |
2008 |
[14] |
Wikipedia-Lab work |
With the demand of bilingual dictionaries covering domain-specific terminology, research in the field of automatic dictionary extraction has become popular. However, accuracy and coverage of dictionaries created based on bilingual text corpora are often not sufficient for domain-specific terms. Therefore, we present an approach to extracting bilingual dictionaries from the link structure of Wikipedia, a huge scale encyclopedia that contains a vast amount of links between articles in different languages. Our methods analyze not only these interlanguage links but extract even more translation candidates from redirect page and link text information. In an experiment, we proved the advantages of our methods compared to a traditional approach of extracting bilingual terminology from parallel corpora.
|
Wikipedia Mining, Bilingual Terminology, Link Structure Analysis |
| Kotaro Nakayama, Takahiro Hara, Sojiro Nishio |
A Search Engine for Browsing the Wikipedia Thesaurus |
13th International Conference on Database Systems for Advanced Applications, Demo session (DASFAA, To appear) |
2008 |
[15] |
Wikipedia-Lab work |
Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has various impressive characteristics such as a huge amount of articles, live updates, a dense link structure, brief link texts and URL identification for concepts. In our previous work, we proposed link structure mining algorithms to extract a huge scale and accurate association thesaurus from Wikipedia. The association thesaurus covers almost 1.3 million concepts and the significant accuracy is proved in detailed experiments. To prove its practicality, we implemented three features on the association thesaurus; a search engine for browsing Wikipedia Thesaurus, an XML Web service for the thesaurus and a Semantic Web support feature. We show these features in this demonstration.
|
Wikipedia Mining, Association Thesaurus, Link Structure Analysis, XML Web Services |
| Kotaro Nakayama, Masahiro Ito, Takahiro Hara, Sojiro Nishio |
Wikipedia Mining for Huge Scale Japanese Association Thesaurus Construction |
International Symposium on Mining And Web (IEEE MAW, To appear), in conjunction with IEEE AINA |
2008 |
[16] |
Wikipedia-Lab work |
|
Wikipedia Mining, Association Thesaurus, Link Structure Analysis
|
| Minghua Pei, Kotaro Nakayama, Takahiro Hara, Sojiro Nishio |
Constructing a Global Ontology by Concept Mapping using Wikipedia Thesaurus |
International Symposium on Mining And Web (IEEE MAW, To appear), in conjunction with IEEE AINA |
2008 |
[17] |
Wikipedia-Lab work |
|
Wikipedia Mining, Association Thesaurus, Ontology Mapping, Global Ontology |
| Joachim Schroer, Guido Hertel |
Voluntary engagement in an open web-based encyclopedia: From reading to contributing |
10th International General Online Research Conference, Hamburg, Germany |
2008 |
[18] |
|
|
wikipedia, contributors, motivation, instrumentality, intrinsic motivation |
| Martin Potthast, Benno Stein, Maik Anderka |
A Wikipedia-Based Multilingual Retrieval Model |
30th European Conference on IR Research, ECIR 2008, Glasgow |
2008 |
[19] |
|
This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document $d$ written in language $L$ we construct a concept vector $\mathbf{d}$ for $d$, where each dimension $i$ in $\mathbf{d}$ quantifies the similarity of $d$ with respect to a document $d^*_i$ chosen from the "$L$-subset" of Wikipedia. Likewise, for a second document $d'$ written in language $L'$, $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using from the $L'$-subset of the Wikipedia the topic-aligned counterparts $d'^*_i$ of our previously chosen documents.
Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of $d$ and $d'$ they are language-independent. I.e., their similarity can directly be computed with the cosine similarity measure, for instance.
We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document $d$ the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
|
multilingual retrieval model, explicit semantic analysis, wikipedia |
| Martin Potthast, Benno Stein, Robert Gerling |
Automatic Vandalism Detection in Wikipedia |
30th European Conference on IR Research, ECIR 2008, Glasgow |
2008 |
[20] |
|
We present results of a new approach to detect destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem, where vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not been addressed in the Information Retrieval literature by now. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to render vandalism detection as a machine learning task. We compiled a large number of vandalism edits in a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods that are currently applied in Wikipedia, our approach increases the F-Measure performance by 49% while being faster at the same time.
|
vandalism, machine learning, wikipedia |
| Ivan Beschastnikh, Travis Kriplean, David W. McDonald |
Wikipedian Self-Governance in Action: Motivating the Policy Lens |
Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008 |
2008 |
[21] |
|
While previous studies have used the Wikipedia dataset to provide an understanding of its growth, there have been few attempts to quantitatively analyze the establishment and evolution of the rich social practices that support this editing community. One such social practice is the enactment and creation of Wikipedian policies. We focus on the enactment of policies in discussions on the talk pages that accompany each article. These policy citations are a valuable micro-to-macro connection between everyday action, communal norms and the governance structure of Wikipedia. We find that policies are widely used by registered users and administrators, that their use is converging and stabilizing in and across these groups, and that their use illustrates the growing importance of certain classes of work, in particular source attribution. We also find that participation in Wikipedia's governance structure is inclusionary in practice.
|
policy use, governance, wikipedia |
| Andrea Forte, Amy Bruckman |
Scaling Consensus: Increasing Decentralization in Wikipedia Governance |
HICSS 2008, pp. 157-157. |
2008 |
[22] |
|
How does "self-governance" happen in Wikipedia? Through in-depth interviews with eleven individuals who have held a variety of responsibilities in the English Wikipedia, we obtained rich descriptions of how various forces produce and regulate social structures on the site. Our analysis describes Wikipedia as an organization with highly refined policies, norms, and a technological architecture that supports organizational ideals of consensus building and discussion. We describe how governance in the site is becoming increasingly decentralized as the community grows and how this is predicted by theories of commons-based governance developed in offline contexts. The trend of decentralization is noticeable with respect to both content-related decision making processes and social structures that regulate user behavior.
|
governance, wikipedia |
| Zareen Syed, Tim Finin, and Anupam Joshi |
Wikipedia as an Ontology for Describing Documents |
Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008 |
2008 |
[23] |
|
Identifying topics and concepts associated with a set of documents is a task common to many applications. It can help in the annotation and categorization of documents and be used to model a person's current interests for improving search results, business intelligence or selecting appropriate advertisements. One approach is to associate a document with a set of topics selected from a fixed ontology or vocabulary of terms. We have investigated using Wikipedia's articles and associated pages as a topic ontology for this purpose. The benefits are that the ontology terms are developed through a social process, maintained and kept current by the Wikipedia community, represent a consensus view, and have meaning that can be understood simply by reading the associated Wikipedia page. We use Wikipedia articles and the category and article link graphs to predict concepts common to a set of documents. We describe several algorithms to aggregate and refine results, including the use of spreading activation to select the most appropriate terms. While the Wikipedia category graph can be used to predict generalized concepts, the article links graph helps by predicting more specific concepts and concepts not in the category hierarchy. Our experiments demonstrate the feasibility of extending the category system with new concepts identified as a union of pages from the page link graph.
|
ontology, wikipedia, information retrieval, text classification |
| Felipe Ortega, Jesus M. Gonzalez-Barahona and Gregorio Robles |
On the Inequality of Contributions to Wikipedia |
HICSS 2008 |
2008 |
[24] |
Application of the Gini coefficient to measure the level of inequality of the contributions to the top ten language editions of Wikipedia. |
Wikipedia is one of the most successful examples of massive collaborative content development. However, many of the mechanisms and procedures that it uses are still unknown in detail. For instance, how equal (or unequal) are the contributions to it has been discussed in the last years, with no conclusive results. In this paper, we study exactly that aspect by using Lorenz curves and Gini coefficients, very well known instruments to economists. We analyze the trends in the inequality of distributions for the ten biggest language editions of Wikipedia, and their evolution over time. As a result, we have found large differences in the number of contributions by different authors (something also observed in free, open source software development), and a trend to stable patterns of inequality in the long run.
|
wikipedia |
| Anne-Marie Vercoustre, James A. Thom and Jovan Pehcevski |
Entity Ranking in Wikipedia |
SAC’08 March 16-20, 2008, Fortaleza, Ceara, Brazil |
2008 |
[25] |
|
The traditional entity extraction problem lies in the ability of extracting named entities from plain text using natural language processing techniques and intensive training from large document collections. Examples of named entities include organisations, people, locations, or dates. There are many research activities involving named entities; we are interested in entity ranking in the field of information retrieval. In this paper, we describe our approach to identifying and ranking entities from the INEX Wikipedia document collection. Wikipedia offers a number of interesting features for entity identification and ranking that we first introduce. We then describe the principles and the architecture of our entity ranking system, and introduce our methodology for evaluation. Our preliminary results show that the use of categories and the link structure of Wikipedia, together with entity examples, can significantly improve retrieval effectiveness.
|
Entity Ranking, XML Retrieval, Test collection |
| Marek Meyer, Christoph Rensing, Ralf Steinmetz |
Categorizing Learning Objects Based On Wikipedia as Substitute Corpus |
First International Workshop on Learning Object Discovery & Exchange (LODE'07), September 18, 2007, Crete, Greece |
2007 |
[26] |
Usage of Wikipedia as corpus for machine learning methods. |
As metadata is often not sufficiently provided by authors of Learning Resources, automatic metadata generation methods are used to create metadata afterwards. One kind of metadata is categorization, particularly the partition of Learning Resources into distinct subject categories. A disadvantage of state-of-the-art categorization methods is that they require corpora of sample Learning Resources. Unfortunately, large corpora of well-labeled Learning Resources are rare. This paper presents a new approach for the task of subject categorization of Learning Resources. Instead of using typical Learning Resources, the free encyclopedia Wikipedia is applied as training corpus. The approach presented in this paper is to apply the k-Nearest-Neighbors method for comparing a Learning Resource to Wikipedia articles. Different parameters have been evaluated regarding their impact on the categorization performance.
|
Wikipedia, Categorization, Metadata, kNN, Classification, Substitute Corpus, Automatic Metadata Generation |
| Overell, Simon E., and Stefan Rüger |
Geographic co-occurrence as a tool for GIR. |
4th ACM workshop on Geographical Information Retrieval. Lisbon, Portugal. |
2007 |
[27] |
|
In this paper we describe the development of a geographic co-occurrence model and how it can be applied to geographic information retrieval. The model consists of mining co-occurrences of placenames from Wikipedia, and then mapping these placenames to locations in the Getty Thesaurus of Geographical Names. We begin by quantifying the accuracy of our model and compute theoretical bounds for the accuracy achievable when applied to placename disambiguation in free text. We conclude with a discussion of the improvement such a model could provide for placename disambiguation and geographic relevance ranking over traditional methods.
|
Wikipedia, disambiguation, geographic information retrieval |
| Torsten Zesch, Iryna Gurevych |
Analysis of the Wikipedia Category Graph for NLP Applications. |
Proceedings of the TextGraphs-2 Workshop (NAACL-HLT) |
2007 |
[28] |
|
In this paper, we discuss two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. We perform a graph-theoretic analysis of the category graph, and show that it is a scale-free, small world graph like other well-known lexical semantic networks. We substantiate our findings by transferring semantic relatedness algorithms defined on WordNet to the Wikipedia category graph. To assess the usefulness of the category graph as an NLP resource, we analyze its coverage and the performance of the transferred semantic relatedness algorithms.
|
nlp, relatedness, semantic, wikipedia |
| Antonio Toral and Rafael Muñoz |
Towards a Named Entity Wordnet (NEWN) |
Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). pp. 604-608 . September 2007 |
2007 |
[29] |
poster? |
|
|
| Ulrik Brandes and Jürgen Lerner |
Visual Analysis of Controversy in User-generated Encyclopedias |
Proc. IEEE Symp. Visual Analytics Science and Technology (VAST '07), to appear. |
2007 |
[30] |
|
Wikipedia is a large and rapidly growing Web-based collaborative authoring environment, where anyone on the Internet can create, modify, and delete pages about encyclopedic topics. A remarkable property of some Wikipedia pages is that they are written by up to thousands of authors who may have contradicting opinions. In this paper we show that a visual analysis of the “who revises whom” network gives deep insight into controversies. We propose a set of analysis and visualization techniques that reveal the dominant authors of a page, the roles they play, and the alters they confront. Thereby we provide tools to understand how Wikipedia authors collaborate in the presence of controversy.
|
social network controversy editing visualisation wikipedia |
| V Jijkoun, M de Rijke |
WiQA: Evaluating Multi-lingual Focused Access to Wikipedia |
Proceedings EVIA, 2007 |
2007 |
[31] |
|
We describe our experience with WiQA 2006, a pilot task aimed at studying question answering using Wikipedia. Going beyond traditional factoid questions, the task considered at WiQA 2006 was to identify, given a source article from Wikipedia, snippets from other Wikipedia articles, possibly in languages different from the language of the source article, that add new and important information to the source article, and that do so without repetition. A total of 7 teams took part, submitting 20 runs. Our main findings are two-fold: (i) while challenging, the tasks considered at WiQA are do-able as participants achieved precision@10 scores in the .5 range and MRR scores upwards of .5; (ii) on the bilingual task, substantially higher scores were achieved than on the monolingual tasks.
|
|
| Martin Potthast |
Wikipedia in the pocket: indexing technology for near-duplicate detection and high similarity search |
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval |
2007 |
[32] |
|
We develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while having a retrieval performance comparable to a standard term query. Our approach aims at retrieval tasks such as near duplicate detection and high similarity search. To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 2 million English and German Wikipedia articles. This index--along with a search interface--fits on a conventional CD (0.7 gigabyte). The ingredients of our indexing technology are similarity hashing and minimal perfect hashing.
|
wikipedia |
| Minier, Zsolt; Bodo, Zalan; Csato, Lehel |
Wikipedia-Based Kernels for Text Categorization |
Symbolic and Numeric Algorithms for Scientific Computing, 2007. SYNASC. International Symposium on |
2007 |
[33] |
|
In recent years several models have been proposed for text categorization. Within this, one of the widely applied models is the vector space model (VSM), where independence between indexing terms, usually words, is assumed. Since training corpora sizes are relatively small compared to what would be required for a realistic number of words the generalization power of the learning algorithms is low. It is assumed that a bigger text corpus can boost the representation and hence the learning process. Based on the work of Gabrilovich and Markovitch [6], we incorporate Wikipedia articles into the system to give word distributional representation for documents. The extension with this new corpus causes dimensionality increase, therefore clustering of features is needed. We use Latent Semantic Analysis (LSA), Kernel Principal Component Analysis (KPCA) and Kernel Canonical Correlation Analysis (KCCA) and present results for these experiments on the Reuters corpus.
|
|
| Thomas, Christopher; Sheth, Amit P. |
Semantic Convergence of Wikipedia Articles |
Web Intelligence, IEEE/WIC/ACM International Conference on |
2007 |
[34] |
|
Social networking, distributed problem solving and human computation have gained high visibility. Wikipedia is a well established service that incorporates aspects of these three fields of research. For this reason it is a good object of study for determining quality of solutions in a social setting that is open, completely distributed, bottom up and not peer reviewed by certified experts. In particular, this paper aims at identifying semantic convergence of Wikipedia articles; the notion that the content of an article stays stable regardless of continuing edits. This could lead to an automatic recommendation of good article tags but also add to the usability of Wikipedia as a Web Service and to its reliability for information extraction. The methods used and the results obtained in this research can be generalized to other communities that iteratively produce textual content.
|
|
| Rada Mihalcea |
Using Wikipedia for Automatic Word Sense Disambiguation |
Proceedings of NAACL HLT, 2007 |
2007 |
[35] |
|
This paper describes a method for generating sense-tagged data using Wikipedia as a source of sense annotations. Through word sense disambiguation experiments, we show that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers.
|
|
| J Yu, JA Thom, A Tam |
Ontology evaluation using wikipedia categories for browsing |
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management |
2007 |
[36] |
|
Ontology evaluation is a maturing discipline with methodologies and measures being developed and proposed. However, evaluation methods that have been proposed have not been applied to specific examples. In this paper, we present the state-of-the-art in ontology evaluation - current methodologies, criteria and measures, analyse appropriate evaluations that are important to our application - browsing in Wikipedia, and apply these evaluations in the context of ontologies with varied properties. Specifically, we seek to evaluate ontologies based on categories found in Wikipedia.
|
browsing, ontology evaluation, user studies, wikipedia |
| Martin Wattenberg, Fernanda B. Viégas and Katherine Hollenbach |
Visualizing Activity on Wikipedia with Chromograms |
Human-Computer Interaction – INTERACT 2007 |
2007 |
[37] |
|
To investigate how participants in peer production systems allocate their time, we examine editing activity on Wikipedia, the well-known online encyclopedia. To analyze the huge edit histories of the site’s administrators we introduce a visualization technique, the chromogram, that can display very long textual sequences through a simple color coding scheme. Using chromograms we describe a set of characteristic editing patterns. In addition to confirming known patterns, such as reacting to vandalism events, we identify a distinct class of organized systematic activities. We discuss how both reactive and systematic strategies shed light on self-allocation of effort in Wikipedia, and how they may pertain to other peer-production systems.
|
Wikipedia - Visualization - Peer Production - Visualization |
| A Kittur, E Chi, BA Pendleton, B Suh, T Mytkowicz |
Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie |
25th Annual ACM Conference on Human Factors in Computing Systems (CHI 2007); 2007 April 28 - May 3; San Jose; CA. |
2007 |
[38] |
|
Wikipedia has been a resounding success story as a collaborative system with a low cost of online participation. However, it is an open question whether the success of Wikipedia results from a “wisdom of crowds” type of effect in which a large number of people each make a small number of edits, or whether it is driven by a core group of “elite” users who do the lion’s share of the work. In this study we examined how the influence of “elite” vs. “common” users changed over time in Wikipedia. The results suggest that although Wikipedia was driven by the influence of “elite” users early on, more recently there has been a dramatic shift in workload to the “common” user. We also show the same shift in del.icio.us, a very different type of social collaborative knowledge system. We discuss how these results mirror the dynamics found in more traditional social collectives, and how they can influence the design of new collaborative knowledge systems.
|
Wikipedia, Wiki, collaboration, collaborative knowledge systems, social tagging, delicious. |
| Meiqun Hu, Ee-Peng Lim, Aixin Sun, Hady W Lauw, Ba-Quy Vuong |
On improving wikipedia search using article quality |
WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management |
2007 |
[39] |
|
Wikipedia is presently the largest free-and-open online encyclopedia collaboratively edited and maintained by volunteers. While Wikipedia offers full-text search to its users, the accuracy of its relevance-based search can be compromised by poor quality articles edited by non-experts and inexperienced contributors. In this paper, we propose a framework that re-ranks Wikipedia search results considering article quality. We develop two quality measurement models, namely Basic and Peer Review, to derive article quality based on co-authoring data gathered from articles' edit history. Compared with Wikipedia's full-text search engine, Google and Wikiseek, our experimental results showed that (i) quality-only ranking produced by Peer Review gives comparable performance to that of Wikipedia and Wikiseek; (ii) Peer Review combined with relevance ranking outperforms Wikipedia's full-text search significantly, delivering search accuracy comparable to Google.
|
quality, wikipedia |
| Wilkinson, Dennis M. and Huberman, Bernardo A. |
Cooperation and quality in wikipedia |
WikiSym '07: Proceedings of the 2007 international symposium on Wikis. |
2007 |
[40] |
|
The rise of the Internet has enabled collaboration and cooperation on an unprecedentedly large scale. The online encyclopedia Wikipedia, which presently comprises 7.2 million articles created by 7.04 million distinct editors, provides a consummate example. We examined all 50 million edits made to the 1.5 million English-language Wikipedia articles and found that the high-quality articles are distinguished by a marked increase in number of edits, number of editors, and intensity of cooperative behavior, as compared to other articles of similar visibility and age. This is significant because in other domains, fruitful cooperation has proven to be difficult to sustain as the size of the collaboration increases. Furthermore, in spite of the vagaries of human behavior, we show that Wikipedia articles accrete edits according to a simple stochastic mechanism in which edits beget edits. Topics of high interest or relevance are thus naturally brought to the forefront of quality.
|
Wikipedia, collaborative authoring, cooperation, groupware |
| DPT Nguyen, Y Matsuo, M Ishizuka |
Subtree Mining for Relation Extraction from Wikipedia |
Proc. of NAACL/HLT 2007 |
2007 |
[41] |
|
In this study, we address the problem of extracting relations between entities from Wikipedia’s English articles. Our proposed method first anchors the appearance of entities in Wikipedia’s articles using neither Named Entity Recognizer (NER) nor coreference resolution tool. It then classifies the relationships between entity pairs using SVM with features extracted from the web structure and subtrees mined from the syntactic structure of text. We evaluate our method on manually annotated data from actual Wikipedia articles.
|
|
| Bongwon Suh, Ed H Chi, Bryan A Pendleton, Aniket Kittur |
Us vs. Them: Understanding Social Dynamics in Wikipedia with Revert Graph Visualizations |
Visual Analytics Science and Technology, 2007. VAST 2007. IEEE Symposium on (2007), pp. 163-170. |
2007 |
[42] |
|
Wikipedia is a wiki-based encyclopedia that has become one of the most popular collaborative on-line knowledge systems. As in any large collaborative system, as Wikipedia has grown, conflicts and coordination costs have increased dramatically. Visual analytic tools provide a mechanism for addressing these issues by enabling users to more quickly and effectively make sense of the status of a collaborative environment. In this paper we describe a model for identifying patterns of conflicts in Wikipedia articles. The model relies on users' editing history and the relationships between user edits, especially revisions that void previous edits, known as "reverts". Based on this model, we constructed Revert Graph, a tool that visualizes the overall conflict patterns between groups of users. It enables visual analysis of opinion groups and rapid interactive exploration of those relationships via detail drill-downs. We present user patterns and case studies that show the effectiveness of these techniques, and discuss how they could generalize to other systems.
|
motivation, social-network, wikipedia |
| Kittur, Aniket and Suh, Bongwon and Pendleton, Bryan A. and Chi, Ed H. |
He says, she says: conflict and coordination in Wikipedia |
CHI '07: Proceedings of the SIGCHI conference on Human factors in computing systems |
2007 |
[43] |
|
Wikipedia, a wiki-based encyclopedia, has become one of the most successful experiments in collaborative knowledge building on the Internet. As Wikipedia continues to grow, the potential for conflict and the need for coordination increase as well. This article examines the growth of such non-direct work and describes the development of tools to characterize conflict and coordination costs in Wikipedia. The results may inform the design of new collaborative knowledge systems.
|
Wiki, Wikipedia, collaboration, conflict, user model, visualization, web-based interaction |
| Davide Buscaldi and Paolo Rosso |
A Comparison of Methods for the Automatic Identification of Locations in Wikipedia |
Proceedings of GIR’07 |
2007 |
[44] |
|
In this paper we compare two methods for the automatic identification of geographical articles in encyclopedic resources such as Wikipedia. The methods are a WordNet-based method that uses a set of keywords related to geographical places, and a multinomial Naïve Bayes classifier, trained over a randomly selected subset of the English Wikipedia. This task may be included into the broader task of Named Entity classification, a well-known problem in the field of Natural Language Processing. The experiments were carried out considering both the full text of the articles and only the definition of the entity being described in the article. The obtained results show that the information contained in the page templates and the category labels is more useful than the text of the articles.
|
Algorithms, Measurement, Performance, text analysis, language models |
| Li, Yinghao and Wing and Kei and Fu |
Improving weak ad-hoc queries using wikipedia as external corpus |
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval |
2007 |
[45] |
|
In an ad-hoc retrieval task, the query is usually short and the user expects to find the relevant documents in the first several result pages. We explored the possibilities of using Wikipedia's articles as an external corpus to expand ad-hoc queries. Results show promising improvements over measures that emphasize on weak queries.
|
Wikipedia, external corpus, pseudo-relevance feedback |
| Y Watanabe, M Asahara, Y Matsumoto |
A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields |
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) |
2007 |
[46] |
|
This paper presents a method for categorizing named entities in Wikipedia. In Wikipedia, an anchor text is glossed in a linked HTML text. We formalize named entity categorization as a task of categorizing anchor texts with linked HTML texts which glosses a named entity. Using this representation, we introduce a graph structure in which anchor texts are regarded as nodes. In order to incorporate HTML structure on the graph, three types of cliques are defined based on the HTML tree structure. We propose a method with Conditional Random Fields (CRFs) to categorize the nodes on the graph. Since the defined graph may include cycles, the exact inference of CRFs is computationally expensive. We introduce an approximate inference method using Tree-based Reparameterization (TRP) to reduce computational cost. In experiments, our proposed model obtained significant improvements compared to baseline models that use Support Vector Machines.
|
|
| Simone Braun and Andreas Schmidt |
Wikis as a Technology Fostering Knowledge Maturing: What we can learn from Wikipedia |
7th International Conference on Knowledge Management (IKNOW '07), Special Track on Integrating Working and Learning in Business (IWL), 2007. |
2007 |
[47] |
|
The knowledge maturing theory opens an important macro perspective within the new paradigm of work-integrated learning. Especially wikis are interesting socio-technical systems to foster maturing activities by overcoming typical barriers. But so far, the theory has been mainly based on anecdotal evidence collected from various projects and observations. In this paper, we want to present the results of a qualitative and quantitative study of Wikipedia with respect to maturing phenomena, identifying instruments and measures indicating maturity. The findings, generalized to enterprise wikis, open the perspective on what promotes maturing on a method level and what can be used to spot maturing processes on a technology level.
|
knowledge management wiki wikipedia |
| Linyun Fu and Haofen Wang and Haiping Zhu and Huajie Zhang and Yang Wang and Yong Yu |
Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring |
Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, 4825: 127--140, 2007. |
2007 |
[48] |
|
Wikipedia, a killer application in Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality and well-structuring. However, the heavy burden of up-building and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people. Many casual users may still feel difficulties in writing high quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring, and propose an integrated solution to make Wikipedia authoring easier based on RDF graph matching, expecting making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests and auto-completes internal links between Wikipedia articles for the user; 2) a category suggestion module that helps the user place her articles in correct categories. A prototype system is implemented and experimental results show significant improvements over existing solutions to link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden of professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source.
|
semanticWeb web2.0 wikipedia |
| Sören Auer and Chris Bizer and Jens Lehmann and Georgi Kobilarov and Richard Cyganiak and Zachary Ives |
DBpedia: A Nucleus for a Web of Open Data |
Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, 4825: 715--728, 2007. |
2007 |
[49] |
|
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information can be made available on the Web for humans and machines. We describe some emerging applications from the DBpedia community and show how website operators can reduce costs by facilitating royalty-free DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data sources.
|
information retrieval mashup semantic Web wikipedia |
| Simone P. Ponzetto and Michael Strube |
An API for Measuring the Relatedness of Words in Wikipedia |
Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics: 23--30, 2007. |
2007 |
[50] |
|
We present an API for computing the semantic relatedness of words in Wikipedia.
|
api, relatedness, semantic_web, semantic, wikipedia |
| Ponzetto, Simone P. and Strube, Michael |
Deriving a Large Scale Taxonomy from Wikipedia |
Proceedings of the 22nd National Conference on Artificial Intelligence, Vancouver, B.C., 22-26 July |
2007 |
[51] |
|
We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.
|
api, relatedness, semantic web, semantic, wikipedia |
| Simone Paolo Ponzetto |
Creating a Knowledge Base from a Collaboratively Generated Encyclopedia |
Proceedings of the NAACL-HLT 2007 Doctoral Consortium, pp 9-12, Rochester, NY, April 2007 |
2007 |
[52] |
|
We present our work on using Wikipedia as a knowledge source for Natural Language Processing. We first describe our previous work on computing semantic relatedness from Wikipedia, and its application to a machine learning based coreference resolution system. Our results suggest that Wikipedia represents a semantic resource to be treasured for NLP applications, and accordingly present the work directions to be explored in the future.
|
|
| Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci |
YAWN: A Semantically Annotated Wikipedia XML Corpus |
BTW2007 |
2007 |
[53] |
|
The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples how such annotations can be exploited for high-precision queries.
|
|
| Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita & Giuseppe Attardi |
Ranking Very Many Typed Entities on Wikipedia |
CIKM ‘07: Proceedings of the Sixteenth ACM International Conference on Information and Knowledge Management |
2007 |
[54] |
|
We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.
|
|
| Sören Auer and Jens Lehmann |
What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content |
Proceedings of 4th European Semantic Web Conference; published in The Semantic Web: Research and Applications, pages 503-517 |
2007 |
[55] |
|
Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating the by far largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g. more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content, and propose strategies for quality improvements with just minor modifications of the wiki systems being currently used.
|
|
| George Bragues |
Wiki-Philosophizing in a Marketplace of Ideas: Evaluating Wikipedia's Entries on Seven Great Minds |
Social Science Research Network Working Paper Series (April 2007) |
2007 |
[56] |
|
A very conspicuous part of the new participatory media, Wikipedia has emerged as the Internet's leading source of all-purpose information, the volume and range of its articles far surpassing that of its traditional rival, the Encyclopedia Britannica. This has been accomplished by permitting virtually anyone to contribute, either by writing an original article or editing an existing one. With almost no entry barriers to the production of information, the result is that Wikipedia exhibits a perfectly competitive marketplace of ideas. It has often been argued that such a marketplace is the best guarantee that quality information will be generated and disseminated. We test this contention by examining Wikipedia's entries on seven top Western philosophers. These entries are evaluated against the consensus view elicited from four academic reference works in philosophy. Wikipedia's performance turns out to be decidedly mixed. Its average coverage rate of consensus topics is 52%, while the median rate is 56%. A qualitative analysis uncovered no outright errors, though there were significant omissions. The online encyclopedia's harnessing of the marketplace of ideas, though not unimpressive, fails to emerge as clearly superior to the traditional alternative of relying on individual expertise for information.
|
quality, wikipedia |
| Gang Wang and Yong Yu and Haiping Zhu |
PORE: Positive-Only Relation Extraction from Wikipedia Text |
Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea |
2007 |
[57] |
|
Extracting semantic relations is of great importance for the creation of the Semantic Web content. It is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern matching methods that employ information redundancy cannot work well since there is not much redundancy information in Wikipedia, compared to the Web. Multi-class classification methods are not reasonable since no classification of relation types is available in Wikipedia. In this paper, we propose PORE (Positive-Only Relation Extraction), for relation extraction from Wikipedia text. The core algorithm B-POL extends a state-of-the-art positive-only learning algorithm using bootstrapping, strong negative identification, and transductive inference to work with fewer positive training examples. We conducted experiments on several relations with different amount of training data. The experimental results show that B-POL can work effectively given only a small amount of positive training examples and it significantly outperforms the original positive learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach for Ontology Population and can be adapted to other domains.
|
annotation iswc, knowledge-extraction nlp semantic-web text-mining wikipedia |
| Fei Wu, Daniel S. Weld |
Autonomously semantifying wikipedia |
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management |
2007 |
[58] |
The Intelligence in Wikipedia Project at University of Washington |
Berners-Lee's compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping method - creating enough structured data to motivate the development of applications. This paper argues that autonomously "Semantifying Wikipedia" is the best way to solve the problem. We choose Wikipedia as an initial data source, because it is comprehensive, not too large, high-quality, and contains enough manually-derived structure to bootstrap an autonomous, self-supervised process. We identify several types of structures which can be automatically enhanced in Wikipedia (e.g., link structure, taxonomic data, infoboxes, etc.), and we describe a prototype implementation of a self-supervised, machine learning system which realizes our vision. Preliminary experiments demonstrate the high precision of our system's extracted data - in one case equaling that of humans.
|
Information Extraction, Wikipedia, Semantic Web |
| Viégas, Fernanda |
The Visual Side of Wikipedia |
System Sciences, 2007. HICSS 2007. 40th Annual Hawaii International Conference on |
2007 |
[59] |
|
The name “Wikipedia” has been associated with terms such as collaboration, volunteers, reliability, vandalism, and edit-war. Fewer people might think of “images,” “maps,” “diagrams,” “illustrations” in this context. This paper presents the burgeoning but underexplored visual side of the online encyclopedia. A survey conducted with image contributors to Wikipedia reveals key differences in collaborating around images as opposed to text. The results suggest that, even though image editing is a more isolated activity, somewhat shielded from vandalism, the sense of community is an important motivation for image contributors. By examining how contributors are appropriating text-oriented wiki technology to support collective editing around visual materials, this paper reveals the potential and some of the limitations of wikis in the realm of visual collaboration.
|
|
| Sean Hansen, Nicholas Berente, Kalle Lyytinen |
Wikipedia as Rational Discourse: An Illustration of the Emancipatory Potential of Information Systems |
Proceedings of Hawaiian International Conference of Systems Sciences, Big Island, Hawaii |
2007 |
[60] |
|
Critical social theorists often emphasize the control and surveillance aspects of information systems, building upon a characterization of information technology as a tool for increased rationalization. The emancipatory potential of information systems is often overlooked. In this paper, we apply the Habermasian ideal of rational discourse to Wikipedia as an illustration of the emancipatory potential of information systems. We conclude that Wikipedia does embody an approximation of rational discourse, while several challenges remain.
|
|
| Fissaha Adafre, Sisay; Jijkoun, Valentin; de Rijke, Maarten |
Fact Discovery in Wikipedia |
Web Intelligence, IEEE/WIC/ACM International Conference on |
2007 |
[61] |
|
We address the task of extracting focused salient information items, relevant and important for a given topic, from a large encyclopedic resource. Specifically, for a given topic (a Wikipedia article) we identify snippets from other articles in Wikipedia that contain important information for the topic of the original article, without duplicates. We compare several methods for addressing the task, and find that a mixture of content-based, link-based, and layout-based features outperforms other methods, especially in combination with the use of so-called reference corpora that capture the key properties of entities of a common type.
|
nlp, relatedness, semantic, wikipedia |
| Li, Bing; Chen, Qing-Cai; Yeung, Daniel S.; Ng, Wing W.Y.; Wang, Xiao-Long |
Exploring Wikipedia and Query Log's Ability for Text Feature Representation |
Machine Learning and Cybernetics, 2007 International Conference on |
2007 |
[62] |
|
The rapid increase of internet technology requires a better management of web page contents. Much text mining research has been conducted, such as text categorization, information retrieval, and text clustering. When machine learning methods or statistical models are applied to such a large scale of data, the first problem we have to solve is to represent a text document in a way that computers can handle. Traditionally, single words are employed as features in the Vector Space Model, and they make up the feature space for all text documents. The single-word based representation assumes word independence and doesn't consider the relations between words, which may cause information loss. This paper proposes Wiki-Query segmented features for text classification, in hopes of making better use of the text information. The experiment results show that a much better F1 value has been achieved than that of the classical single-word based text representation. This means that Wikipedia and query segmented features could better represent a text document.
|
|
| Wei Che Huang, Andrew Trotman, and Shlomo Geva |
Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia |
SIGIR 2007 Workshop on Focused Retrieval, July 27, 2007, Amsterdam |
2007 |
[63] |
|
Using the Wikipedia as a corpus, the Link-the-Wiki track, launched by INEX in 2007, aims at producing a standard procedure and metrics for the evaluation of (automated) link discovery at different element levels. In this paper, we describe the preliminary procedure for the assessment, including the topic selection, submission, pooling and evaluation. Related techniques are also presented such as the proposed DTD, submission format, XML element retrieval and the concept of Best Entry Points (BEPs). Due to the task required by LTW, it represents a considerable evaluation challenge. We propose a preliminary procedure of assessment for this stage of the LTW and also discuss the further issues for improvement. Finally, an efficiency measurement is introduced for investigation since the LTW task involves two studies: the selection of document elements that represent the topic of request and the nomination of associated links that can access different levels of the XML document.
|
Wikipedia, Link-the-Wiki, INEX, Evaluation, DTD, Best Entry Point |
| Morten Rask |
The Richness and Reach of Wikinomics: Is the Free Web-Based Encyclopedia Wikipedia Only for the Rich Countries? |
Proceedings of the Joint Conference of The International Society of Marketing Development and the Macromarketing Society, June 2-5, 2007 |
2007 |
[64] |
|
In this paper, a model of the patterns of correlation in Wikipedia, reach and richness, lays the foundation for studying whether or not the free web-based encyclopedia Wikipedia is only for developed countries. Wikipedia is used in this paper, as an illustrative case study for the enormous rise of the so-called Web 2.0 applications, a subject which has become associated with many golden promises: Instead of being at the outskirts of the global economy, the development of free or low-cost internet-based content and applications, makes it possible for poor, emerging, and transition countries to compete and collaborate on the same level as developed countries. Based upon data from 12 different Wikipedia language editions, we find that the central structural effect is on the level of human development in the current country. In other words, Wikipedia is in general, more for rich countries than for less developed countries. It is suggested that policy makers make investments in increasing the general level of literacy, education, and standard of living in their country. The main managerial implication for businesses, that will expand their social network applications to other countries, is to use the model of the patterns of correlation in Wikipedia, reach and richness, as a market screening and monitoring model.
|
Digital divide, Developing countries, Internet, Web 2.0, Social networks, Reach and richness, Wikipedia, Wikinomics, culture, language |
| Kotaro Nakayama, Takahiro Hara, Sojiro Nishio |
A Thesaurus Construction Method from Large Scale Web Dictionaries |
21st IEEE International Conference on Advanced Information Networking and Applications (AINA) |
2007 |
[65] |
Wikipedia-Lab work
|
Web-based dictionaries, such as Wikipedia, have become dramatically popular among internet users in the past several years. The important characteristic of a Web-based dictionary is not only the huge number of articles, but also the hyperlinks. Hyperlinks carry various information beyond just providing a transfer function between pages. In this paper, we propose an efficient method to analyze the link structure of Web-based dictionaries to construct an association thesaurus. We have already applied it to Wikipedia, a huge scale Web-based dictionary which has a dense link structure, as a corpus. We developed a search engine for evaluation, then conducted a number of experiments to compare our method with other traditional methods such as co-occurrence analysis.
|
Wikipedia Mining, Association Thesaurus, Link Structure Analysis, Link Text, Synonyms |
| Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández and Rafael Muñoz |
Applying Wikipedia’s Multilingual Knowledge to Cross–Lingual Question Answering |
Lecture Notes in Computer Science |
2007 |
[66] |
|
The application of the multilingual knowledge encoded in Wikipedia to an open–domain Cross–Lingual Question Answering system based on the Inter Lingual Index (ILI) module of EuroWordNet is proposed and evaluated. This strategy overcomes the problems due to ILI’s low coverage on proper nouns (Named Entities). Moreover, as these are open class words (highly changing), using a community–based up–to–date resource avoids the tedious maintenance of hand–coded bilingual dictionaries. A study reveals the importance to translate Named Entities in CL–QA and the advantages of relying on Wikipedia over ILI for doing this. Tests on questions from the Cross–Language Evaluation Forum (CLEF) justify our approach (20% of these are correctly answered thanks to Wikipedia’s Multilingual Knowledge).
|
|
| G Urdaneta, G Pierre, M van Steen |
A Decentralized Wiki Engine for Collaborative Wikipedia Hosting |
3rd International Conference on Web Information Systems and Technology (WEBIST), March 2007 |
2007 |
[67] |
|
This paper presents the design of a decentralized system for hosting large-scale wiki web sites like Wikipedia, using a collaborative approach. Our design focuses on distributing the pages that compose the wiki across a network of nodes provided by individuals and organizations willing to collaborate in hosting the wiki. We present algorithms for placing the pages so that the capacity of the nodes is not exceeded and the load is balanced, and algorithms for routing client requests to the appropriate nodes. We also address fault tolerance and security issues.
|
|
| M Hu, EP Lim, A Sun, HW Lauw, BQ Vuong |
Measuring article quality in wikipedia: models and evaluation |
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management |
2007 |
|