Open Access News

News from the open access movement

Wednesday, May 28, 2008

Combining OA, wikis, community annotation, semantic processing, and text mining

Barend Mons and 22 co-authors, Calling on a Million Minds for Community Annotation in WikiProteins, Genome Biology, May 28, 2008.

Abstract: WikiProteins enables Community Annotation in an Open Access, Wiki-based system. Extracts of major data sources have been fused into an editable environment with a link out to the original sources. Data from Community edits take place on automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. The concepts are selected from authoritative ontologies or databases. In addition, indirect associations via concept profile matching have been calculated. We here call on a 'million minds' to annotate a 'million concepts' and collect new facts from full text literature with the immediate reward of collaborative knowledge discovery.

I've omitted the links from the abstract because they presuppose a technology I don't have on my blog, apparently the technology described in the article. To see it in action, surf over to the article itself. Keywords are highlighted in different colors: blue for anatomy, yellow for genes and molecular sequences, green for living beings, and so on. (Hover your mouse over a colored keyword to see its category.) Clicking a keyword pops up a small window with a user-editable definition. The window also offers the options to run a search on the term or to look up its entry in WikiProfessional or its "knowlet" in the Concept Web. Unfortunately, users don't have the option to open the WikiPro or Concept Web entries in a new window, forcing us to leave the article we're trying to read. My copy of Windows XP wanted to run Microsoft's MSXML 5.0 in order to read the article, and I refused, so I may be missing some of its functionality.

From the Rationale and overview section of the paper (again without links):

This paper aims to explain an experimental system for Community Annotation and collaborative knowledge discovery called WikiProteins. The exploding number of papers abstracted in PubMed has prompted many attempts to capture information automatically from the literature and from primary data into computer readable, unambiguous format. When done manually and by dedicated experts, this process is frequently referred to as curation. The automated computational approach is broadly referred to as text mining....We propose here that a combination of text mining and subsequent community annotation of relationships between concepts in a collaborative environment is the way forward. The future outlook to integrate data mining (for instance gene co-expression data) with literature mining, as formulated in the review by Jensen et al, is at the core of what we aim for at the text mining/data mining interface. To support the capturing of qualitative as well as quantitative data of different nature into a light, flexible, and dynamic ontology format we developed a software component called Knowlets. The Knowlets combine multiple attributes and values for relationships between concepts. Scientific publications contain many re-iterations of factual statements. The Knowlet records relationships between two concepts only once....This approach results in a minimal growth of the Concept Space as compared to the text space....PubMed grew beyond 14,000,000 abstracts in 2006 (by the end of 2007 the 17,000,000 mark was passed). In 2006, UMLS contained well over 1,300,000 concepts. Only 185,262 concepts from UMLS were actually mentioned in PubMed (2006 version) and therefore the concept space of the entire PubMed corpus could be captured in just over 185,000 Knowlets. The first section of this article describes the WikiProteins application and rationale in general terms. The second section describes three user scenarios enabled by the current status of the Knowlet-based Wiki system. In the third section (provided as supplementary data) a more detailed technical description of the system is given.
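
To make the "records relationships between two concepts only once" idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy vocabulary, the concept identifiers, and the function names are all invented for illustration, and plain substring matching stands in for real concept recognition. It only shows how sentence-level co-occurrence can be collapsed so that repeated statements add to a count instead of adding new entries.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical toy vocabulary (an assumption, not taken from UMLS or the paper):
# surface term -> concept identifier.
VOCAB = {
    "brca1": "C:BRCA1",
    "breast cancer": "C:BreastCancer",
    "dna repair": "C:DNARepair",
}

def concepts_in(sentence):
    """Return the set of concept IDs whose terms appear in the sentence."""
    text = sentence.lower()
    return {cid for term, cid in VOCAB.items() if term in text}

def build_pair_store(abstracts):
    """Map each unordered concept pair to the number of sentences mentioning both.

    A pair is stored once; further co-occurrences only raise its count.
    """
    store = defaultdict(int)
    for abstract in abstracts:
        # Naive sentence split on end punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            for pair in combinations(sorted(concepts_in(sentence)), 2):
                store[pair] += 1
    return dict(store)

abstracts = [
    "BRCA1 is linked to breast cancer. BRCA1 takes part in DNA repair.",
    "Mutations in BRCA1 raise the risk of breast cancer.",
    "Breast cancer risk is associated with BRCA1.",
]
store = build_pair_store(abstracts)
for pair, count in sorted(store.items()):
    print(pair, count)
```

In this toy run, four sentences yield only two stored pairs, which is the sense in which the concept space grows more slowly than the text space.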

From today's press release:

Today sees the launch of a new collaborative website initially focusing on proteins and their role in biology and medicine. The WikiProfessional technology underlying the site has been developed based upon the collaborative Wikipedia approach. Described in BioMed Central’s open access journal Genome Biology, WikiProteins provides a method for community annotation on a huge scale.

The article is written by Barend Mons of the Erasmus Medical Center in Rotterdam, and the Leiden University Medical Center...and his co-authors...include Amos Bairoch of UniProt, Michael Ashburner of GO and Jimmy Wales, the co-founder of Wikipedia.

The source material for WikiProteins comes from a mixture of existing authoritative databases (such as the Unified Medical Language System, UniProtKB/Swiss-Prot, IntAct and GO), supplemented by concepts mined from scientific papers published in public literature databases. The automated data mining identifies ‘facts’ in these available resources, such as protein functions or protein-disease relationships. This process created over one million biomedical concept clouds – called ‘Knowlets’ – around each individual concept. The developers of the site now hope that many researchers will follow their call to annotate, via WikiProteins, the Knowlets for which they are leading experts. The method enables researchers to add data even from sources that are not openly available, such as from journals only accessible via publishers’ databases, immensely enhancing the potential for comprehensive coverage. Each page of text called up via the system is automatically indexed and concepts are connected to the WikiSpace, so that their definition comes up and the information can be edited directly from the page.

The resulting data in the Wiki is fully and freely accessible to the public, and entries can be annotated by any registered user. Mons said: “We here call on a million minds to annotate a million concepts and collect new facts from full-text literature with the immediate reward of collaborative knowledge discovery and recognition of Wiki-contributions to the scientific community.”

PS: For background, see our earlier posts on WikiProteins, WikiProfessional, Knowlets, and Knewco (the company behind both WikiProfessional and Knowlets).

Update. Also see Jan Velterop's post on WikiProfessional. I'm new to the technology, but Jan is the CEO of Knewco, the company behind it. Excerpt:

...The idea is that the combined efforts of a ‘million minds’ would be able, in a collaborative intelligence exercise, to refine a system that 'distills' the essence of established knowledge as well as points to new knowledge that has a high likelihood of being established soon....

The concept (so to speak) is so far optimized for the life sciences and medicine, but there is no reason why it shouldn’t work in other areas as well. And in languages other than English. It is based on concepts, and those are of course valid in any language. It’s just the words or descriptions used for them are different....

Just imagine what that means. One of the beauties of the concept approach (as opposed to the keyword approach) is that search terms in one language could, for instance, yield search results in another. Think of Chinese researchers searching with Chinese terms for English literature (they can read English, but may find it more difficult to come up with search terms in English, in the same way that I find it sometimes easier to search with Dutch terms), yet getting served up with English search results. Things like that. Wonderful....
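
The cross-language point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how WikiProfessional works internally: the term table, concept identifiers, documents, and function names are hypothetical. The idea is that queries and documents are both mapped to concept identifiers, so a Dutch search term can retrieve an English document.

```python
# Hypothetical term-to-concept table; identifiers and entries are invented
# for illustration and do not come from any real ontology.
TERM_TO_CONCEPT = {
    "heart attack": "C001",
    "myocardial infarction": "C001",
    "hartaanval": "C001",      # Dutch
    "diabetes": "C002",
    "suikerziekte": "C002",    # Dutch
}

# Hypothetical English-language documents.
DOCUMENTS = {
    "doc1": "Aspirin after myocardial infarction reduces mortality.",
    "doc2": "Diet and exercise in the management of diabetes.",
}

def index_documents(documents):
    """Build a concept ID -> set of document IDs index via substring matching."""
    index = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for term, cid in TERM_TO_CONCEPT.items():
            if term in lowered:
                index.setdefault(cid, set()).add(doc_id)
    return index

def search(query, index):
    """Resolve the query term to a concept, then return matching document IDs."""
    cid = TERM_TO_CONCEPT.get(query.lower())
    return sorted(index.get(cid, set()))

INDEX = index_documents(DOCUMENTS)
print(search("hartaanval", INDEX))    # Dutch query, English result
print(search("suikerziekte", INDEX))
```

A keyword index would find nothing for "hartaanval" in these documents; the concept index finds doc1 because both the query and the phrase "myocardial infarction" resolve to the same identifier.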

Update. See Euan Adie's critical comments on WikiProteins and Barend Mons' response.

Posted by Peter Suber at 5/28/2008 08:07:00 AM.