The last few years have seen a massive increase in the volume of publications indexed in PubMed and PubMed Central. At the beginning of 2012, the freely accessible part of both resources contains more than 22 million documents and grows by two to three new entries every minute of the year. With the amount of published research increasing exponentially, most of the knowledge contained in these articles is effectively lost in a sea of publications, because no existing knowledge management system can process this amount of information in a time-efficient manner. For this reason, multiple text mining systems have been developed that attempt to automatically extract relevant knowledge from human-written texts, with varying degrees of usability, efficiency, and further applicability.
The text mining system Excerbt was developed at the Institute of Bioinformatics and Systems Biology at the Helmholtz Zentrum München. While it provided a proven solution for extracting relevant knowledge from texts using a technique called semantic role labeling, it lacked efficient means of handling the exponentially growing volume of publications.
Within the scope of this thesis, I brought the text mining system Excerbt to a new level by combining its proven approaches with a completely new cloud-based hardware and software architecture and a corresponding redevelopment of the core processes. Specifically, the new system employs an approach based on the Hadoop MapReduce framework and the Bigtable-style database HBase. This has several advantages: knowledge extraction becomes more efficient, and fast ad-hoc analyses can be run on the extracted data with greatly reduced manual work.
The data extracted by the new Excerbt is stored in HBase as a topic map in graph form. The graph's nodes and edges represent biomedical entities and their interactions as derived from the literature. At the beginning of 2012, this graph contains approximately 600,000 nodes and 75 million edges. Results of ad-hoc analyses on this graph indicate that it exhibits the scale-free property, as expected and as often observed for organically grown networks.
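To illustrate what such an ad-hoc check for the scale-free property might look like, the following minimal Python sketch computes a degree distribution from an edge list and fits a straight line to it on log-log axes. It is not the analysis performed in the thesis: the function names `degree_distribution` and `loglog_slope` are hypothetical, the graph is assumed to be given as an in-memory list of undirected edges rather than read from HBase, and a log-log slope is only a rough diagnostic, not a rigorous statistical test of a power law.

```python
from collections import Counter
import math


def degree_distribution(edges):
    """Return a mapping {degree: number of nodes with that degree}
    for an undirected edge list of (node, node) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(node count) against log(degree).

    A roughly linear relationship with a negative slope is the pattern
    one would look for in a scale-free network (assumption: crude
    diagnostic only)."""
    points = [
        (math.log(k), math.log(n))
        for k, n in distribution.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    # Toy star graph: one hub connected to four leaves.
    toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
    dist = degree_distribution(toy_edges)
    print(dict(dist), loglog_slope(dist))
```

On a graph with millions of edges, the degree counting step would more plausibly be expressed as a distributed job; the sketch only shows the logic on a toy scale.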
Based on the knowledge stored in Excerbt, I have further developed a method that uses a scoring system to point scientists to under- and over-specified areas within the graph that are suggested for further research. These areas denote edges between nodes that either exist in the graph but are unlikely, or do not exist but seem plausible given the neighboring network structure. The approach is loosely based on locally clustered egocentric network analysis and assigns scores to all existing and hypothetical edges in a network based on their surroundings.
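The idea of scoring existing and hypothetical edges by their surroundings can be sketched in Python as follows. This is not the scoring scheme developed in the thesis: as a stand-in, the sketch uses the Jaccard overlap of the two endpoints' neighbor sets, and all names (`build_adjacency`, `neighborhood_score`, `score_all_pairs`) are hypothetical. Under this assumption, an existing edge with a low score would count as unlikely, and a missing edge with a high score would count as plausible.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def neighborhood_score(adjacency, u, v):
    """Jaccard overlap of the neighbors of u and v, ignoring u and v
    themselves (assumption: a stand-in for the thesis's own score)."""
    nu = adjacency[u] - {v}
    nv = adjacency[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def score_all_pairs(edges):
    """Score every node pair; return (existing, hypothetical) dicts
    keyed by sorted node pairs."""
    adjacency = build_adjacency(edges)
    existing, hypothetical = {}, {}
    for u, v in combinations(sorted(adjacency), 2):
        score = neighborhood_score(adjacency, u, v)
        target = existing if v in adjacency[u] else hypothetical
        target[(u, v)] = score
    return existing, hypothetical


if __name__ == "__main__":
    toy_edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("a", "e")]
    existing, hypothetical = score_all_pairs(toy_edges)
    # Existing edges with low scores are candidates for "unlikely";
    # missing edges with high scores are candidates for "plausible".
    print(sorted(existing.items(), key=lambda item: item[1])[:3])
    print(sorted(hypothetical.items(), key=lambda item: -item[1])[:3])
```

Enumerating all node pairs is quadratic in the number of nodes and only workable on toy graphs; at the scale of the Excerbt graph, candidates would have to be restricted to local neighborhoods.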
With these developments, I have brought knowledge extraction to the next generation. The system's capability to handle exponentially increasing data sets, together with the automated suggestion of research targets, covers a broad range of knowledge extraction and management tasks and opens up many ways to perform further research on automatically extracted big data knowledge.