Scalable Knowledge Harvesting with High Precision and High Recall
Citations (1)
[1] Scalable Knowledge Harvesting with High Precision and High Recall
Citing from: Ndapandula Nakashole, Martin Theobald, Gerhard Weikum
Publication info: WSDM’11, February 9–12, 2011
Cited by: Jack Park, 7:00 PM, 6 November 2013 GMT
URL:
Excerpt: Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of ngram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.