Taming Text
The engineer's guide to search and natural language processing
The book's source code is available at https://github.com/tamingtext/book
From http://blog.isabel-drost.de/index.php/archives/453/on-taming-text:
Search can be as easy as providing a single box in some corner of your web site that users can type into to find relevant pages. However, on thinking about the topic just a little more, several handy features that users have come to expect come to mind:
- Type-ahead avoids superfluous typing - it also helps to avoid spelling errors and shows exactly which queries will actually return a decent number of documents.
- Spelling correction is pretty much standard - and avoids user frustration with hard-to-spell query terms.
- Faceting is a great way to discover and explore more content, in particular when there are a few structured attributes attached to your items (prices for books, colors for cars, etc.).
- Named Entity Recognition is well known among publishers who use automatic tagging services to support their staff.
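To make two of these features a little more concrete, here is a minimal sketch of type-ahead and faceting in plain Python. It is not taken from the book (whose examples build on Lucene and related Java tooling); the function names `type_ahead` and `facet_counts` and the sample data are made up for illustration, and a real search engine would do this against an index rather than an in-memory list.

```python
from bisect import bisect_left
from collections import Counter


def type_ahead(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`.

    Assumes `sorted_terms` is sorted, so all matches sit next to each other.
    """
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


def facet_counts(items, attribute):
    """Count how many items carry each value of a structured attribute."""
    return Counter(item[attribute] for item in items if attribute in item)


# Hypothetical sample data, for illustration only.
terms = sorted(["lucene", "mahout", "search", "solr", "spelling", "stemming"])
cars = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]

print(type_ahead(terms, "sp"))
print(facet_counts(cars, "color"))
```

The sorted list keeps every term sharing a prefix in one contiguous run, which is why a single binary search is enough to find the starting point for suggestions.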
The authors of Taming Text decided to structure the book around the task of building an automatic Question Answering system. Throughout the book they present technologies that need to be orchestrated to build such an application, each of which is valuable in its own right.
In contrast to Search Patterns (which is focused mainly on the product manager perspective and contains much less technical detail), Taming Text is the book to read for any engineer working on search applications. In contrast to books like Programming Collective Intelligence, Taming Text takes you one level further by not only showing the tools to use but also explaining their inner workings so that you can adapt them exactly to your use case. To me, Taming Text is the ideal complementary book to Mahout in Action (for the machine learning part) and Lucene in Action (for the search part).
Back in 1998 it was estimated that 80% of all information is unstructured data. In order to make sense of that wealth of data we need technologies that can deal with unstructured data. Search is one of the most basic but also most powerful ways to analyse texts. With a good mixture of theoretical background and hands-on examples, Taming Text guides you through the process of building a successful search application, no matter whether you are dealing with a vast product database that you want to make more accessible to your users, with an ever-growing news archive, or with blog posts and Twitter messages that you want to extract data from.