A concept search (or conceptual search) is an automated information retrieval method that is used to search electronically stored unstructured text (for example, digital archives, email, scientific literature, etc.) for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.
[...]


Concept search techniques were developed because of limitations imposed by classical Boolean keyword search technologies when dealing with large, unstructured digital collections of text. Keyword searches often return results that include many non-relevant items (false positives) or that exclude too many relevant items (false negatives) because of the effects of synonymy and polysemy. Synonymy means that two or more words in the same language have the same meaning, and polysemy means that an individual word has more than one meaning.
Polysemy is a major obstacle for all computer systems that attempt to deal with human language. In English, most frequently used terms have several common meanings. For example, the word fire can mean a combustion activity; to terminate employment; to launch; or to excite (as in fire up). For the 200 most-polysemous terms in English, the typical verb has more than twelve common meanings, or senses, and the typical noun has more than eight. For the 2000 most-polysemous terms in English, the typical verb has more than eight common senses and the typical noun has more than five.
In addition to the problems of polysemy and synonymy, keyword searches can miss inadvertently misspelled words as well as variations on the stems (or roots) of words (for example, strike vs. striking). Keyword searches are also susceptible to optical character recognition (OCR) scanning processes, which can introduce random errors into the text of documents (often referred to as noisy text).
A concept search can overcome these challenges by employing word sense disambiguation (WSD) and other techniques to derive the actual meanings of words and their underlying concepts, rather than simply matching character strings as keyword search technologies do.
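As a rough illustration of word sense disambiguation, the sketch below picks a sense for the word fire by counting the words that a context sentence shares with each sense's description (a simplified, Lesk-style overlap heuristic). The sense labels, glosses, and stopword list are hand-written assumptions made up for this example; they do not come from any real lexicon, and production systems use far richer resources and models.

```python
import re

# Hypothetical, hand-written sense glosses for "fire" (illustration only).
SENSES = {
    "combustion": "flames heat and smoke produced when something burns",
    "dismiss": "terminate the employment of a worker or employee from a job",
    "launch": "shoot or launch a weapon bullet or rocket",
}

# Small assumed stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "and", "the", "of", "or", "to", "from", "when", "his", "in"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Return the sense label whose gloss shares the most words with the context."""
    ctx = tokens(context)
    return max(senses, key=lambda label: len(ctx & tokens(senses[label])))


if __name__ == "__main__":
    print(disambiguate("The manager decided to fire the employee from his job"))
    print(disambiguate("Smoke and flames rose from the fire"))
```

A keyword search would treat both sentences as plain matches for the string fire, whereas even this crude overlap count separates the two senses.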
[...]


In general, information retrieval research and technology can be divided into two broad categories: semantic and statistical. Information retrieval systems in the semantic category attempt to implement some degree of syntactic and semantic analysis of the natural language text that a human user would provide (see also computational linguistics). Systems in the statistical category find results based on statistical measures of how closely they match the query. However, systems in the semantic category also often rely on statistical methods to help them find and retrieve information.
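To make the statistical category concrete, here is a minimal sketch that ranks documents by the cosine similarity between raw term-frequency vectors of the query and of each document. This is one common statistical measure chosen for illustration, not a method attributed to any particular system described here; the function names and the simple tokenizer are assumptions.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Map each lowercase word in the text to its raw frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (score, document) pairs sorted from most to least similar to the query."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["boating on the river", "stock market report"]
    for score, doc in rank("river boating", docs):
        print(f"{score:.3f}  {doc}")
```

Note that this measure still depends on shared surface terms, which is why purely statistical matching of this kind inherits the synonymy and polysemy problems of keyword search unless it is combined with other techniques.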
Efforts to provide information retrieval systems with semantic processing capabilities have generally used three approaches:
  • Auxiliary structures
  • Local co-occurrence statistics
  • Transform techniques (particularly matrix decompositions)

Auxiliary Structures

A variety of techniques based on artificial intelligence (AI) and natural language processing (NLP) have been applied to semantic processing, and most of them have relied on auxiliary structures such as controlled vocabularies and ontologies. Controlled vocabularies (dictionaries and thesauri) and ontologies allow broader terms, narrower terms, and related terms to be incorporated into queries. Controlled vocabularies are one way to overcome some of the most severe constraints of Boolean keyword queries. Over the years, additional auxiliary structures of general interest, such as the large synonym sets of WordNet, have been constructed. It has been shown that concept search based on auxiliary structures such as WordNet can be implemented efficiently by reusing the retrieval models and data structures of classical information retrieval. Later approaches have implemented grammars to expand the range of semantic constructs. Data models that represent sets of concepts within a specific domain (domain ontologies), and that can incorporate the relationships among terms, have also been created in recent years.
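The way a controlled vocabulary brings broader, narrower, and related terms into a query can be sketched as follows. The tiny thesaurus below is a hypothetical, hand-built stand-in for a real resource such as WordNet, and the function name and relation labels are assumptions made for this example.

```python
# Hypothetical controlled vocabulary: each entry lists broader, narrower,
# and related terms. A real system would load these from a thesaurus or ontology.
THESAURUS = {
    "car": {
        "broader": ["vehicle"],
        "narrower": ["sedan"],
        "related": ["automobile"],
    },
    "boat": {
        "broader": ["watercraft"],
        "narrower": ["canoe"],
        "related": ["ship"],
    },
}


def expand_query(terms, thesaurus=THESAURUS,
                 relations=("broader", "narrower", "related")):
    """Return the query terms plus their thesaurus neighbours, without duplicates.

    Terms that are not in the thesaurus are kept unchanged.
    """
    expanded = []
    for term in terms:
        entry = thesaurus.get(term, {})
        neighbours = [t for rel in relations for t in entry.get(rel, [])]
        for candidate in [term] + neighbours:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(["car", "repair"]))
    print(expand_query(["boat"], relations=("related",)))
```

Because the expanded query is still a list of terms, it can be passed to an ordinary keyword index, which is in the spirit of reusing classical retrieval data structures as described above.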
[...]


  • eDiscovery - Concept-based search technologies are increasingly being used for Electronic Document Discovery (EDD or eDiscovery) to help enterprises prepare for litigation. In eDiscovery, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis is much more efficient than traditional linear review techniques. Concept-based searching is becoming accepted as a reliable and efficient search method that is more likely to produce relevant results than keyword or Boolean searches.
  • Enterprise Search and Enterprise Content Management (ECM) - Concept search technologies are being widely used in enterprise search. As the volume of information within the enterprise grows, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis has become essential. In 2004 the Gartner Group estimated that professionals spend 30 percent of their time searching, retrieving, and managing information. The research company IDC found that a 2,000-employee corporation can save up to $30 million per year by reducing the time employees spend trying to find information and duplicating existing documents.
  • Content-Based Image Retrieval (CBIR) - Content-based approaches are being used for the semantic retrieval of digitized images and video from large visual corpora. One of the earliest content-based image retrieval systems to address the semantic problem was the ImageScape search engine. In this system, the user could make direct queries for multiple visual objects such as sky, trees, and water by using spatially positioned icons in a WWW index containing more than ten million images and videos using keyframes. The system used information theory to determine the best features for minimizing uncertainty in the classification. The semantic gap is often mentioned in regard to CBIR. It refers to the gap between the information that can be extracted from visual data and the interpretation that the same data have for a user in a given situation. The ACM SIGMM Workshop on Multimedia Information Retrieval is dedicated to studies of CBIR.
  • Multimedia and Publishing - Concept search is used by the multimedia and publishing industries to provide users with access to news, technical information, and subject matter expertise coming from a variety of unstructured sources. Content-based methods for multimedia information retrieval (MIR) have become especially important when text annotations are missing or incomplete.
  • Digital Libraries and Archives - Images, videos, music, and text items in digital libraries and digital archives are being made accessible to large groups of users (especially on the Web) through the use of concept search techniques. For example, the Executive Daily Brief (EDB), a business information monitoring and alerting product developed by EBSCO Publishing, uses concept search technology to provide corporate end users with access to a digital library containing a wide array of business content. In a similar manner, the Music Genome Project spawned Pandora, which employs concept searching to spontaneously create individual music libraries or virtual radio stations.
  • Genomic Information Retrieval (GIR) - GIR applies concept search techniques to genomic literature databases to overcome the ambiguities of scientific literature.
  • Human Resources Staffing and Recruiting - Many human resources staffing and recruiting organizations have adopted concept search technologies to produce resume search results that match candidates more accurately than loosely related keyword results.
[...]


The effectiveness of a concept search can depend on a variety of elements including the dataset being searched and the search engine that is used to process queries and display results. However, most concept search engines work best for certain kinds of queries:
  • Effective queries are composed of enough text to adequately convey the intended concepts. Effective queries may include full sentences, paragraphs, or even entire documents. Queries composed of just a few words are not as likely to return the most relevant results.
  • Effective queries do not include concepts that are not the object of the search. Including too many unrelated concepts in a query can negatively affect the relevancy of the result items. For example, searching for information about boating on the Mississippi River would be more likely to return relevant results than a search for boating on the Mississippi River on a rainy day in the middle of the summer in 1967.
  • Effective queries are expressed in a full-text, natural language style similar to that of the documents being searched. For example, queries composed of excerpts from an introductory science textbook would not be as effective for concept searching if the dataset being searched is made up of advanced, college-level science texts. Substantial queries that better represent the overall concepts, styles, and language of the items being sought are generally more effective.
As with all search strategies, experienced searchers generally refine their queries through multiple searches, starting with an initial seed query to obtain conceptually relevant results that can then be used to compose or refine additional queries for increasingly relevant results. Depending on the search engine, using query concepts found in result documents can be as easy as selecting a document and performing a find similar function. Changing a query by adding terms and concepts to improve result relevance is called query expansion. The use of ontologies such as WordNet has been studied to expand queries with conceptually related words.
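One way to picture refining a seed query from a selected result document is sketched below: the most frequent new content words in the chosen document are appended to the query. This is a simple term-frequency heuristic assumed for illustration only, not the find similar implementation of any particular search engine; the function name, the stopword list, and the parameter k are hypothetical.

```python
import re
from collections import Counter

# Small assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "with", "for"}


def expand_from_result(query, result_text, k=3):
    """Append the k most frequent new content words of a result document to the query."""
    query_terms = re.findall(r"[a-z]+", query.lower())
    counts = Counter(
        w
        for w in re.findall(r"[a-z]+", result_text.lower())
        if w not in STOPWORDS and w not in query_terms
    )
    new_terms = [word for word, _ in counts.most_common(k)]
    return query_terms + new_terms


if __name__ == "__main__":
    seed = "river boating"
    selected = (
        "Boating on the Mississippi river: barges and towboats "
        "share the Mississippi channel with barges."
    )
    print(expand_from_result(seed, selected, k=2))
```

Running the refined query and repeating the step with a newly selected document mirrors the iterative seed-and-refine process described above.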
[...]
