In general, information retrieval research and technology can be divided into two broad categories: semantic and statistical. Systems in the semantic category attempt some degree of syntactic and semantic analysis of the natural language text that a human user would provide (see also computational linguistics). Systems in the statistical category find results based on statistical measures of how closely documents match the query. Note, however, that systems in the semantic category often also rely on statistical methods to help them find and retrieve information.
Efforts to provide information retrieval systems with semantic processing capabilities have generally used three approaches:
  • Auxiliary structures
  • Local co-occurrence statistics
  • Transform techniques (particularly matrix decompositions)

Auxiliary Structures

A variety of techniques based on artificial intelligence (AI) and natural language processing (NLP) have been applied to semantic processing, and most of them rely on auxiliary structures such as controlled vocabularies and ontologies. Controlled vocabularies (dictionaries and thesauri) and ontologies allow broader terms, narrower terms, and related terms to be incorporated into queries, and they are one way to overcome some of the most severe constraints of Boolean keyword queries. Over the years, additional auxiliary structures of general interest, such as the large synonym sets of WordNet, have been constructed. It has been shown that concept search based on auxiliary structures such as WordNet can be implemented efficiently by reusing the retrieval models and data structures of classical information retrieval. Later approaches have implemented grammars to expand the range of semantic constructs. Data models that represent sets of concepts within a specific domain (domain ontologies), and that can incorporate the relationships among terms, have also been implemented in recent years.
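
The following is a minimal sketch of vocabulary-based query expansion using WordNet through the NLTK library (an assumption for illustration; the helper name and setup are hypothetical and not taken from any particular system). It gathers synonyms and broader (hypernym) terms that could be added to a keyword query:

    # Requires NLTK with the WordNet corpus installed: nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def expand_query(term):
        """Collect synonyms and broader (hypernym) terms for `term`.
        A hypothetical helper; real systems would weight or filter
        these expansions rather than adding them wholesale."""
        expansions = set()
        for synset in wn.synsets(term):
            expansions.update(lemma.name() for lemma in synset.lemmas())
            for hypernym in synset.hypernyms():  # broader terms
                expansions.update(lemma.name() for lemma in hypernym.lemmas())
        return expansions - {term}

    print(sorted(expand_query("car")))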

Handcrafted controlled vocabularies contribute to the efficiency and comprehensiveness of information retrieval and related text-analysis operations, but they work best when topics are narrowly defined and the terminology is standardized. They require extensive human input and oversight to keep up with the rapid evolution of language, and they are not well suited to the growing volumes of unstructured text that cover an unlimited number of topics and contain thousands of unique terms, because new terms and topics constantly need to be introduced. Controlled vocabularies are also prone to capturing a particular world view at a specific point in time, which makes them difficult to modify when the concepts in a topic area change.

Local Co-Occurrence Statistics

Information retrieval systems incorporating this approach count the number of times that groups of terms appear together (co-occur) within a sliding window of terms or sentences (for example, ±5 sentences or ±50 words) within a document. The approach is based on the idea that words occurring in similar contexts have similar meanings. It is local in the sense that the sliding window used to determine the co-occurrence of terms is relatively small.
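
The sketch below illustrates the counting step, assuming a pre-tokenized document and a ±5-word window (the function name and window size are illustrative choices, not drawn from any particular system):

    from collections import Counter

    def cooccurrence_counts(tokens, window=5):
        """Count how often each unordered pair of terms co-occurs
        within +/- `window` word positions (a hypothetical helper)."""
        counts = Counter()
        for i, term in enumerate(tokens):
            # Look ahead only, so each symmetric pair is counted once.
            for other in tokens[i + 1 : i + 1 + window]:
                if term != other:
                    counts[tuple(sorted((term, other)))] += 1
        return counts

    tokens = ("the bank approved the loan after the bank "
              "reviewed the credit report").split()
    print(cooccurrence_counts(tokens).most_common(3))
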
This approach is simple, but it captures only a small portion of the semantic information contained in a collection of text. At the most basic level, numerous experiments have shown that only about a quarter of the information contained in text is local in nature. In addition, to be most effective, the method requires prior knowledge of the content of the text, which can be difficult to obtain for large, unstructured document collections.

Transform Techniques

Some of the most powerful approaches to semantic processing are based on the use of mathematical transform techniques. Among these, matrix decomposition techniques have been the most successful. Widely used matrix decompositions include the following:
  • Independent component analysis (ICA)
  • Semi-discrete decomposition
  • Non-negative matrix factorization (NMF)
  • Singular value decomposition (SVD)

Matrix decomposition techniques are data-driven, which avoids many of the drawbacks associated with auxiliary structures. They are also global in nature, which makes them capable of much more robust extraction and representation of semantic information than techniques based on local co-occurrence statistics.
Independent component analysis works well with data of limited variability, while the semi-discrete and non-negative matrix approaches sacrifice accuracy of representation in order to reduce computational complexity.
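
As a brief illustration of the non-negative approach, the sketch below factors a small term-document matrix with scikit-learn's NMF implementation (the toy corpus and the choice of two components are illustrative assumptions, not a description of any system discussed here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "the car is driven on the road",
        "the truck is driven on the highway",
        "stocks fell as markets reacted to the report",
    ]

    # Factor the non-negative term-document matrix into document-topic
    # and topic-term factors; each topic is an additive mix of terms.
    X = TfidfVectorizer().fit_transform(docs)
    model = NMF(n_components=2, init="nndsvd")
    doc_topics = model.fit_transform(X)   # documents x topics
    topic_terms = model.components_       # topics x terms
    print(doc_topics.round(2))
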
Singular value decomposition (SVD) was first applied to text at Bell Labs in the late 1980s. It served as the foundation for a technique called Latent Semantic Indexing (LSI), so named because of its ability to find the semantic meaning that is latent in a collection of text. At first, SVD was slow to be adopted because of the resources required to work with large datasets. However, the use of LSI has expanded significantly in recent years as earlier challenges in scalability and performance have been overcome. LSI is used in a variety of information retrieval and text-processing applications, although its primary applications have been concept searching and automated document categorization.
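
A minimal LSI sketch, assuming scikit-learn is available (the toy corpus, query, and choice of two latent dimensions are illustrative only): a term-document matrix is factored with a truncated SVD, and a query is folded into the same latent space, the mechanism by which LSI can match documents that share no keyword with the query.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the car is driven on the road",
        "the automobile was parked in the garage",
        "stocks fell as markets reacted to the report",
    ]

    # Build the term-document matrix and factor it with a truncated SVD,
    # the decomposition underlying Latent Semantic Indexing.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    lsi = TruncatedSVD(n_components=2, random_state=0)
    doc_vectors = lsi.fit_transform(X)

    # Fold the query into the latent space and rank documents by cosine
    # similarity; in a larger corpus, shared contexts give terms like
    # "car" and "automobile" similar latent representations.
    query = lsi.transform(vectorizer.transform(["car"]))
    print(cosine_similarity(query, doc_vectors).round(2))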
