A concept search (or conceptual search) is an automated information retrieval method that is used to search electronically stored unstructured text (for example, digital archives, email, scientific literature, etc.) for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.
[...]

Concept search techniques were developed because of limitations imposed by classical Boolean keyword search technologies when dealing with large, unstructured digital collections of text. Keyword searches often return results that include many non-relevant items (false positives) or that exclude too many relevant items (false negatives) because of the effects of synonymy and polysemy. Synonymy means that two or more words in the same language share the same meaning, and polysemy means that a single word often has more than one meaning.
Polysemy is a major obstacle for all computer systems that attempt to deal with human language. In English, the most frequently used terms have several common meanings. For example, the word fire can mean a combustion activity; to terminate employment; to launch; or to excite (as in "fire up"). For the 200 most-polysemous terms in English, the typical verb has more than twelve common meanings, or senses. The typical noun from this set has more than eight common senses. For the 2,000 most-polysemous terms in English, the typical verb has more than eight common senses and the typical noun has more than five.
In addition to the problems of polysemy and synonymy, keyword searches can exclude inadvertently misspelled words as well as variations on the stems (or roots) of words (for example, strike vs. striking). Keyword searches are also susceptible to errors introduced by optical character recognition (OCR) scanning, which can introduce random errors (often referred to as noisy text) into the text of documents.
A concept search can overcome these challenges by employing word sense disambiguation (WSD), and other techniques, to derive the actual meanings of the words and their underlying concepts, rather than simply matching character strings as keyword search technologies do.
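To make the polysemy problem concrete, the sketch below uses the simplified Lesk word sense disambiguation algorithm from the NLTK library to pick a sense of "fire" from its context. This only illustrates WSD in general; it is not the method used by any particular concept search engine, and the two example sentences are invented for illustration:

    # Word sense disambiguation with NLTK's simplified Lesk implementation.
    # Requires: pip install nltk, then nltk.download('wordnet').
    from nltk.wsd import lesk

    # Two contexts in which the polysemous word "fire" appears.
    context_a = "the board decided to fire the manager after the audit".split()
    context_b = "the campers lit a fire to keep warm overnight".split()

    sense_a = lesk(context_a, "fire")   # sense whose gloss best overlaps context_a
    sense_b = lesk(context_b, "fire")   # sense whose gloss best overlaps context_b

    for sense in (sense_a, sense_b):
        if sense is not None:           # lesk() returns None if nothing overlaps
            print(sense.name(), "-", sense.definition())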
[...]

In general, information retrieval research and technology can be divided into two broad categories: semantic and statistical. Information retrieval systems that fall into the semantic category will attempt to implement some degree of syntactic and semantic analysis of the natural language text that a human user would provide (also see computational linguistics). Systems that fall into the statistical category will find results based on statistical measures of how closely they match the query. It must be noted, however, that systems in the semantic category also often rely on statistical methods to help them find and retrieve information.
Efforts to provide information retrieval systems with semantic processing capabilities have basically used three different approaches:
  • Auxiliary structures
  • Local co-occurrence statistics
  • Transform techniques (particularly matrix decompositions)

Auxiliary Structures

A variety of techniques based on Artificial Intelligence (AI) and Natural Language Processing (NLP) have been applied to semantic processing, and most of them have relied on the use of auxiliary structures such as controlled vocabularies and ontologies. Controlled vocabularies (dictionaries and thesauri) and ontologies allow broader terms, narrower terms, and related terms to be incorporated into queries. Controlled vocabularies are one way to overcome some of the most severe constraints of Boolean keyword queries. Over the years, additional auxiliary structures of general interest, such as the large synonym sets of WordNet, have been constructed. It has been shown that concept search based on auxiliary structures such as WordNet can be implemented efficiently by reusing the retrieval models and data structures of classical information retrieval. Later approaches have implemented grammars to expand the range of semantic constructs. The creation of data models that represent sets of concepts within a specific domain (domain ontologies), which can incorporate the relationships among terms, has also been implemented in recent years.
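As a hedged illustration of how an auxiliary structure can broaden a query, the sketch below uses NLTK's interface to WordNet's synonym sets to gather synonyms and broader (hypernym) terms for a single query word; production systems would add sense disambiguation and term weighting on top of this:

    # Query-term expansion with WordNet synsets via NLTK (illustrative only).
    # Requires: pip install nltk, then nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    def expand_term(term):
        """Collect synonyms and broader (hypernym) terms for a query word."""
        expansions = set()
        for synset in wn.synsets(term):
            expansions.update(lemma.name() for lemma in synset.lemmas())
            for hypernym in synset.hypernyms():
                expansions.update(lemma.name() for lemma in hypernym.lemmas())
        expansions.discard(term)
        return sorted(expansions)

    print(expand_term("boat"))   # typically includes broader terms such as "vessel"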
[...]

  • eDiscovery - Concept-based search technologies are increasingly being used for Electronic Document Discovery (EDD or eDiscovery) to help enterprises prepare for litigation. In eDiscovery, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis is much more efficient than traditional linear review techniques. Concept-based searching is becoming accepted as a reliable and efficient search method that is more likely to produce relevant results than keyword or Boolean searches.
  • Enterprise Search and Enterprise Content Management (ECM) - Concept search technologies are being widely used in enterprise search. As the volume of information within the enterprise grows, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis has become essential. In 2004 the Gartner Group estimated that professionals spend 30 percent of their time searching, retrieving, and managing information. The research company IDC found that a 2,000-employee corporation can save up to $30 million per year by reducing the time employees spend trying to find information and duplicating existing documents.
  • Content-Based Image Retrieval (CBIR) - Content-based approaches are being used for the semantic retrieval of digitized images and video from large visual corpora. One of the earliest content-based image retrieval systems to address the semantic problem was the ImageScape search engine. In this system, the user could make direct queries for multiple visual objects such as sky, trees, water, etc. using spatially positioned icons in a WWW index containing more than ten million images and videos using keyframes. The system used information theory to determine the best features for minimizing uncertainty in the classification. The semantic gap is often mentioned in regard to CBIR. The semantic gap refers to the gap between the information that can be extracted from visual data and the interpretation that the same data have for a user in a given situation. The ACM SIGMM Workshop on Multimedia Information Retrieval is dedicated to studies of CBIR.
  • Multimedia and Publishing - Concept search is used by the multimedia and publishing industries to provide users with access to news, technical information, and subject matter expertise coming from a variety of unstructured sources. Content-based methods for multimedia information retrieval (MIR) have become especially important when text annotations are missing or incomplete.
  • Digital Libraries and Archives - Images, videos, music, and text items in digital libraries and digital archives are being made accessible to large groups of users (especially on the Web) through the use of concept search techniques. For example, the Executive Daily Brief (EDB), a business information monitoring and alerting product developed by EBSCO Publishing, uses concept search technology to provide corporate end users with access to a digital library containing a wide array of business content. In a similar manner, the Music Genome Project spawned Pandora, which employs concept searching to spontaneously create individual music libraries or virtual radio stations.
  • Genomic Information Retrieval (GIR) - GIR applies concept search techniques to genomic literature databases to overcome the ambiguities of scientific literature.
  • Human Resources Staffing and Recruiting - Many human resources staffing and recruiting organizations have adopted concept search technologies to produce resume search results that surface more accurate and relevant candidates than loosely related keyword matches.
[...]

The effectiveness of a concept search can depend on a variety of elements including the dataset being searched and the search engine that is used to process queries and display results. However, most concept search engines work best for certain kinds of queries:
  • Effective queries are composed of enough text to adequately convey the intended concepts. Effective queries may include full sentences, paragraphs, or even entire documents. Queries composed of just a few words are not as likely to return the most relevant results.
  • Effective queries do not include concepts that are not the object of the search. Including too many unrelated concepts in a query can negatively affect the relevancy of the result items. For example, searching for information about boating on the Mississippi River would be more likely to return relevant results than a search for boating on the Mississippi River on a rainy day in the middle of the summer in 1967.
  • Effective queries are expressed in a full-text, natural language style similar to that of the documents being searched. For example, using queries composed of excerpts from an introductory science textbook would not be as effective for concept searching if the dataset being searched is made up of advanced, college-level science texts. Substantial queries that better represent the overall concepts, styles, and language of the items for which the query is being conducted are generally more effective.
As with all search strategies, experienced searchers generally refine their queries through multiple searches, starting with an initial seed query to obtain conceptually relevant results that can then be used to compose and/or refine additional queries for increasingly relevant results. Depending on the search engine, using query concepts found in result documents can be as easy as selecting a document and performing a find similar function. Changing a query by adding terms and concepts to improve result relevance is called query expansion. The use of ontologies such as WordNet has been studied to expand queries with conceptually related words.
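One simple way to picture the "find similar" step described above is to treat a chosen result document as the next, much longer query and rank the collection by vector similarity. The sketch below does this with TF-IDF vectors and cosine similarity from scikit-learn; it is a purely statistical stand-in for concept matching, and the three documents are invented:

    # "Find similar": reuse a whole result document as the next query and rank
    # the collection by TF-IDF cosine similarity (illustrative only).
    # Requires: pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Boating and fishing guide for the upper Mississippi River.",
        "River barge traffic statistics for the Mississippi basin.",
        "A history of jazz music in New Orleans.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(documents)

    seed = documents[0]                       # the document picked via "find similar"
    seed_vector = vectorizer.transform([seed])
    scores = cosine_similarity(seed_vector, doc_vectors).ravel()

    # Rank the collection by similarity to the seed document.
    for score, text in sorted(zip(scores, documents), reverse=True):
        print(f"{score:.2f}  {text}")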
[...]

Relevance feedback is a feature that helps users determine if the results returned for their queries meet their information needs. In other words, relevance is assessed relative to an information need, not a query: a document is relevant if it addresses the stated information need, not simply because it happens to contain all the words in the query. Relevance feedback is a way to involve users in the retrieval process in order to improve the final result set. Users can refine their queries based on their initial results to improve the quality of their final results.
In general, concept search relevance refers to the degree of similarity between the concepts expressed in the query and the concepts contained in the results returned for the query. The more similar the concepts in the results are to the concepts contained in the query, the more relevant the results are considered to be. Results are usually ranked and sorted by relevance so that the most relevant results are at the top of the list of results and the least relevant results are at the bottom of the list.
Relevance feedback has been shown to be very effective at improving the relevance of results. A concept search decreases the risk of missing important result items because all of the items that are related to the concepts in the query will be returned whether or not they contain the same words used in the query.
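The text above does not prescribe a particular feedback formula, but a classical way to make the idea concrete is the Rocchio algorithm, which nudges a vector-space query toward judged-relevant documents and away from non-relevant ones. The sketch below assumes simple term-weight vectors over a toy vocabulary:

    # Rocchio relevance feedback (classical textbook formulation, shown only to
    # illustrate how user feedback can reshape a query vector).
    # Requires: pip install numpy
    import numpy as np

    def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Return an updated query vector given judged relevant/non-relevant docs."""
        updated = alpha * query_vec
        if len(relevant):
            updated = updated + beta * np.mean(relevant, axis=0)
        if len(nonrelevant):
            updated = updated - gamma * np.mean(nonrelevant, axis=0)
        return np.clip(updated, 0.0, None)    # negative term weights are usually dropped

    # Toy 4-term vocabulary: [boat, river, weather, music]
    query = np.array([1.0, 1.0, 0.0, 0.0])
    relevant_docs = np.array([[0.9, 0.8, 0.1, 0.0]])
    nonrelevant_docs = np.array([[0.0, 0.1, 0.0, 0.9]])
    print(rocchio(query, relevant_docs, nonrelevant_docs))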
Ranking will continue to be a part of any modern information retrieval system. However, the problems of heterogeneous data, scale, and non-traditional discourse types reflected in the text, along with the fact that search engines will increasingly be integrated components of complex information management processes, not just stand-alone systems, will require new kinds of system responses to a query. For example, one of the problems with ranked lists is that they might not reveal relations that exist among some of the result items.
[...]

  1. Result items should be relevant to the information need expressed by the concepts contained in the query statements, even if the terminology used by the result items is different from the terminology used in the query.
  2. Result items should be sorted and ranked by relevance.
  3. Relevant result items should be quickly located and displayed. Even complex queries should return relevant results fairly quickly.
  4. Query length should be non-fixed, i.e., a query can be as long as deemed necessary. A sentence, a paragraph, or even an entire document can be submitted as a query.
  5. A concept query should not require any special or complex syntax. The concepts contained in the query can be clearly and prominently expressed without using any special rules.
  6. Combined queries using concepts, keywords, and metadata should be allowed.
  7. Relevant portions of result items should be usable as query text simply by selecting the item and telling the search engine to find similar items.
  8. Query-ready indexes should be created relatively quickly.
  9. The search engine should be capable of performing federated searches. Federated searching enables a concept query to search multiple data sources simultaneously; the results are then merged, sorted, and displayed together (a sketch follows this list).
  10. A concept search should not be affected by misspelled words, typographical errors, or OCR scanning errors in either the query text or in the text of the dataset being searched.
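The federated search requirement in item 9 can be sketched as a small fan-out-and-merge loop: the same query goes to several sources in parallel, and the per-source hits are combined into one ranked list. The source names, the stub search function, and the scores below are hypothetical placeholders rather than any real API:

    # Illustrative federated search: query several sources, merge, and sort.
    from concurrent.futures import ThreadPoolExecutor

    def search_source(source, query):
        """Stand-in for a per-source search call returning (score, title) pairs."""
        return [(0.5, f"{source}: sample hit for '{query}'")]

    def federated_search(query, sources):
        with ThreadPoolExecutor() as pool:
            per_source = pool.map(lambda src: search_source(src, query), sources)
        merged = [hit for hits in per_source for hit in hits]
        return sorted(merged, reverse=True)   # best-scoring hits first

    for score, title in federated_search("boating on the Mississippi River",
                                         ["archive", "email", "intranet"]):
        print(f"{score:.2f}  {title}")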
[...]

Formalized search engine evaluation has been ongoing for many years. For example, the Text REtrieval Conference (TREC) was started in 1992 to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Most of today's commercial search engines include technology first developed in TREC.
In 1997, a Japanese counterpart of TREC was launched, called National Institute of Informatics Test Collection for IR Systems (NTCIR). NTCIR conducts a series of evaluation workshops for research in information retrieval, question answering, text summarization, etc. A European series of workshops called the Cross Language Evaluation Forum (CLEF) was started in 2001 to aid research in multilingual information access. In 2002, the Initiative for the Evaluation of XML Retrieval (INEX) was established for the evaluation of content-oriented XML retrieval systems.
Precision and recall have been two of the traditional performance measures for evaluating information retrieval systems. Precision is the fraction of the retrieved result documents that are relevant to the user's information need. Recall is defined as the fraction of relevant documents in the entire collection that are returned as result documents.
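A small worked example of these two measures, using made-up document identifiers in place of real relevance judgements:

    # Precision and recall from sets of document IDs (made-up for illustration).
    relevant = {"d1", "d2", "d3", "d4"}    # judged relevant to the information need
    retrieved = {"d2", "d3", "d5"}         # returned by the system under evaluation

    true_positives = relevant & retrieved
    precision = len(true_positives) / len(retrieved)   # 2/3 ~ 0.67
    recall = len(true_positives) / len(relevant)       # 2/4 = 0.50
    print(f"precision={precision:.2f} recall={recall:.2f}")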
Although the workshops and publicly available test collections used for search engine testing and evaluation have provided substantial insights into how information is managed and retrieved, the field has only scratched the surface of the challenges people and organizations face in finding, managing, and using information now that so much information is available. Scientific data about how people use the information tools available to them today is still incomplete because experimental research methodologies haven't been able to keep up with the rapid pace of change. Many challenges, such as contextualized search, personal information management, information integration, and task support, still need to be addressed.
[...]

In Internet marketing, search advertising is a method of placing online advertisements on Web pages that show results from search engine queries. Through the same search-engine advertising services, ads can also be placed on Web pages with other published content.
Search advertisements are targeted to match key search terms (called keywords) entered on search engines. This targeting ability has contributed to the attractiveness of search advertising for advertisers. Consumers will often use a search engine to identify and compare purchasing options immediately before making a purchasing decision. The opportunity to present consumers with advertisements tailored to their immediate buying interests encourages consumers to click on search ads instead of unpaid search results, which are often less relevant. Unpaid search results are also called organic results.

Origins

It is believed that Yahoo! first introduced a stand-alone search advertising buy. In 1996, Chip Royce, head of online marketing for InterZine Productions of Boca Raton, Florida, approached Yahoo!'s sales agent (Softbank Interactive Media Sales) seeking more effective, targeted advertising within Yahoo!'s search results. Yahoo! obliged, placing targeted ad banners when the keyword "Golf" was searched by Yahoo! users. Yahoo! later turned this opportunity into a formal marketing program for its entire customer base and promoted it in a July 1996 article in a now-defunct magazine.
[...]

Search advertising is sold and delivered on the basis of keywords. The user of a search engine enters keywords to make queries. A keyword may consist of more than one word.
Search engines conduct running auctions to sell ads according to bids received for keywords and the relative relevance of user keywords to ads in the inventory. The keyword "home mortgage refinancing" is more expensive than one that is in less demand, such as "used bicycle tires." Profit potential of the keywords also plays into the bids for ads that advertisers want displayed when the keywords are searched by the user. For example, "used book" may be a popular keyword but may have low profit potential, and advertiser bids will reflect that.
Search engines build indexes of Web pages using a web crawler. When the publisher of a Web page arranges with a search engine firm to have ads served up on that page, the search engine applies its indexing technology to associate the content of that page with keywords. Those keywords are then fed into the same auctioning system that is used by advertisers to buy ads on both search engine results pages and content pages. Advertising based on keywords in the surrounding content or context is referred to as contextual advertising. This is usually less profitable than search advertising, which is based on user intent expressed through their keywords.
Advertisers can choose whether to buy ads on search result pages (search advertising), published content pages (contextual advertising), or both. Bids on the same keywords are usually higher in search advertising than in contextual advertising.
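The exact auction mechanics and quality factors are proprietary, but the interplay of bid and relevance described above can be sketched as a simple "bid times quality" ranking with invented numbers; real systems (for example, generalized second-price auctions) are considerably more involved:

    # Illustrative ad ranking by bid multiplied by a relevance/quality score.
    ads = [
        {"advertiser": "LenderCo", "bid": 4.00, "quality": 0.3},
        {"advertiser": "RefiNow",  "bid": 2.50, "quality": 0.8},
        {"advertiser": "TireShed", "bid": 0.20, "quality": 0.9},
    ]

    # A high bid alone does not guarantee the top slot if relevance is poor.
    for ad in sorted(ads, key=lambda a: a["bid"] * a["quality"], reverse=True):
        print(ad["advertiser"], round(ad["bid"] * ad["quality"], 2))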
[...]

Search advertising activities can be measured in five ways:
CPM: Cost per thousand viewers was the original method used for pricing online advertisements. CPM remains the most common method for pricing banner ads.
CTR: Click-through rates measure the number of times an ad is clicked as a percentage of views of the Web page on which the ad appears. Banner ads have CTRs that are generally 0.5 percent or less. In comparison, individual search engine ads can have CTRs of 10 percent, even though they appear alongside organic search results and competing paid search advertisements.
CPA: Cost per action quantifies costs for completing specified activities such as attracting a new customer or making a sale. Affiliate networks operate on a CPA basis. CPA systems function most effectively when sales cycles are short and easily tracked. Longer sales cycles rely on exposure to multiple types of ads to create brand awareness and purchasing interest before a sale is made. Longer sales cycles and sales requiring multiple customer contacts can be difficult to track, leading to a reluctance by publishers to participate in CPA programs beyond initial lead generation.
CPC: Cost per click tracks the cost of interacting with a client or potential client. In traditional marketing, CPC is viewed as a one-way process of reaching target audiences through means such as direct mail, radio ads and television ads. Search advertising provides opportunities for two-way contacts through web-based chat, Internet-based calls, call-back requests or mailing list signups.
TM: Total minutes is a metric being used by Nielsen/NetRatings to measure total time spent on a Web page rather than the number of Web page views. On July 10, 2007, Nielsen announced that they would be relying on TM as their primary metric for measuring Web page popularity, due to changes in the way Web pages provide content through audio and video streaming and by refreshing the same page without totally reloading it. Page refreshes are one aspect of Rich Internet Applications (RIA). RIA technologies include AJAX (Asynchronous JavaScript and XML) and Microsoft Silverlight.
Methodological questions regarding the use of total minutes for search advertising include how to account for Internet users who keep several browser windows open simultaneously, or who simply leave one window open unattended for long periods of time. Another question involves tracking total minutes on HTML pages that are stateless and therefore do not generate server-side data on the length of time that they are viewed.
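A short worked example, with invented campaign numbers, showing how the CPM, CTR, and CPC figures relate arithmetically:

    # Relating CPM, CTR, and CPC (all figures invented for illustration).
    impressions = 200_000               # times the ad was shown
    clicks = 1_600                      # times the ad was clicked
    spend = 800.00                      # total cost in dollars

    cpm = spend / impressions * 1_000   # cost per thousand impressions -> $4.00
    ctr = clicks / impressions * 100    # click-through rate            -> 0.80 %
    cpc = spend / clicks                # cost per click                -> $0.50
    print(f"CPM=${cpm:.2f}  CTR={ctr:.2f}%  CPC=${cpc:.2f}")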
[...]

A metasearch engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. Metasearch engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately.
The term "metasearch" is frequently used to classify a set of commercial search engines, see the list of search engines, but is also used to describe the paradigm of searching multiple data sources in real time. The National Information Standards Organization (NISO) uses the terms Federated Search and Metasearch interchangeably to describe this web search paradigm.

Operation

Architecture of a metasearch engine
Metasearch engines create what is known as a virtual database. They do not compile a physical database or catalogue of the web. Instead, they take a user's request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm.
No two metasearch engines are alike. Some search only the most popular search engines while others also search lesser-known engines, newsgroups, and other databases. They also differ in how the results are presented and the quantity of engines that are used. Some will list results according to search engine or database. Others return results according to relevance, often concealing which search engine returned which results. This benefits the user by eliminating duplicate hits and grouping the most relevant ones at the top of the list.
Search engines frequently expect requests to be submitted in different ways. For example, some search engines allow the usage of the word "AND" while others require "+" and others require only a space to combine words. The better metasearch engines try to translate each request into the appropriate syntax when submitting it.
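The query-translation step described above can be sketched as a simple rewrite of one conjunctive query into each target engine's expected syntax; the three styles follow the examples in the text, and the engine names are placeholders:

    # Rewriting one query into different engines' conjunction syntaxes.
    def translate(terms, style):
        if style == "AND":               # engines that expect explicit AND operators
            return " AND ".join(terms)
        if style == "plus":              # engines that expect a "+" prefix on terms
            return " ".join("+" + t for t in terms)
        return " ".join(terms)           # engines that treat a space as conjunction

    query_terms = ["concept", "search"]
    for engine, style in [("engine_a", "AND"), ("engine_b", "plus"), ("engine_c", "space")]:
        print(engine, "->", translate(query_terms, style))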
[...]

The three most widely used web search engines and their approximate share as of late 2010.
A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list and are often called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or use a mixture of algorithmic and human input.
[...]

Image of definition link provided for many search terms.
Google search consists of a series of localized websites. The largest of those, google.com, is the most-visited website in the world. Some of its features include a definition link for most searches including dictionary words, the number of results found for the search, links to other searches (e.g. for words that Google believes to be misspelled, it provides a link to the search results using its proposed spelling), and more.

Search syntax

Google's search engine normally accepts queries as simple text and breaks up the user's text into a sequence of search terms, which will usually be words that are to occur in the results. One can also use Boolean operators, such as quotation marks (") for a phrase, the prefixes "+" and "-" for qualified terms, or one of several advanced operators, such as "site:". The "Google Search Basics" webpages describe each of these additional queries and options (see below: Search options). Google's Advanced Search web form gives several additional fields which may be used to qualify searches by such criteria as date of first retrieval. All advanced queries transform to regular queries, usually with additional qualified terms.

Query expansion

Google applies query expansion to the submitted search query, transforming it into the query that will actually be used to retrieve results. As with page ranking, the exact details of the algorithm Google uses are deliberately obscure, but certainly the following transformations are among those that occur:
  • Term reordering: in information retrieval this is a standard technique to reduce the work involved in retrieving results. This transformation is invisible to the user, since the results ordering uses the original query order to determine relevance.
  • Stemming is used to increase search quality by matching small syntactic variants of search terms (a short sketch follows this list).
  • There is a limited facility to fix possible misspellings in queries.
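As a hedged illustration of the stemming transformation mentioned above, the sketch below runs a few terms through NLTK's Porter stemmer so that small syntactic variants reduce to a common stem; Google's own pipeline is undisclosed and certainly differs:

    # Stemming query/document terms with the Porter stemmer (illustrative only).
    # Requires: pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["strike", "striking", "boats", "boating"]:
        print(word, "->", stemmer.stem(word))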

"I'm Feeling Lucky"

Google's homepage includes a button labelled "I'm Feeling Lucky". When a user types in a search and clicks on the button, the user is taken directly to the first search result, bypassing the search engine results page. The thought is that if a user is "feeling lucky", the search engine will return the perfect match the first time without having to page through the search results. However, with the introduction of Google Instant, it is not possible to use the button properly unless the Google Instant function is switched off. According to a study by Tom Chavez of "Rapt", this feature costs Google $110 million a year, as 1% of all searches use this feature and bypass all advertising.
On October 30, 2009, for some users, the "I'm Feeling Lucky" button was removed from Google's main page, along with the regular search button. Both buttons were replaced with a field that read, "This space intentionally left blank." This text faded out when the mouse was moved on the page, and normal search functionality was achieved by filling in the search field with the desired terms and pressing enter. A Google spokesperson explained, "This is just a test, and a way for us to gauge whether our users will like an even simpler search interface." Personalized Google homepages retained both buttons and their normal functions.
On May 21, 2010, the 30th anniversary of Pac-Man, the "I'm Feeling Lucky" button was replaced with a button reading the words "Insert Coin". After pressing the button, the user would begin a Google-themed game of Pac-Man in the area where the Google logo would normally be. Pressing the button a second time would begin a two-player version of the same game that includes Ms. Pac-Man for player 2. This version can be accessed at www.google.com/pacman/ as a permanent link to the page.

Rich Snippets

On 12 May 2009, Google announced that they would be parsing the hCard, hReview, and hProduct microformats and using them to populate search result pages with what they called "Rich Snippets".

Special features

Besides the main search-engine feature of searching for text, Google Search has more than 22 "special features" (activated by entering any of dozens of trigger words) when searching:
  • weather – The weather conditions, temperature, wind, humidity, and forecast, for many cities, can be viewed by typing "weather" along with a city for larger cities or city and state, U.S. zip code, or city and country for smaller cities (such as: weather Lawrence, Kansas; weather Paris; weather Bremen, Germany).
  • stock quotes – The market data for a specific company or fund can be viewed, by typing the ticker symbol (or include "stock"), such as: CSCO; MSFT; IBM stock; F stock (lists Ford Motor Co.); or AIVSX (fund). Results show inter-day changes, or 5-year graph, etc. This does not work for stock names which are one letter long, such as Citigroup (C) or Macy's (M) (Ford being an exception), or are common words, such as Diamond Offshore (DO) or Majesco (COOL).
  • time – The current time in many cities (worldwide) can be viewed by typing "time" and the name of the city (such as: time Cairo; time Pratt, KS).
  • sports scores – The scores and schedules for sports teams can be displayed by typing the team name or league name into the search box.
  • unit conversion – Measurements can be converted by entering a phrase such as: 10.5 cm in inches; or 90 km in miles.
  • currency conversion – A money or currency converter can be selected by typing the names or currency codes (listed by ISO 4217): 6789 Euro in USD; 150 GBP in USD; 5000 Yen in USD; 5000 Yuan in lira (the U.S. dollar can be USD or "US$" or "$", while Canadian is CAD, etc.).
  • calculator – Calculation results can be determined, as calculated live, by entering a formula in numbers or words, such as: 6*77 +pi +sqrt(e^3)/888 plus 0.45. The user is given the option to search for the formula after calculation. The calculator also uses the unit and currency conversion functions to allow unit-aware calculations. For example, "(3 EUR/liter) / (40 miles/gallon) in USD / mile" calculates the dollar cost per mile for a 40 mpg car with gas costing 3 euros a liter. The caret "^" raises a number to an exponent power, and percentages are allowed ("40% of 300"). There is also some debate about Google's calculation of 0^0: many mathematicians consider 0^0 undefined, but Google's calculator shows the result as 1.
  • numeric ranges – A set of numbers can be matched by using a double-dot between range numbers (70..73 or 90..100) to match any positive number in the range, inclusive. A leading minus sign is treated as the exclusion operator rather than as part of a negative number.
  • dictionary lookup – A definition for a word or phrase can be found by entering "define" followed by a colon and the word(s) to look up (such as "define:philosophy").
  • maps – Some related maps can be displayed by typing in the name or U.S. ZIP code of a location and the word "map" (such as: New York map; Kansas map; or Paris map).
  • movie showtimes – Reviews or film showtimes can be listed for any movies playing nearby by typing "movies" or the name of any current film into the search box. If a specific location was saved on a previous search, the top search result will display showtimes for nearby theaters for that movie.
  • public data – Trends for population (or unemployment rates) can be found for U.S. states & counties, by typing "population" or "unemployment rate" followed by a state or county name.
  • real estate and housing – Home listings in a given area can be displayed using the trigger words "housing", "home", or "real estate" followed by the name of a city or U.S. zip code.
  • travel data/airports – The flight status for arriving or departing U.S. flights can be displayed by typing in the name of the airline and the flight number into the search box (such as: American airlines 18). Delays at a specific airport can also be viewed (by typing the name of the city or three-letter airport code plus the word "airport").
  • package tracking – Package mail can be tracked by typing the tracking number of a Royal Mail, UPS, FedEx or USPS package directly into the search box. Results will include quick links to track the status of each shipment.
  • patent numbers – U.S. patents can be searched by entering the word "patent" followed by the patent number into the search box (such as: Patent 5123123).
  • area code – The geographical location (for any U.S. telephone area code) can be displayed by typing a 3-digit area code (such as: 650).
  • synonym search – A search can match words similar to those specified, by placing the tilde sign (~) immediately in front of a search term, such as: ~fast food.

Search options

The webpages maintained by the Google Help Center have text describing more than 15 search options. The Google operators:
  • OR – Search for either one, such as "price high OR low" searches for "price" with "high" or "low".
  • "-" – Search while excluding a word, such as "apple -tree" searches where word "tree" is not used.
  • "+" – Force inclusion of a word, such as "Name +of +the Game" to require the words "of" & "the" to appear on a matching page.
  • "*" – Wildcard operator to match any words between other specific words.
Some of the query options are as follows:
  • define: – The query prefix "define:" will provide a definition of the words listed after it.
  • stocks: – After "stocks:" the query terms are treated as stock ticker symbols for lookup.
  • site: – Restrict the results to websites in the given domain, such as site:www.acmeacme.com. The option "site:com" will search all domain URLs named with ".com" (no space after "site:").
  • allintitle: – Only the page titles are searched (not the remaining text on each webpage).
  • intitle: – Prefix to search in a webpage title, such as "intitle:google search", which will list pages with the word "google" in the title and the word "search" anywhere (no space after "intitle:").
  • allinurl: – Only the page URL address lines are searched (not the text inside each webpage).
  • inurl: – Prefix for each word to be found in the URL; other words are matched anywhere, such as "inurl:acme search", which matches "acme" in a URL but matches "search" anywhere (no space after "inurl:").
The page-display options (or query types) are:
  • cache: – Highlights the search-words within the cached document, such as "cache:www.google.com xxx" shows cached content with word "xxx" highlighted.
  • link: – The prefix "link:" will list webpages that have links to the specified webpage, such as "link:www.google.com" lists webpages linking to the Google homepage.
  • related: – The prefix "related:" will list webpages that are "similar" to a specified web page.
  • info: – The prefix "info:" will display some background information about one specified webpage, such as info:www.google.com. Typically, the info is the first text (160 bytes, about 23 words) contained in the page, displayed in the style of a results entry (for just the one page matching the search).
  • filetype: – Results will only show files of the desired type (e.g. filetype:pdf will return PDF files).
Note that Google searches the HTML coding inside a webpage, not the screen appearance: the words displayed on a screen might not be listed in the same order in the HTML coding.

Error messages

Some searches will give a 403 Forbidden error with the text
"We're sorry... ... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now. We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software. We apologize for the inconvenience, and hope we'll see you again on Google."
sometimes followed by a CAPTCHA prompt.
Google's Server Error page
The screen was first reported in 2005, and was a response to the heavy use of Google by search engine optimization companies to check on the ranks of sites they were optimizing. The message is triggered by high volumes of requests from a single IP address. Google apparently uses the Google cookie as part of its determination to refuse service.
In June 2009, after the death of pop superstar Michael Jackson, this message appeared to many internet users who were searching Google for news stories related to the singer, and was assumed by Google to be a DDoS attack, although many queries were submitted by legitimate searchers.

January 2009 malware bug

A screen-shot of the error of January 31, 2009.
Google flags search results with the message "This site may harm your computer" if the site is known to install malicious software in the background or otherwise surreptitiously. Google does this to protect users against visiting sites that could harm their computers. For approximately 40 minutes on January 31, 2009, all search results were mistakenly classified as malware and could therefore not be clicked; instead a warning message was displayed and the user was required to enter the requested URL manually. The bug was caused by human error: the URL of "/" (which expands to all URLs) was mistakenly added to the malware patterns file.

Doodle for Google

On certain occasions, the logo on Google's webpage will change to a special version, known as a "Google Doodle". Clicking on the Doodle links to a string of Google search results about the topic. The first was a reference to the Burning Man Festival in 1998, and others have been produced for the birthdays of notable people like Albert Einstein, historical events like the 50th anniversary of the interlocking Lego block, and holidays like Valentine's Day.

Google Caffeine

In August 2009, Google announced the rollout of a new search architecture, codenamed "Caffeine". The new architecture was designed to return results faster and to better deal with rapidly updated information from services including Facebook and Twitter. Google developers noted that most users would notice little immediate change, but invited developers to test the new search in its sandbox. Differences noted for their impact upon search engine optimization included heavier keyword weighting and the importance of the domain's age. The move was interpreted in some quarters as a response to Microsoft's recent release of an upgraded version of its own search service, renamed Bing. Google announced completion of Caffeine on 8 June 2010, claiming 50% fresher results due to continuous updating of its index. With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, the company's distributed database platform. Caffeine is also based on Colossus, or GFS2, an overhaul of the GFS distributed file system.

Encrypted Search

In May 2010 Google rolled out SSL-encrypted web search. The encrypted search can be accessed at encrypted.google.com.

Instant Search

Google Instant, an enhancement that displays suggested results while the user types, was introduced on September 8, 2010. One concern is that people will select one of the suggested results instead of finishing their request, and that such a practice could cause bias toward familiar businesses or other search terms. Pornographic or otherwise offensive search terms are excluded from the suggested results. The instant feature appears only on the basic Google site and not specialized iGoogle pages. Google expects Google Instant to save users 2 to 5 seconds in every search, which they say will be collectively 11 million seconds per hour. Search engine marketing pundits speculate that Google Instant will have a great impact on local and paid search.
In concert with the Google Instant launch, Google disabled the ability of users to choose to see more than 10 search results per page. Instant Search can be disabled via Google's "preferences" menu, but autocomplete-style search suggestions now cannot be disabled. A Google representative stated, "It's in keeping with our vision of a unified Google search experience to make popular, useful features part of the default experience, rather than maintain different versions of Google. As Autocomplete quality has improved, we felt it was appropriate to have it always on for all of our users."

Negative reception

Many users have reported being unable to save the Instant Search "off" setting in their Google preferences.

Censorship

The publication 2600: The Hacker Quarterly has compiled a list of words that are restricted by Google Instant. These are terms the web giant's new instant search feature will not search. Most of the terms are vulgar and derogatory in nature, but some apparently irrelevant searches, including "Myleak", are also removed.

Features

June 2011 redesign

In late June 2011, Google introduced a new look to the Google home page in order to boost the use of the Google+ social tools.

Google +1

+1 helps people discover relevant content by letting a Google search result show that people they know and trust have recommended it. When a signed-in Google user is searching, the search result snippet can show a +1 button to recommend the page and an annotation with the names of the user's connections who have +1'd the page.

Interface features

  • Simple white background, with the "Google" logo changed from time to time to mark a historic day or celebrate a certain occasion.
  • Top bar has Web, Images, Videos, Maps, News, Shopping, Gmail, and more.
  • Voice search, which allows faster input than typing and helps when the correct spelling is not known.
  • When signed in to a Google account, the user's search history is automatically recorded.
  • Google Instant, which rapidly generates possible searches that contain the typed characters. For example, typing "Goo" would display Google, Google Maps, Google Translate, and so on.

Media features

  • Users can upload their own pictures to share with the world.
  • Image search with optional settings such as size, color, type, and sorting.
  • Video search connected to YouTube, with optional settings such as duration, time, and quality, as well as other sources related to the search topic.

Navigation bar design

One of the major changes was replacing the classic navigation bar with a black one. Google's digital creative director Chris Wiggins explains: "We're working on a project to bring you a new and improved Google experience, and over the next few months, you'll continue to see more updates to our look and feel." The new navigation bar has been negatively received by a vocal majority.

Google logo size reduction

The new design reduced the size of the Google logo.

Move of links

Links for advertising, business partners, and company information were pushed to the bottom edge of the browser window.

International

Google is available in many languages and has been localized completely or partly for many countries.

Languages

The interface has also been made available in some languages for humorous purposes:

Domain names

In addition to the main URL Google.com, Google Inc. owns 160 domain names for the countries/regions in which it has been localized.

Search products

In addition to its tool for searching webpages, Google also provides services for searching images, Usenet newsgroups, news websites, videos, searching by locality, maps, and items for sale online. As of 2006, Google had indexed over 25 billion web pages, was handling 400 million queries per day, and had indexed 1.3 billion images and over one billion Usenet messages. It also caches much of the content that it indexes. Google operates other tools and services including Google News, Google Suggest, Google Product Search, Google Maps, Google Co-op, Google Earth, Google Docs, Picasa, Panoramio, YouTube, Google Translate, Google Blog Search and Google Desktop Search.
There are also products available from Google that are not directly search-related. Gmail, for example, is a webmail application, but still includes search features; Google Browser Sync does not offer any search facilities, although it aims to organize your browsing time.
Google also launches many new beta products, such as Google Social Search and Google Image Swirl.

Energy consumption

Google claims that a search query requires altogether about 1 kJ or 0.0003 kW·h.
[...]

PageRank

Google's rise to success was in large part due to a patented algorithm called PageRank that helps rank web pages that match a given search string. When Google was a Stanford research project, it was nicknamed BackRub because the technology checks backlinks to determine a site's importance. Previous keyword-based methods of ranking search results, used by many search engines that were once more popular than Google, would rank pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. The PageRank algorithm instead analyzes human-generated links, assuming that web pages linked from many important pages are themselves likely to be important. The algorithm computes a recursive score for pages, based on the weighted sum of the PageRanks of the pages linking to them. PageRank is thought to correlate well with human concepts of importance. In addition to PageRank, Google has over the years added many other secret criteria for determining the ranking of pages on result lists, reported to be over 200 different indicators, the specifics of which are kept secret to keep spammers at bay and to help Google maintain an edge over its competitors globally.
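The published PageRank idea (as opposed to the full, undisclosed ranking system) can be sketched as a power iteration over the link graph: each page's score is a damped, weighted sum of the scores of the pages linking to it. The damping factor of 0.85 and the tiny example graph below are simply the commonly cited textbook choices:

    # Power-iteration sketch of the published PageRank formulation.
    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                  # dangling page: spread rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(graph))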

Search results

The exact percentage of the total of web pages that Google indexes is not known, as it is very difficult to accurately calculate. Google not only indexes and caches web pages, but also takes "snapshots" of other file types, which include PDF, Word documents, Excel spreadsheets, Flash SWF, plain text files, and so on. Except in the case of text and SWF files, the cached version is a conversion to (X)HTML, allowing those without the corresponding viewer application to read the file.
Users can customize the search engine by setting a default language, using the "SafeSearch" filtering technology, and setting the number of results shown on each page. Google has been criticized for placing long-term cookies on users' machines to store these preferences, a tactic which also enables them to track a user's search terms and retain the data for more than a year. For any query, up to the first 1000 results can be shown, with a maximum of 100 displayed per page. The ability to specify the number of results is available only if "Instant Search" is not enabled. If "Instant Search" is enabled, only 10 results are displayed, regardless of this setting.

Non-indexable data

Despite its immense index, there is also a considerable amount of data available in online databases which are accessible by means of queries but not by links. This so-called invisible or deep Web is minimally covered by Google and other search engines. The deep Web contains library catalogs, official legislative documents of governments, phone books, and other content which is dynamically prepared to respond to a query.

Google optimization

Since Google is the most popular search engine, many webmasters have become eager to influence their website's Google rankings. An industry of consultants has arisen to help websites increase their rankings on Google and on other search engines. This field, called search engine optimization, attempts to discern patterns in search engine listings, and then develop a methodology for improving rankings to draw more searchers to their clients' sites.
Search engine optimization encompasses both "on page" factors (like body copy, title elements, H1 heading elements and image alt attribute values) and Off Page Optimization factors (like anchor text and PageRank). The general idea is to affect Google's relevance algorithm by incorporating the keywords being targeted in various places "on page", in particular the title element and the body copy (note: the higher up in the page, presumably the better its keyword prominence and thus the ranking). Too many occurrences of the keyword, however, cause the page to look suspect to Google's spam checking algorithms.
Google has published guidelines for website owners who would like to raise their rankings when using legitimate optimization consultants.
It has been hypothesized, and allegedly is the opinion of the owner of one business about which there have been numerous complaints, that negative publicity, for example numerous consumer complaints, may serve as well as favorable comments to elevate a page's rank on Google Search. The particular problem addressed in The New York Times article, which involved DecorMyEyes, was addressed shortly thereafter by an undisclosed fix in the Google algorithm. According to Google, it was not the frequently published consumer complaints about DecorMyEyes which resulted in the high ranking but mentions on news websites of events which affected the firm, such as legal actions against it. There has also been a Google toolbar for use on many internet browsers.
[...]
