When we talk about information retrieval as SEO professionals, we tend to focus heavily on the information gathering stage – the crawl.
During this phase, a search engine will discover and crawl the URLs it has access to (the volume and breadth depending on other factors that we colloquially refer to as crawl budget).
The crawl phase is not something we are going to focus on in this article, nor am I going to go into depth about how indexing works.
If you want to read more about crawling and indexing, you can do so here.
In this article, I will cover some of the basics of information retrieval, which, when understood, can help you better optimize web pages for ranking performance.
It can also help you better analyze algorithm changes and search engine results page (SERP) updates.
To understand and appreciate how today’s search engines process practical information retrieval, we need to understand the history of information retrieval on the Internet—especially how it relates to search engine processes.
Regarding digital information retrieval and the foundational technologies adopted by search engines, we can go back to the 1960s and Cornell University, where Gerard Salton led a team that developed the SMART Information Retrieval System.
Salton is credited with developing and using vector space modeling for information retrieval.
Vector space models
Vector space models are accepted in the data science community as a key mechanism in how search engines “search” and platforms like Amazon provide recommendations.
This method allows a processor, such as Google, to compare different documents against queries when both are represented as vectors.
Google refers to this in its documentation as vector similarity search, or "nearest neighbor search," a term defined by Donald Knuth in 1973.
In a traditional query search, the processor will use keywords, tags, etc., within the database to find relevant content.
This is quite limited as it narrows the search field within the database because the answer is a binary yes or no. This method can also be limited when processing synonyms and related entities.
To combat this and provide results for queries with multiple common interpretations, Google uses vector similarity to tie together various meanings, synonyms, and entities.
The closer two entities sit in the vector space – i.e., the smaller the distance between their vectors – the more similar they are considered to be.
A good example of this is when you Google my name.
In Google, [dan taylor] can be:
- Me, the SEO person.
- A British sports journalist.
- A local news reporter.
- Lt. Dan Taylor from Forrest Gump.
- A photographer.
- A model maker.
Using traditional keyword search with binary yes/no criteria, you won’t get this spread of results on page one.
With vector search, the processor can produce a search results page based on similarities and relationships between different entities and vectors within the database.
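As a purely illustrative sketch (the entities, vectors, dimensions, and the `nearest_neighbors` helper below are invented for the example, not how Google actually represents anything), a nearest neighbor search over toy vectors might look like this in Python:

```python
from math import dist  # Euclidean distance, Python 3.8+

# Toy 3-dimensional "embeddings" for the different [dan taylor] entities.
entities = {
    "Dan Taylor (SEO consultant)":    (0.9, 0.1, 0.2),
    "Dan Taylor (sports journalist)": (0.2, 0.9, 0.1),
    "Lt. Dan Taylor (Forrest Gump)":  (0.1, 0.2, 0.9),
}

def nearest_neighbors(query_vector, top_k=3):
    """Rank entities by Euclidean distance to the query vector (closest first)."""
    scored = [(name, dist(query_vector, vec)) for name, vec in entities.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]

# A query vector leaning towards the "SEO" interpretation still returns the
# other entities, ordered by how close they sit in the vector space – which is
# how a mixed-intent results page can be assembled.
print(nearest_neighbors((0.8, 0.2, 0.2)))
```

Production systems use learned embeddings with hundreds of dimensions and approximate nearest-neighbor indexes rather than a brute-force scan, but the ranking principle – closer vectors mean more similar entities – is the same.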
You can read the company’s blog here to learn more about how Google uses it across various products.
Similarity
When comparing documents in this way, search engines probably use a combination of query term weight (QTW) and the similarity coefficient.
QTW applies a weight to specific terms in the query; those weights are then used within the vector space model to calculate a similarity coefficient, commonly via the cosine coefficient.
The cosine similarity measures the similarity between two vectors and is used in text analysis to measure document similarity.
This is a likely mechanism in how search engines determine duplicate content and value propositions across a website.
The cosine of an angle ranges between -1 and 1, but in text retrieval, where term weights are non-negative, cosine similarity is traditionally plotted between 0 and 1 – with 0 meaning the vectors are orthogonal (maximally dissimilar) and 1 meaning maximum similarity.
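To make the idea concrete, here is a minimal sketch that builds plain term-frequency vectors for pages and scores them with the cosine coefficient. The weighting scheme here is deliberately simple; we don't know the exact term weights (QTW or otherwise) any search engine actually applies:

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Build a simple term-frequency vector (bag of words) for a document."""
    return Counter(text.lower().split())

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors:
    0 = orthogonal (no shared terms), 1 = identical term distribution."""
    a, b = term_vector(doc_a), term_vector(doc_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two near-duplicate product descriptions score close to 1, which is how a
# cosine-based check could flag duplicate or thin content across a site.
page_a = "blue widget with free next day delivery"
page_b = "blue widget with fast free next day delivery"
page_c = "guide to vector space models in information retrieval"

print(cosine_similarity(page_a, page_b))  # high – near-duplicate pages
print(cosine_similarity(page_a, page_c))  # 0.0 – unrelated pages
```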
The role of an index
In SEO, we talk a lot about the index, indexing and indexing problems – but we don’t actively talk about the role of the index in search engines.
The purpose of an index is to store information, which Google does through layered indexing systems and shards, to act as a data reservoir.
This is because it would be unrealistic, uneconomical, and a poor end-user experience to remotely access (crawl) web pages, analyze their content, score them, and then present a SERP in real time.
Typically, a modern search engine index will not contain a complete copy of every document; it is more of a database of key points and data that have been extracted, with the document itself living in a separate cache.
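A toy inverted index illustrates the principle: the index maps each term to the documents containing it, so lookups never need to touch (or re-crawl) the documents themselves. This is only a conceptual sketch – Google's layered, sharded index is far more elaborate, and the URLs and text below are made up:

```python
from collections import defaultdict

# Toy document store standing in for a cache of fetched pages.
documents = {
    "/page-1": "vector space models in information retrieval",
    "/page-2": "crawl budget and indexing for large sites",
    "/page-3": "how search engines use vector similarity",
}

# Inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for url, text in documents.items():
    for term in text.lower().split():
        index[term].add(url)

def lookup(term):
    """Return the documents that contain the term, straight from the index."""
    return sorted(index.get(term.lower(), set()))

print(lookup("vector"))  # ['/page-1', '/page-3']
print(lookup("crawl"))   # ['/page-2']
```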
While we don’t know exactly what processes search engines like Google will go through as part of their information retrieval system, they will likely have stages of:
- Structural analysis – Text format and structure, lists, tables, images, etc.
- Stemming – Reducing variations of a word to its root. For example, “searching” and “searched” are reduced to “search”.
- Lexical analysis – Converting the document into a list of words and then analyzing it to identify important factors such as dates, authors, and term frequency. Note that this is not the same as TF*IDF.
We would also expect other considerations and data points to be taken into account during this phase, such as backlinks, source type, whether or not the document meets the quality threshold, internal linking, main content/supporting content, etc.
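As a rough illustration of the stemming and lexical analysis stages above (not Google's actual pipeline), the sketch below tokenizes a snippet of text, applies a deliberately crude suffix-stripping stemmer, and counts term frequencies:

```python
import re
from collections import Counter

def naive_stem(word):
    """Very crude suffix-stripping stemmer, e.g. 'searching'/'searched' -> 'search'.
    Real systems use proper stemmers or lemmatizers (Porter, Snowball, etc.)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_analysis(document):
    """Tokenize a document, stem each token, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(naive_stem(token) for token in tokens)

doc = "Searching and searched pages are stored after the search engine crawls them"
print(lexical_analysis(doc))
# Counter({'search': 3, ...}) – variations of 'search' collapse to one root
```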
Accuracy and post-retrieval
In 2016, Paul Haahr gave great insight into how Google measures the “success” of its process and how it applies post-retrieval adjustments.
You can watch his presentation here.
In most information retrieval systems, there are two primary measures of how successful the system is at producing a good result set.
These are precision and recall.
Precision
The number of relevant documents returned, measured against the total number of documents returned.
In recent months, many sites have seen drops in the total number of keywords they rank for (often obscure, fringe keywords they probably had no right to rank for in the first place). We can speculate that search engines are refining their information retrieval systems for greater precision.
Recall
The number of relevant documents returned, measured against the total number of relevant documents that exist.
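As a quick worked example (the result set and relevance labels below are hypothetical), precision and recall can be computed like this:

```python
def precision(returned, relevant):
    """Share of returned documents that are actually relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Share of all relevant documents that were actually returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

# Hypothetical example: 10 documents returned, 8 of them relevant,
# out of 20 relevant documents that exist in the index.
returned_docs = [f"doc{i}" for i in range(1, 11)]   # doc1..doc10
relevant_docs = [f"doc{i}" for i in range(3, 23)]   # doc3..doc22

print(precision(returned_docs, relevant_docs))  # 0.8 – 8 of 10 returned are relevant
print(recall(returned_docs, relevant_docs))     # 0.4 – 8 of 20 relevant were returned
```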
Search engines focus more on precision than recall, as precision leads to better search results pages and greater user satisfaction. It is also less system-intensive than returning and processing more documents than required.
Conclusion
The practice of information retrieval can be complex due to the different formulas and mechanisms involved.
Since we don’t fully know or understand how this process works inside search engines, we should focus on the basics and the guidelines provided, rather than trying to game metrics like TF*IDF that may or may not be used (and may be weighted differently) in the overall outcome.