ProjectBlog: Machine Learning to Big Data — Scaling Inverted Indexing with Solr

Blog: Machine Learning to Big Data — Scaling Inverted Indexing with Solr


The amount of data is growing exponentially every day and the volume has increased incredibly over the last few years. With this accelerated growth of data, it has now become important to search collections quickly to serve business decisions and customer queries both offline and online. To process large document collections quickly, at the order of billions to trillions of word per second, the search concept of inverted indexing grew familiar. It stands as one of the fastest mechanisms of retrieving ranked information among many documents that contain certain words or flexible matching words/expressions. The advantages of inverted indexing are:

  • Allow fast full text searches, at a cost of increased processing when a document is added to the database.
  • Easier process to implement.
  • Popular data structure used in distributed scalable document retrieval systems, as for example in search engines.

The blog is structured as follows:

  • Stating the concept of inverted indexing and how it can be used to solve text based classification problems.
  • Describing the architecture and system components to narrate how inverted indexing can be used, for e.g. search-engines and databases.
  • Giving a short overview of Lucene and Solr with an understanding of how they differ from each other.
  • Analyzing “TF-IDF” based Scoring Model where different scoring factors are explained that are used for querying or searching words in the document.

An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

Steps to create Inverted Index

To create an inverted index, the text data of each document at first is splitted into separate words (termed as terms or tokens). The list of documents after tokenization needs further preprocessing (normalization, stemming, removal of stop-words) on each token to create a sorted list of all the unique terms. The unique terms are then listed per document in which the term appears, to create list of inverted indexes as shown in the figure below.

Text processing from set of documents to create Inverted Index

Apart from reverse indexing, search engines like Lucene is equipped to search on the basis of phrase queries, wildcard and proximity queries, along with sorting, ranking and grouping the searched results.

Indexing and document ranking

Inverted Indexing can be implemented in python using TfidfVectorizer from sklearn.feature_extraction.text library. Its used for text classification processes with supervised learning techniques (Boosting/Bagging/Stacking).

TfIdfVectorizer params:

ngram_range(min, max)= Represents the lower and upper boundary of the range of n-values for different n-grams to be extracted with the condition min_n <= n <= max_n.

max_features = Represents the top max_features ordered by term frequency across the tweet history.

analyzer = Represents the unit used to compose the feature (word or character n-grams).

token = Regular expression with 2 or more alphanumeric characters.

documents = (
"Technology has made us ever more productive",
"As I have pointed out, technology may in fact have limits, but we do not know",
"Technology is simply the combining of other economic products in new ways.",
"This is because technology is cumulative.",
"Technology allowed us to peer deeper into the mysteries of the miniscule.")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print (tfidf_matrix.shape)
print(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix))

(5, 40)

#Vocabulary List
{‘we’: 39, ‘know’: 16, ‘of’: 25, ‘productive’: 30, ‘this’: 35, ‘ever’: 9, ‘limits’: 17, ‘because’: 2, ‘has’: 11, ‘in’: 13, ‘us’: 37, ‘peer’: 28, ‘cumulative’: 5, ‘into’: 14, ‘as’: 1, ‘do’: 7, ‘products’: 31, ‘have’: 12, ‘combining’: 4, ‘not’: 24, ‘allowed’: 0, ‘deeper’: 6, ‘miniscule’: 20, ‘ways’: 38, ‘other’: 26, ‘new’: 23, ‘simply’: 32, ‘economic’: 8, ‘the’: 34, ‘mysteries’: 22, ‘to’: 36, ‘out’: 27, ‘made’: 18, ‘is’: 15, ‘more’: 21, ‘may’: 19, ‘technology’: 33, ‘pointed’: 29, ‘but’: 3, ‘fact’: 10}

#Tf-Idf score

[2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 1.69314718 2.09861229 1.69314718 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 1.69314718 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 1. 1.69314718 2.09861229
 2.09861229 1.69314718 2.09861229 2.09861229]

# Cosine similarity

[[1. 0.02350305 0.02986958 0.03878472 0.10853509]]

From the cosine similarity score, we can compute the angle between the 2 vectors. Angle theta (θ) can be isolated by taking a cos inverse ( cos^-1) on the right hand side of the measured angle.

To compute similarity between first and last document, we substitute the scores obtained in the previous step in the equations. This gives us a cosine similarity as 0.10853509, as the first document has a similarity score of 1 when compared to itself.

import math
# This was already calculated on the previous step, so we just use the value
cos_sim = .10853509
angle_in_radians = math.acos(cos_sim)
print (math.degrees(angle_in_radians))


Inverted Indexing with Solr

Solr is an Apache licensed open source search engine. It has an exceptionally active developer community and continues to be the foremost search platform in most of the big data platforms. Solr is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead, and stores it in the data directory.

An index consists of one or more Documents, and a Document consists of one or more Fields.

Solr allows defining schemas to facilitate searching items through indexing. It allows us to state:

  • What kinds of fields there are
  • Which field should be used as the unique/primary key
  • Which fields are required
  • How to index and search each field

When data is added to Solr, it goes through a series of transformations (lower-casing, removing word stems etc), popularly known as analysis before being added to the index. The end result of the analysis are a series of tokens which are then added to the index.

Solr and Lucene

Solr uses Lucene underneath. Lucene is a powerful search engine framework that lets us add search capability to our application. It exposes an easy-to-use API while hiding all the search-related complex operations. Any application can use this library, not just Solr.

Solr is built around Lucene. It is not just an http-wrapper around Lucene but has been known to add more features to Lucene. Solr is ready-to-use web application that offers related infrastructure and a lot more features in addition to what Lucene offers.

Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval. VSM model uses term weighting scheme, and the retrieved documents could be sorted according to their relevancy degree.

In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension. If a term occurs in the document, its value in the vector representation is non-zero. Several different ways of computing the vector values (also known as (term) weights) exists, the one which we have discussed here is Tf-idf weighting.

VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality. For given term t and document (or query) x, Tf(t, x) varies with the number of occurrences of term t in x (when one increases so does the other) and idf(t) similarly varies with the inverse of the number of index documents containing term t.

VSM score of document d for query q is given by the Cosine Similarity of the weighted query vectors V(q) and V(d). The the cosine similarity between the query vector and a document vector serves as a measure of the score of the document for that query and is used to select the top-scoring documents for a query.

Lucene’s Practical Scoring Function is given by score(q, d) stated below and represents a function of the query terms found in the specified document.

The different scoring factors of “TF-IDF” based Scoring Model are given below.

tf (t in d) correlates to the term’s frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. As its occurrence is demonstrated twice in both the query and the document, hence it is squared in the equation. tf(t in d) is given by:

idf(t) stands for Inverse Document Frequency, used in NLP — in text mining, search queries and summarization.. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give higher contribution to the total score. Its represented by taking the natural log (normalization) of the inverse document frequencies as given below:

coord(q, d) is a score factor based on how many of the query terms are found in the specified document. Documents containing more of the query’s terms will receive a higher score than another document with fewer query terms.

queryNorm(q) is a normalizing factor used to assign scores to queries and make them comparable. As all documents are multiplied by the same factor, it does not affect the overall document ranking scores. This is given by:

The sum of squared weights (of the query terms) is computed by the query Weight object.

t.getBoost() is a search time boost of term t in the query q as specified in the query text, or as set by application calls using setBoost().

norm(t, d) encapsulates a few (indexing time) boost and length factors:

  • Index-time boost — to increase the scores for certain documents that match a certain query, index-time boosts can be used. Index-time boosts can be specified per-field to impart extra boost to queries matching on that specific field. An Index-time boost on a value of a multiValued field applies to all values for that field.
  • Field boost — set by calling field.setBoost() before adding the field to a document. Queries can include “boosts” based on specific attributes of documents, or by using a function on the value of a numeric field.
  • lengthNorm — measures of the importance of a term according to the total number of terms in the field. It is computed when the document is added to the index in accordance with the number of tokens of this field in the document, enabling shorter fields to add more to the score.

When a document is added to the index, all the above factors are multiplied as given below. If the document has multiple fields with the same name, all their boosts are multiplied together:


Inverted Indexing Components : The following figure shows the system architecture (including Elastic Search , MongoDB, SOLR) where reverse indexing can be used to facilitate fast search on multiple documents.

Solr Index Retrieval

An application using Solr for search and indexing retrieves the indexes in step by step manner.

  • When an user runs a search in Solr, the search query is processed by a request handler. As show in the below diagram, on the left hand side, the search request handler is a plugin that defines the logic to be used when Solr processes a request. Solr supports a variety of request handlers. In addition, applications can be configured to allow users to override the default handlers in preference of a custom request handler.
  • The search request handler includes query parser and response writer. To process a search query, a request handler calls a query parser that is responsible for parsing the textual query and converting it into corresponding Lucene query objects. Different query parsers support different syntax. Solr, uses Lucene’s standard query parser for greater precision in searches.
  • The search engine creates internal requests to randomly chosen replicas of every shard in the collection, coordinates the responses, issues any subsequent internal requests as needed (to refine facets values, or to request additional stored fields), and constructs the final response for the client.
  • The response writer component builds the query response object in the required format for the final presentation, generally an XML/JSON/PHP object is returned.
  • Solr has the cloud architecture that enables the distributed capabilities. The system architecture partitions data into multiple instances, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance. In addition, a ZooKeeper server that helps manage the overall structure so that both indexing and search requests routed are in sync.

Inverted Indexing with Map Reduce: Apache Blur provides distributed search functionality built on the top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. It allows indexing documents in map-reduce architectural framework where one or more documents are distributed to a mapper. The word counts deduced by each mapper can be further aggregated by one or more reducers to create inverted index of the input documents. The final index list stored in HDFS clusters, contains list of words in the files and the set of files that contain each terms and the word frequency in each of the files.

Inverted Index algorithm in MapReduce with three mappers and two reducers. Source :

Indexing databases with Solr




Source: Artificial Intelligence on Medium

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top

Display your work in a bold & confident manner. Sometimes it’s easy for your creativity to stand out from the crowd.