This is Article 2 of the series “Vector Search”, where I discuss vector representation techniques and pre-processing steps. For an overview, I would suggest reading Introduction to Vector-Based Search, where I have provided a basic explanation of vector search.
Let’s start with a quick recap: what powers this search technique? Vectors. Vectors are numerical representations of data. As you might know, our computer 💻 can understand only numbers. So when you search “Wooden Table” on IKEA, the system does not understand these words; it converts them into a numerical representation to perform your search. Just to recall, I mentioned that these vectors capture the semantic meaning of a word or sentence. For example, the search term “Wooden Table” will have a vector similar to that of “Dining Table”, as they denote similar objects. Here, each dimension of the vector corresponds to a specific feature, and the value in that dimension represents the magnitude or strength of that feature for the object being represented.
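To make the idea of “similar vectors” concrete, here is a tiny sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and these numbers are purely illustrative). Cosine similarity measures how closely two vectors point in the same direction:

```python
import numpy as np

# Toy 3-dimensional vectors; the values are invented for illustration only.
wooden_table = np.array([0.90, 0.80, 0.10])
dining_table = np.array([0.85, 0.90, 0.15])
running_shoe = np.array([0.05, 0.10, 0.95])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(wooden_table, dining_table))  # close to 1.0
print(cosine_similarity(wooden_table, running_shoe))  # much lower
```

Similar objects end up with high cosine similarity; unrelated objects do not. This is the core comparison behind vector search.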
Although the concept of vector-based search has been explained, a concept alone does not solve the problem. You either need an appropriate tool or have to build a method yourself. While it’s great to create our own method, it’s worth exploring the methods already available that could aid in finding a solution.
By now, you must have realized the significance of vectorization in the entire process. This article covers open-source methods for vectorizing content, catering to both technical and non-technical readers. Before delving into embedding techniques, it is essential to apply certain pre-processing steps that can lead to more meaningful embeddings.
You might have heard that it is important to warm up before starting the workout💪. Or let’s make it simpler: how important is a starter😋 before the main course😉? It’s kind of similar — pre-processing is the step where we set the stage for the main event, making sure everything runs smoothly and efficiently.
Note: In this section, we will solely focus on textual data. Including image pre-processing would only add unnecessary complexity to the article.
First things first, let’s address the elephant in the room → “Messy” Data. Whether you got the data from a client or collected it from a live feed, most of the time it is unstructured. What does unstructured really mean? The data might contain:
- Weird characters {$, %, Š, Ú, etc.}
- Spelling errors
- All sorts of quirks that make you scratch your head🤔
It’s like trying to solve a puzzle with missing pieces. Pre-processing helps you fix these silly problems. Just remember, don’t ignore these quirks: left unchecked, they can result in a wrong vector representation of the data. Hence, it is usually suggested to perform EDA on the data before starting the real process.
Now that I have explained why pre-processing matters, let’s dive into the technical aspect. You can perform the following steps to pre-process:
- Convert to lowercase: This step ensures consistency by converting all characters to lowercase, regardless of their original casing.
- Replace inconsistencies: Text data may contain URLs, special characters, or symbols that can cause issues in further analysis. Apply regex to replace URLs and special characters with good old spaces.
- Remove Consecutive Spaces: In some cases, the previous step might introduce multiple consecutive spaces. To tidy up the text, remove these extra spaces, reducing them to a single space.
- Tokenization: Since the text can be long, the next step is to break it into smaller, manageable units called tokens. This helps in further analysis.
- Stopwords Removal: You might have noticed that words like “the”, “and” and “is” are spread all over the text, but they provide little insight and clutter the analysis. We can remove these using NLTK’s stopword list.
- Word Normalization: This step ensures consistency across variations of a word. Inflected words like ‘studying’ can be converted to their base form (‘study’). You can perform normalization in the following ways, depending on the requirement:
— Stemming: This is an aggressive approach to word normalization. It chops off prefixes or suffixes to reduce words to their root form. Consider this method when speed matters, as the result might not always be a meaningful word.
— Lemmatization: If your requirement is a more intelligent approach, this method converts a word into its base form based on its meaning in the sentence. Being a more complex process, it can be computationally expensive, which needs to be considered before selecting a method.
And voila! With this transformation, your text is ready for vectorization. These steps ensure the data is consistent, has reduced noise, and is of enhanced quality before undergoing the process of vectorization.
Now we have completed the pre-processing, but these are still just words, processed into a cleaner structure of the same data. By now you must have understood that they cannot be used directly: we need some kind of numerical representation — vectors/embeddings.
Word2vec / GloVe → Bring words to life by representing them as dense vectors. The technique uses a neural network to convert text into vectors based on the context it learned. There are two ways to use this algorithm: either create a custom model or load one from the available List of Word2vec Models. If you wish to use the GloVe algorithm, the approach is similar: load the model from the List of GloVe Models.
In the example above, we trained a word2vec model on a corpus of sentences. You may notice the comments at the end, added for those considering a pre-trained GloVe model instead of training their own. If you’re unsure which approach to take, train a model when enough data is available; otherwise, using a pre-trained model is recommended.
In the previous section, we discussed word embedding as a method of converting words to vectors. However, when dealing with a large collection of sentences or documents, a simple word embedding technique may not be efficient. Fortunately, we have tools like Doc2Vec and BERT to help us out.
Doc2Vec takes into account the context of words within documents to create document vectors, while BERT uses a transformer-based neural network architecture to capture the bidirectional relationships between words in a document.
There are plenty of articles available that can guide you through the BERT or Doc2Vec approach, but my aim is to introduce you to vector search. To do this, you need vectors of your data corpus. Think of this as a basic approach that you can build on, even into production-level code that solves your problem. You can use the above code to apply either the Doc2Vec approach from Gensim or a BERT model. If you have a document-level task with limited data or computational resources, you’ll find Doc2Vec to be your best choice. However, for tasks that require advanced contextual language understanding, with large datasets and high computational resources, BERT captures better relationships between words and their context.
Have you heard the saying, “A picture is worth a thousand words” — Fred R. Barnard? Text can be converted into vectors by the methods described, but what about images? CNN-based embeddings come to the rescue🛡️. A Convolutional Neural Network (CNN) can extract rich visual features from an image. By leveraging pre-trained CNN models like VGG or ResNet, it is possible to get a vector representation of the image. These embeddings capture aspects like shapes, colours, and textures, allowing machines to compare and analyze images efficiently.
Now that you have an idea about the applications of image embedding, it’s not enough to just understand it. You’ll need the code to convert those stunning images into vectors that can help you find that dress you’ve been eyeing but only have a picture of. VGG16 is one of the most widely used models pre-trained on the ImageNet dataset, which features a vast range of images. This powerhouse model can extract rich features and representations from images. For more complex problems, you can use other models such as ResNet. In essence, this process captures the image’s essence in a concise and meaningful format.
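A minimal sketch of extracting a VGG16 embedding with Keras, assuming TensorFlow is installed; the file name `dress.jpg` is a hypothetical example path:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False drops the 1000-way ImageNet classification head;
# pooling="avg" collapses the feature maps into a single 512-dim vector.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def image_embedding(path):
    img = image.load_img(path, target_size=(224, 224))  # VGG16 expects 224x224
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))     # add batch dim, normalize
    return model.predict(x)[0]                          # shape: (512,)

# vec = image_embedding("dress.jpg")  # hypothetical image path
```

Cosine similarity between two such 512-dimensional vectors then tells you how visually similar the two images are.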
Each vector representation has its strengths and weaknesses. If you want to search using an image and find similar images, like Google Lens, your go-to will be image embeddings. If your task is to find similar words or to categorize text into predefined classes, your solution lies in word embeddings, as they capture word-level semantics. Finally, if your task is to retrieve information from a large collection of data or to summarize text, document embeddings are the suggested choice, as they capture the holistic meaning of text.
By utilizing these techniques, researchers and developers can build powerful systems that understand and analyze a wide range of data types. This opens up new possibilities for intelligent machines.
So, if you ever wonder how Google Translate gave you great results, or how a product-recommendation feed helped you find that one product you were looking for in a sea of information, think of vector representation working behind the scenes.
Thank you for reading the article. I hope you gained an understanding of the importance of vector representation in vector-based search. If you have any suggestions or spot any mistakes, please leave a comment to help me improve.