
Vector databases store data as high-dimensional vectors, which represent features or attributes numerically. Each vector has a fixed number of dimensions, ranging from tens to thousands, depending on the complexity and granularity of the data. These vectors are generated by embedding raw data such as text, images, audio, or video.
Vector databases offer several advantages:
- Similarity Search: Computes how similar two objects are by comparing their vector embeddings, so the stored items closest to a query can be found.
- Fast Retrieval of Data: Distances between vectors (Euclidean, Manhattan, cosine, and Chebyshev) measure how close items are to one another, which helps group similar data and retrieve it quickly.
- Improved query performance.
- Highly scalable and flexible.
- High Dimensional Search: Supports search over vectors with many dimensions, which lets us work with a wide range of data.
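To make the distance metrics named in the list above concrete, here is a minimal sketch in plain Python. The function names are illustrative and not taken from any particular vector database, and the code assumes both vectors have the same length and, for the cosine case, are non-zero.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the vectors.

    Assumes neither vector is all zeros (otherwise the norm is 0).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```

Note that cosine distance ignores vector length and compares direction only, while the other three depend on the actual coordinate differences. Which one fits best depends on how the embeddings were produced.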
For example, we can use a vector database to:
- Find images that are similar to a given image based on their visual content and style
- Find documents that are similar to a given document based on their topic and sentiment
- Find products that are similar to a given product based on their features and ratings
To perform a similarity search, we use a query vector that represents the information we want and retrieve the closest matches from the vector database. The query vector can be derived from either:
- The same type of data as the stored vectors (e.g., using an image as a query for an image database)
- A different type of data (e.g., using text as a query for an image database)
Then, we use a similarity measure that calculates how close or distant two vectors are in the vector space. The similarity measure can be based on various metrics, such as cosine similarity, Euclidean distance, Hamming distance, or the Jaccard index.
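The query step described above can be sketched as a brute-force search over a tiny in-memory store. This is only an illustration: the `store` contents, file names, and the `top_k` helper are hypothetical, and real vector databases typically rely on specialized indexes rather than comparing the query against every stored vector. Hamming distance and the Jaccard index are included to show measures suited to binary vectors and sets.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def top_k(query, store, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical 3-dimensional embeddings; real ones have far more dimensions.
store = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, store, k=2)
print(results)  # ['cat.jpg', 'dog.jpg']
```

The same `top_k` structure works with any of the other measures. With a distance instead of a similarity, the sort order is reversed so that the smallest values come first.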