![](https://crypto4nerd.com/wp-content/uploads/2023/07/0qdyx1lgcb6ErVJQ7-1024x683.jpeg)
The problem of semantic search, particularly in the context of question answering over documents, is examined in this paper. Semantic search currently assumes that the answer to a question is semantically similar to the query, a supposition that can result in lower accuracy and reliability for the search engine. A novel solution, “Semantic Enrichment,” is proposed as an improvement over the conventional method, with an in-depth comparison between the two strategies provided.
In my recent article “The Problem With Semantic Search”, I dug into the assumptions behind semantic search and the limitations they imply.
Semantic search is a technique for retrieving information on the web by understanding the contextual meaning of the query. It poses two challenges, however: deciding how many similar chunks to return so that the answer is actually among the results, and handling the case where the searched documents contain no answer to the query at all. To address these issues, a novel approach called “Semantic Enrichment” is proposed.
In the traditional semantic search approach, documents are first acquired to feed the semantic search algorithm. The data acquisition script takes a list of topics, fetches the content for each one from Wikipedia, a freely available resource, and saves the collected data to a local directory for further use. The obtained content is then divided into chunks, and each chunk is encoded into an embedding using a transformer model. These embeddings serve as inputs for the NearestNeighbors algorithm, which returns the chunks most similar to a user query.
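As a rough illustration of the chunk, embed, and retrieve steps described above, here is a minimal, dependency-free sketch. It is not the article's code and makes several assumptions: `embed` is a toy bag-of-words stand-in for the transformer model, a brute-force cosine-similarity ranking stands in for the NearestNeighbors algorithm, and the function names, the `chunk_size` and `k` parameters, and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=12):
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy stand-in for a transformer embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical sample text standing in for fetched Wikipedia content.
document = (
    "Paris is the capital of France. "
    "Mount Everest is the highest mountain on Earth. "
    "The Pacific is the largest ocean."
)
chunks = chunk_text(document, chunk_size=8)
for score, chunk in nearest_chunks("What is the capital of France?", chunks, k=2):
    print(f"{score:.2f}  {chunk}")
```

Note that the search always returns `k` chunks, whether or not any of them contains the answer. That is exactly the limitation discussed above: the caller has to guess a suitable `k`, and a query with no answer in the documents still gets results.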
The Python code for the data acquisition and semantic search algorithm is illustrated below:
I first need to acquire documents that will feed the semantic search…