![](https://crypto4nerd.com/wp-content/uploads/2023/10/1KgV9Nj5fnJQ_AsVtszyOFA.png)
**Parse a document into sections and extract their content**
With tools like Microsoft Word, which store structured XML data behind the document, splitting the content into sections shouldn’t be much of a challenge.

Here’s the catch. Unlike on a website, where h1, h2, and h3 tags mark the structure, many people who write notes in Microsoft Word do not segment their documents properly with headings, subheadings, paragraph text, and so on. Unlike web developers, they are under no rule to do so. It doesn’t help that, for aesthetic reasons, many people end up using random tables or text boxes as section headers. As shown in the sample below, the sections are created using tables, while the subsections are in yet another format.
Hence, despite trying many options (such as extracting all the text and segmenting by section numbers), the algorithmic approach did not work. The problem is that numbers also appear within the chapter text, and naively treating every number as a section or subsection number just does not cut it.

Luckily, AI models such as Nougat (developed by Meta) were able to analyse my file and extract the contents of each segment as Markdown. Although a couple of segments were missing, the result was good enough to use.
Analysis of the example above showed that “##” indicates a section while “###” indicates a subsection.
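Given that convention, the Markdown output can be split into sections and subsections with a small parser. This is a minimal sketch (the heading levels follow what Nougat produced for my document; the function name is my own):

```python
import re

def split_markdown_sections(markdown_text):
    """Split Nougat's Markdown output into sections and subsections.

    Level-2 headings ("##") are sections and level-3 headings ("###")
    are subsections, matching the convention observed in the output.
    """
    sections = []
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{2,3})\s+(.*)", line)
        if match:
            if current:
                sections.append(current)
            current = {"level": len(match.group(1)),
                       "heading": match.group(2).strip(),
                       "content": []}
        elif current:
            current["content"].append(line)
    if current:
        sections.append(current)
    # Join each section's content lines back into a single string
    for section in sections:
        section["content"] = "\n".join(section["content"]).strip()
    return sections
```

Each returned dict carries the heading, its level (2 for a section, 3 for a subsection), and the text beneath it, ready for the question-generation step.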
**Get Large Language Models to think of questions which the section’s content can answer**
Once the content is segmented into sections and subsections, the content of each (sub)section is copied into ChatGPT to generate questions. After some experimentation, this was the prompt I went with.
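The exact wording of a question-generation prompt matters less than its shape: give the model the section and ask for questions it can answer. The template below is an illustrative stand-in, not the actual prompt used:

```python
def build_question_prompt(section_content, num_questions=5):
    """Build a question-generation prompt for one (sub)section.

    The wording here is a hypothetical example, not the author's
    actual prompt.
    """
    return (
        f"Read the following section and write {num_questions} questions "
        "that this section can answer, together with their answers.\n\n"
        f"Section:\n{section_content}"
    )
```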
**Get the embeddings and store the question and section pairs in a vector database**
All the questions are then collated. Extracting them one by one is a painful process, so I edited the prompt to make it output a JSON array instead. This was the output:
```json
[
  {
    "question": "What are Arenes also known as?",
    "answer": "Arenes are also referred to as aromatic hydrocarbons."
  },
  {
    "question": "What is the structural unit of Arenes?",
    "answer": "The structural unit of Arenes is the benzene ring."
  }
]
```
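Because the model now emits JSON, the questions can be pulled out programmatically rather than by hand. A short sketch, using the example output:

```python
import json

# The model's JSON output can be loaded directly; no manual
# copy-pasting of individual questions is needed.
raw_output = """[
  {"question": "What are Arenes also known as?",
   "answer": "Arenes are also referred to as aromatic hydrocarbons."},
  {"question": "What is the structural unit of Arenes?",
   "answer": "The structural unit of Arenes is the benzene ring."}
]"""

qa_pairs = json.loads(raw_output)
questions = [pair["question"] for pair in qa_pairs]
```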
With this output, I can now generate embeddings for all of the questions. Once that is done, I store them in a vector database like Pinecone.
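The storage step can be sketched without any external services: each record pairs a question's embedding with the id of the section it came from, which mirrors the (id, vector, metadata) records a vector database like Pinecone holds. The `toy_embed` hashing function and the `sec-1` ids below are self-contained placeholders, not a real embedding model or the actual schema:

```python
import hashlib
import math

def toy_embed(text):
    """Placeholder for a real embedding model.

    Hashes character trigrams into a small fixed-size vector so the
    example runs offline; real embeddings are semantic, this is not.
    """
    vec = [0.0] * 16
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % 16
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# One record per question: its embedding plus metadata linking it back
# to the section that answers it.
vector_db = []
for question, section_id in [
    ("What are Arenes also known as?", "sec-1"),
    ("What is the structural unit of Arenes?", "sec-1"),
]:
    vector_db.append({
        "values": toy_embed(question),
        "metadata": {"question": question, "section_id": section_id},
    })
```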
**In the semantic search tool, when the user asks a question, the top k similar questions are returned (based on similarity search)**
Once the vector database is created, the next few steps are rather straightforward.
When the user sends in a query, an embedding is generated for it, and the top k nearest neighbours by cosine similarity score are returned to the user.
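The ranking itself is simple: score every stored record against the query embedding and keep the k best. A minimal sketch (a managed vector database performs this same ranking server-side, at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, records, k=3):
    """Return the k records whose "values" vector is most similar
    to the query embedding, highest score first."""
    scored = [(cosine_similarity(query_vec, r["values"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```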
**The relevant section to answer the question is returned to the user**
Since each section is stored with its relevant questions as key-value pairs, it is easy to find the section’s content using the questions tagged to it.
The relevant section’s content is then returned to the user as the output. The output can also be given to an LLM as context.
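Because every stored question carries the id of the section that answers it, the final lookup is just a dictionary access. A sketch, with the section id and text as illustrative placeholders:

```python
# Sections are stored once, keyed by id; matched questions point back
# to them via their metadata, so retrieval is a plain lookup.
sections = {
    "sec-1": "Arenes, also known as aromatic hydrocarbons, are built "
             "around the benzene ring as their structural unit.",
}

def answer_with_section(matched_record, sections):
    """Return the content of the section tagged to a matched question."""
    return sections[matched_record["metadata"]["section_id"]]

record = {"metadata": {"question": "What are Arenes also known as?",
                       "section_id": "sec-1"}}
```

The returned text is what the user sees, or what gets passed to an LLM as context.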