![](https://crypto4nerd.com/wp-content/uploads/2023/10/1KgV9Nj5fnJQ_AsVtszyOFA.png)
**Parse a document into sections and extract their content**
With tools like Microsoft Word, which store structured XML data behind the document, splitting the content into sections shouldn’t be much of a challenge.

Here’s the catch. Unlike on a website, where h1, h2, and h3 tags mark the structure, many people who write notes in Microsoft Word do not segment their documents properly with headings, subheadings, paragraph text, and so on. Unlike web developers, they are under no rule to do so. It doesn’t help that, for aesthetic reasons, many people end up using random tables or text boxes as section headers. As shown in the sample below, the sections are created using tables, while the subsections are in yet another format.
Hence, despite trying many options (such as extracting all the text and segmenting by section numbers), the algorithmic approach did not work. The problem is that numbers also appear within the chapter text, and naively treating every number as a section or subsection number just does not cut it.

Luckily, AI models such as Nougat (developed by Meta) were able to analyse my file and extract the contents of each segment as Markdown. Although a couple of segments were missing, the result was good enough to use.
Analysis of the example above showed that “##” indicates a section while “###” indicates a subsection.
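Given that convention, the Markdown output can be split into sections and subsections with a small parser. This is a minimal sketch (the heading levels follow what Nougat produced for my document; the function name is my own):

```python
import re

def split_markdown_sections(markdown_text):
    """Split Nougat's Markdown output into sections and subsections.

    Level-2 headings ("##") are sections and level-3 headings ("###")
    are subsections, matching the convention observed in the output.
    """
    sections = []
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{2,3})\s+(.*)", line)
        if match:
            if current:
                sections.append(current)
            current = {"level": len(match.group(1)),
                       "heading": match.group(2).strip(),
                       "content": []}
        elif current:
            current["content"].append(line)
    if current:
        sections.append(current)
    # Join each section's content lines back into a single string
    for section in sections:
        section["content"] = "\n".join(section["content"]).strip()
    return sections
```

Each returned dict carries the heading, its level (2 for a section, 3 for a subsection), and the text beneath it, ready for the question-generation step.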
**Get Large Language Models to think of questions which the section’s content can answer**
Once the content is segmented into sections and subsections, the content of each (sub)section is copied into ChatGPT to generate questions. After some experimentation, this was the prompt I went with.
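The exact wording of a question-generation prompt matters less than its shape: give the model the section and ask for questions it can answer. The template below is an illustrative stand-in, not the actual prompt used:

```python
def build_question_prompt(section_content, num_questions=5):
    """Build a question-generation prompt for one (sub)section.

    The wording here is a hypothetical example, not the author's
    actual prompt.
    """
    return (
        f"Read the following section and write {num_questions} questions "
        "that this section can answer, together with their answers.\n\n"
        f"Section:\n{section_content}"
    )
```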
**Get the embeddings and store the question and section pairs in a vector database**
All the questions are then collated. Extracting them one by one is a painful process, so I edited the prompt to make it output a JSON array instead. This was the output:
```json
[
  {
    "question": "What are Arenes also known as?",
    "answer": "Arenes are also referred to as aromatic hydrocarbons."
  },
  {
    "question": "What is the structural unit of Arenes?",
    "answer": "The structural unit of Arenes is the benzene ring."
  }
]
```
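Because the model now emits JSON, the questions can be pulled out programmatically rather than by hand. A short sketch, using the example output:

```python
import json

# The model's JSON output can be loaded directly; no manual
# copy-pasting of individual questions is needed.
raw_output = """[
  {"question": "What are Arenes also known as?",
   "answer": "Arenes are also referred to as aromatic hydrocarbons."},
  {"question": "What is the structural unit of Arenes?",
   "answer": "The structural unit of Arenes is the benzene ring."}
]"""

qa_pairs = json.loads(raw_output)
questions = [pair["question"] for pair in qa_pairs]
```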
With this output, I can now generate embeddings for all of the questions. Once that is done, I store them in a vector database like Pinecone.
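The storage step can be sketched without any external services: each record pairs a question's embedding with the id of the section it came from, which mirrors the (id, vector, metadata) records a vector database like Pinecone holds. The `toy_embed` hashing function and the `sec-1` ids below are self-contained placeholders, not a real embedding model or the actual schema:

```python
import hashlib
import math

def toy_embed(text):
    """Placeholder for a real embedding model.

    Hashes character trigrams into a small fixed-size vector so the
    example runs offline; real embeddings are semantic, this is not.
    """
    vec = [0.0] * 16
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % 16
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# One record per question: its embedding plus metadata linking it back
# to the section that answers it.
vector_db = []
for question, section_id in [
    ("What are Arenes also known as?", "sec-1"),
    ("What is the structural unit of Arenes?", "sec-1"),
]:
    vector_db.append({
        "values": toy_embed(question),
        "metadata": {"question": question, "section_id": section_id},
    })
```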
**In the semantic search tool, when the user asks a question, the top k similar questions are returned (based on similarity search)**
Once the vector database is created, the next few steps are rather straightforward.
When the user sends in a query, an embedding is generated for it, and the top k nearest neighbours by cosine similarity score are returned to the user.
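The ranking itself is simple: score every stored record against the query embedding and keep the k best. A minimal sketch (a managed vector database performs this same ranking server-side, at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, records, k=3):
    """Return the k records whose "values" vector is most similar
    to the query embedding, highest score first."""
    scored = [(cosine_similarity(query_vec, r["values"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```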
**The relevant section to answer the question is returned to the user**
Since each section is stored with its relevant questions as key-value pairs, it is easy to find the section’s content using the questions tagged to it.
The relevant section’s content is then returned to the user as the output. The output can also be given to an LLM as context.
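Because every stored question carries the id of the section that answers it, the final lookup is just a dictionary access. A sketch, with the section id and text as illustrative placeholders:

```python
# Sections are stored once, keyed by id; matched questions point back
# to them via their metadata, so retrieval is a plain lookup.
sections = {
    "sec-1": "Arenes, also known as aromatic hydrocarbons, are built "
             "around the benzene ring as their structural unit.",
}

def answer_with_section(matched_record, sections):
    """Return the content of the section tagged to a matched question."""
    return sections[matched_record["metadata"]["section_id"]]

record = {"metadata": {"question": "What are Arenes also known as?",
                       "section_id": "sec-1"}}
```

The returned text is what the user sees, or what gets passed to an LLM as context.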