![](https://crypto4nerd.com/wp-content/uploads/2023/06/02rh5dnrWGmqgyUfi.jpg)
Question Answering (QA) is where the power of language models meets the challenge of extracting valuable information from the decentralized, disconnected, or even outdated documentation in your business. In today’s information-rich era, the ability to rapidly retrieve precise answers from ever-growing collections of text and sources has become more crucial than ever.
This tutorial will give you the knowledge and skills to build the first steps of your company’s QA system. By the end, you’ll have a step-by-step understanding of how to increase your company’s productivity by putting the right answers closer to the people who need them.
Collect your data
To build a powerful QA system, acquiring relevant data is crucial. In this section, we walk through a sample process of gathering information from Wikipedia using Python and storing it for later processing.
Acquiring data from Wikipedia:
```python
import json

import wikipediaapi


def collect_wiki_data(topics, save_to_file=True):
    # Initialize a Wikipedia session.
    # Note: newer versions of the wikipedia-api package also require a user_agent argument.
    wiki_wiki = wikipediaapi.Wikipedia('en')

    # Collect information from Wikipedia
    data = []
    for topic in topics:
        page = wiki_wiki.page(topic)
        if page.exists():
            article_data = {
                'title': page.title,
                'summary': page.summary,
                'sections': []
            }
            for section in page.sections:
                article_data['sections'].append({
                    'title': section.title,
                    'content': section.text
                })
            data.append(article_data)

    if save_to_file:
        # Save each collected article to its own file in the data folder.
        for article in data:
            with open(f'data/{article["title"]}.txt', 'w') as f:
                print(f"Writing {article['title']} to file")
                f.write(f"Title: {article['title']}\n")
                f.write(f"Summary: {article['summary']}\n")
                f.write("Sections:\n")
                for section in article['sections']:
                    f.write(f" - {section['title']}: {section['content']}\n")
    return data


def save_wiki_data(topics):
    data = collect_wiki_data(topics)
    # Save the full collection as JSON in the data folder.
    with open('data/_all_data.txt', 'w') as f:
        f.write(json.dumps(data))


def main():
    # Define a list of topics or search queries
    topics = ['Artificial intelligence', 'Machine learning', 'Data science', 'Neural networks', 'Deep learning', 'Natural language processing', 'Computer vision', 'Reinforcement learning', 'Supervised learning', 'Unsupervised learning', 'Semi-supervised learning', 'Recommender systems', 'Data mining', 'Big data', 'Data engineering', 'Data visualization', 'Data analysis', 'Data wrangling', 'Data modeling', 'Data munging', 'Data architecture', 'Data collection', 'Data governance', 'Data quality', 'Data security', 'Data integrity', 'Data enrichment', 'Data transformation', 'Data fusion', 'Data lake', 'Data warehouse', 'Data mart', 'Data silo', 'Data asset', 'Data asset framework', 'Data asset management', 'Data asset metadata', 'Data asset owner', 'Data asset quality']
    save_wiki_data(topics)


if __name__ == "__main__":
    main()
```
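Once `_all_data.txt` is written, the JSON can be loaded back and flattened into a single context string for the QA step. The `flatten_articles` helper below is not part of the script above; it is a minimal sketch that assumes the data structure produced by `collect_wiki_data`:

```python
def flatten_articles(articles):
    """Join the collected article dicts into one plain-text context string."""
    parts = []
    for article in articles:
        parts.append(article['title'])
        parts.append(article['summary'])
        for section in article['sections']:
            parts.append(f"{section['title']}: {section['content']}")
    return "\n".join(parts)


# Tiny in-memory sample shaped like collect_wiki_data's output:
sample = [{'title': 'Deep learning',
           'summary': 'Deep learning is a subset of machine learning.',
           'sections': [{'title': 'History',
                         'content': 'It gained traction in the 2010s.'}]}]
context = flatten_articles(sample)
print(context.splitlines()[0])  # → Deep learning
```

In a real run you would pass `json.load(open('data/_all_data.txt'))` instead of the sample list.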
Ask Questions
We will explore how to perform question answering with Hugging Face Transformers, using a pre-trained transformer model to extract answers from a given context (our source text from Wikipedia). See the code below:
```python
from transformers import pipeline

# Prompt for the first question
q = input("What is your question? ")

# Load the model
model = pipeline('question-answering')

# Load the data
with open("data/_all_data.txt", "r") as f:
    data = f.read()

next_q = True
while next_q:
    # Get the answer
    answer = model(question=q, context=data)
    print(f"Question: {q}")
    print(f"Answer: '{answer['answer']}' with score {answer['score']}")

    # Prompt for another question
    next_q = input("Do you have another question? (y/n) ")
    if next_q == "y":
        q = input("What is your question? ")
    else:
        next_q = False
        print("Goodbye!")
```
Conclusions
Results
The QA system built with Hugging Face Transformers on a documentation corpus was able to produce short answers for specific questions, for example:
Question: what was the first computer to beat a world champion chess player?
Answer: 'Deep Blue' with score 0.9719266891479492
For other, equally simple questions, however, the model gave a poor answer:
Question: what alphago did ?
Answer: 'AlphaGo' with score 0.3987176716327667
Expected: won 4 out of 5 games of Go in a match with Go champion Lee Sedol
Even though the answer comes with a low score, the correct information was right next to the AlphaGo mention in the corpus.
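One lightweight mitigation is to treat low-scoring spans as “no answer” instead of surfacing them. The `confident_answer` helper below is illustrative rather than part of the original script, and the 0.5 threshold is an arbitrary cutoff you would tune on your own corpus:

```python
def confident_answer(result, threshold=0.5):
    """Return the extracted span only when the pipeline's score clears the threshold."""
    if result['score'] >= threshold:
        return result['answer']
    return None  # caller can fall back to "I don't know" or escalate to a human


# With the two results reported above:
print(confident_answer({'answer': 'Deep Blue', 'score': 0.9719}))  # → Deep Blue
print(confident_answer({'answer': 'AlphaGo', 'score': 0.3987}))    # → None
```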
Limitations
Basic (single-span) question answering using transformer models has its limitations. These models are designed to extract a concise answer from a single span of the input document, so they may struggle with questions that require multiple spans or complex reasoning beyond a single passage.
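One way to work around the single-span constraint, and the model’s input-length limit, is to run the pipeline over overlapping chunks of the corpus and keep the best-scoring span. This is a rough sketch, not the tutorial’s original code; the chunk sizes are in characters and the default values are arbitrary:

```python
def best_answer_over_chunks(qa, question, text, chunk_size=1000, overlap=200):
    """Run an extractive QA callable over overlapping character chunks
    and keep the highest-scoring span."""
    best = None
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunk = text[start:start + chunk_size]
        result = qa(question=question, context=chunk)
        if best is None or result['score'] > best['score']:
            best = result
    return best
```

Here `qa` would be the `pipeline('question-answering')` object from the previous section; any callable that takes `question` and `context` keyword arguments and returns an `answer`/`score` dict will work.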
It is important to be aware of these limitations when using single-span question answering, and to apply critical thinking, fact-checking, and multiple perspectives when dealing with complex questions or conflicting sources.