
How to generate summaries from data in your Weaviate vector database with an OpenAI LLM in Python using a concept called “Generative Feedback Loops”
Customer reviews are one of the most important features on Amazon, the world’s largest online retailer. Before deciding whether to buy a product, shoppers like to learn what others who spent their own money on it thought about it. Since introducing customer reviews in 1995 [1], Amazon has made various improvements to the feature.
On August 14th, 2023, Amazon introduced its latest improvement to customer reviews using Generative AI. Amazon offers millions of products, with some accumulating thousands of reviews. In 2022, 125 million customers contributed nearly 1.5 billion reviews and ratings [1]. The new feature summarizes the customer sentiment from thousands of verified customer reviews into a short paragraph.
The feature can be found at the top of the review section under “Customers say” and comes with a disclaimer that the paragraph is “AI-generated from the text of customer reviews”. Additionally, the feature comes with AI-generated product attributes mentioned across reviews, enabling customers to filter for specific reviews mentioning these attributes.
Currently, this new feature is being tested and is only available to a subset of mobile shoppers in the U.S. across a selection of products. The release of the new feature has already sparked some discussion around the reliability, accuracy, and bias of this type of AI-generated information.
Since summarizing customer reviews is one of the more obvious use cases of Generative AI, other companies such as Newegg or Microsoft have also already released similar features. Although Amazon has not released any details on the technical implementation of this new feature, this article will discuss how you can recreate it for your purposes and implement a simple example.
To recreate the review summary feature, you can follow a concept called generative feedback loops: retrieve information from a database, use it to prompt a generative model, and store the newly generated data back in the database.
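The loop can be sketched in plain Python before bringing in a real database or model. This is only an illustration of the retrieve, generate, and store steps: the in-memory lists and dictionaries stand in for the database, and `fake_model` is a hypothetical placeholder for the LLM call, not part of any library.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def run_feedback_loop(
    reviews: List[dict],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Retrieve reviews per product, prompt a model, and collect the summaries."""
    # 1. Retrieve: group the stored review texts by product.
    texts_by_product = defaultdict(list)
    for review in reviews:
        texts_by_product[review["product_id"]].append(review["review_text"])

    summaries = {}
    for product_id, texts in texts_by_product.items():
        # 2. Generate: build a prompt from the retrieved data and call the model.
        prompt = "Summarize these customer reviews:\n" + "\n".join(texts)
        # 3. Store: keep the generated text so it can be written back as new data.
        summaries[product_id] = generate(prompt)
    return summaries


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; it only reports how many reviews it saw."""
    n_reviews = len(prompt.splitlines()) - 1  # first line is the instruction
    return f"Summary of {n_reviews} review(s)"


# In-memory stand-in for the review collection.
reviews_db = [
    {"product_id": "P1", "review_text": "Great tone."},
    {"product_id": "P1", "review_text": "Cable is too short."},
    {"product_id": "P2", "review_text": "Works as expected."},
]

# In-memory stand-in for the collection that receives the generated data.
products_db = run_feedback_loop(reviews_db, fake_model)
```

The rest of this article replaces the stand-ins with a Weaviate collection and an OpenAI model.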
Prerequisites
As described above, you will need a database to store the data and a generative model. For the database, we will use a Weaviate vector database, which comes with integrations with many different generative modules (e.g., OpenAI, Cohere, Hugging Face, etc.).
!pip install weaviate-client --upgrade
For the generative model, we will use OpenAI’s gpt-3.5-turbo, for which you will need to have your OPENAI_API_KEY environment variable set. To obtain an API key, you need an OpenAI account and then select “Create new secret key” under API keys. Since OpenAI’s generative models are directly integrated with Weaviate, you don’t need to install any additional package.
Dataset Overview
For this small example, we will use the Amazon Musical Instruments Reviews dataset (License: CC0: Public Domain) with 10,254 reviews across 900 products on Amazon in the musical instruments category.
import pandas as pd

# Load only the columns needed for this example
df = pd.read_csv(
    "/kaggle/input/amazon-music-reviews/Musical_instruments_reviews.csv",
    usecols=['reviewerID', 'asin', 'reviewText', 'overall', 'summary', 'reviewTime'],
)

# Drop rows without review text
df = df[df.reviewText.notna()]
Setup
As a first step, you will need to set up your database. You can use Weaviate’s Embedded option for playing around, which doesn’t require any registration or API key setup.
import os

import weaviate
from weaviate import EmbeddedOptions

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        # Pass the OpenAI key so Weaviate can call the OpenAI modules
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]
    }
)
Next, we will define the schema to populate the database with the data (review_text, product_id, and reviewer_id). Note that we’re skipping the vectorization of product_id and reviewer_id with "skip": True here to keep inferencing costs to a minimum. You can enable vectorization if you want to expand this feature and enable semantic search across reviews.
# Start from a clean slate if the class already exists
if client.schema.exists("Reviews"):
    client.schema.delete_class("Reviews")

class_obj = {
    "class": "Reviews",  # Class definition
    "properties": [  # Property definitions
        {
            "name": "review_text",
            "dataType": ["text"],
        },
        {
            "name": "product_id",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,  # skip vectorization for this property
                    "vectorizePropertyName": False
                }
            }
        },
        {
            "name": "reviewer_id",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,  # skip vectorization for this property
                    "vectorizePropertyName": False
                }
            }
        },
    ],
    "vectorizer": "text2vec-openai",  # Specify a vectorizer
    "moduleConfig": {  # Module settings
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
        "generative-openai": {
            "model": "gpt-3.5-turbo"
        }
    },
}

client.schema.create_class(class_obj)
Now, you can populate the database in batches.
from weaviate.util import generate_uuid5

# Configure batch
client.batch.configure(batch_size=100)

# Initialize batch process
with client.batch as batch:
    for _, row in df.iterrows():
        review_item = {
            "review_text": row.reviewText,
            "product_id": row.asin,
            "reviewer_id": row.reviewerID,
        }
        batch.add_data_object(
            class_name="Reviews",
            data_object=review_item,
            uuid=generate_uuid5(review_item)
        )
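The import above derives each object's UUID from its content, so re-running it should produce the same IDs instead of new ones. The sketch below illustrates that idea with the standard library only. `deterministic_uuid` is a hypothetical helper written for this illustration, not Weaviate's implementation of generate_uuid5, and the sample review values are made up.

```python
import uuid


def deterministic_uuid(data_object: dict) -> str:
    """Hypothetical content-derived ID: the same fields always give the same UUID."""
    # Sort the items so that key order does not change the result.
    canonical = repr(sorted(data_object.items()))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


# Made-up sample data for illustration only.
review = {
    "review_text": "Great strings.",
    "product_id": "P1",
    "reviewer_id": "R1",
}
same_review = dict(reversed(list(review.items())))  # same content, other key order
other_review = {**review, "review_text": "Strings broke quickly."}

id_a = deterministic_uuid(review)
id_b = deterministic_uuid(same_review)
id_c = deterministic_uuid(other_review)
```

Here `id_a` and `id_b` are identical because the content is the same, while `id_c` differs because the review text changed.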
Generate new data object (summary)
Now, you can start generating the review summary for every product. Under the hood, you are performing retrieval-augmented generation:
First, prepare a prompt template that can take in review texts as follows:
generate_prompt = """
Summarize these customer reviews into a one-paragraph long overall review:
{review_text}
"""
Then, build a generative search query that follows these steps:
- Retrieve all reviews (client.query.get('Reviews')) for a given product (.with_where())
- Stuff the retrieved review texts into the prompt template and feed it to the generative model (.with_generate(grouped_task=generate_prompt))
# product_id holds the ID (asin) of the product whose reviews should be summarized
summary = (
    client.query
    .get('Reviews', ['review_text', "product_id"])
    .with_where({
        "path": ["product_id"],
        "operator": "Equal",
        "valueText": product_id
    })
    .with_generate(grouped_task=generate_prompt)
    .do()["data"]["Get"]["Reviews"]
)
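Indexing straight into the response fails with a KeyError or IndexError if the query returns an error or no reviews match the filter. As a minimal sketch, the hypothetical helper `extract_grouped_result` below takes the raw dictionary returned by .do() (before any indexing) and checks those cases first. The top-level "errors" key and the "error" field next to "groupedResult" are assumptions about the response shape, and `sample_response` is made-up data in that shape.

```python
def extract_grouped_result(response: dict, class_name: str = "Reviews") -> str:
    """Return the grouped generation result from a raw query response."""
    # Assumption: failed queries carry a top-level "errors" entry.
    if response.get("errors"):
        raise RuntimeError(f"Query failed: {response['errors']}")

    objects = response["data"]["Get"][class_name]
    if not objects:
        raise ValueError("No objects matched the filter, so nothing was generated")

    # The grouped result is read from the first returned object,
    # matching the access path used in this article.
    generate = objects[0]["_additional"]["generate"]
    if generate.get("error"):
        raise RuntimeError(f"Generation failed: {generate['error']}")
    return generate["groupedResult"]


# Made-up response in the assumed shape, for illustration only.
sample_response = {
    "data": {
        "Get": {
            "Reviews": [
                {
                    "product_id": "P1",
                    "review_text": "Great tone.",
                    "_additional": {
                        "generate": {
                            "error": None,
                            "groupedResult": "Customers like the tone.",
                        }
                    },
                },
                {
                    "product_id": "P1",
                    "review_text": "Cable is too short.",
                    "_additional": {"generate": None},
                },
            ]
        }
    }
}

summary_text = extract_grouped_result(sample_response)
```

Raising an explicit error for an empty result makes it easier to spot products whose ID did not match any stored review.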
Once a review summary is generated, store it together with the product ID in a new data collection called Products.
new_review_summary = {
    "product_id": product_id,
    "summary": summary[0]["_additional"]["generate"]["groupedResult"]
}

# Create new object
client.data_object.create(
    data_object=new_review_summary,
    class_name="Products",
    uuid=generate_uuid5(new_review_summary)
)
If you want to take this step further, you could also add a cross-reference between the product review summary in the summary class and the product review in the review class (see Generative Feedback Loops for more details).
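As a starting point for that extension, the summary collection could be given an explicit schema with a cross-reference property. The definition below is a hedged sketch rather than the article's setup: the property name `hasReviews` and the choice of "vectorizer": "none" are assumptions made for illustration, while the class and the other property names follow the code above.

```python
# Hypothetical schema for the summary collection. "hasReviews" and the
# "none" vectorizer are assumptions; the remaining names come from the article.
products_class = {
    "class": "Products",
    "vectorizer": "none",  # summaries are stored without being vectorized
    "properties": [
        {"name": "product_id", "dataType": ["text"]},
        {"name": "summary", "dataType": ["text"]},
        # A cross-reference property uses the target class name as its data type.
        {"name": "hasReviews", "dataType": ["Reviews"]},
    ],
}

# Collect the property names to double-check the definition before creating it.
property_names = [prop["name"] for prop in products_class["properties"]]
```

Such a definition would be passed to client.schema.create_class() in the same way as the Reviews class, before any summaries are stored.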