Use LangChain to Build a Document Database for the answers you’re looking for.
Ever wonder how to find something in your notes? What if you didn’t have search? What if your notes aren’t all in one nice-to-use platform (iOS Notes, Notion, or Google Docs)? What if you have a ton of those notes? What if you don’t want to depend on the internet for all your answers?
Add visibility, versatility, and power to your entire set of documents. In this tutorial, we’ll walk through a basic workflow in Python that will allow you to bring together all the documents you’re interested in, store them in a very efficient way, and finally use ChatGPT to give you answers about your data.
Don’t worry, we won’t get too deep into the inner workings of ChatGPT or the fantastic and exciting world of attention mechanisms. We’ll focus on the concepts in charge of ingesting, storing, and using your documents to answer questions.
Even before that, let’s review how a table in a database works. Databases hold tables of data. We ask the database for data from a table. The table has an index, and if we pick an index that works well, the database gives us our data faster. If we pick an index that doesn’t work well… we’re gonna be waiting.
Vector stores are just like databases in that they store data, they have indexes, and they want to give us our information. Vector stores, however, hold our data as chunks of words rather than rows in a spreadsheet (think of a bunch of Word docs). Each chunk is related to a set of numbers (a vector). In order to get a chunk of words back, we give the vector store a set of numbers.
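To make that concrete, here’s a minimal sketch of the idea in plain Python. The store and the chunks are completely made up, and a real vector store matches *similar* vectors rather than exact ones, but the shape of the lookup is the same: hand in a vector, get back a chunk of words.

```python
# A toy "vector store": each chunk of text lives under a vector
# (a tuple of numbers). These vectors and chunks are invented
# for illustration only.
toy_store = {
    (0.1, 0.8, 0.3): "Notes about my 2019 trip to Japan",
    (0.9, 0.2, 0.5): "Grocery list: eggs, milk, coffee",
}

# Give the store a set of numbers, get the chunk of words back.
chunk = toy_store[(0.1, 0.8, 0.3)]
print(chunk)  # Notes about my 2019 trip to Japan
```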
Wow… so cool… But seriously, why does this matter? We’re getting there, I promise.
How do we choose those sets of numbers related to those chunks of words? We could just pick randomly, and each time we hit a new chunk, we get a new set of numbers. OK… but what if instead we use something smarter?
Welcome ChatGPT!
ChatGPT has done a monstrous amount of work to understand words. It knows the meaning, the context, even the grammatical positioning of words. Better still… it can give you a set of numbers that represents all of that information. WOW!!! That’s cool. Let’s store our documents (chunks of words) like that.
How do we get the data? There’s an app for that. We’ll use a library. A library is just a bunch of code that gives you a bunch of extra functionality. LangChain is that library. It gives you a bunch of easy-to-use equipment to do the heavy lifting.
Don’t worry, we won’t get too into the weeds with all the amazing functionality it has. Just a few things:
- A document loader
- A word splitter
- An embedder
The first one is exactly what it sounds like: it just loads the document so LangChain can see it. It’s that simple. There are lots of different types of loaders, but we’re really only interested in the plain text loader.
from langchain.document_loaders import TextLoader

raw_documents = TextLoader('my_file.txt').load()
That code wasn’t so bad.
The second piece of functionality, a word splitter, guess what… it splits up the words (our documents). That’s it. Instead of feeding ChatGPT your giant 5,000-word essay on ’90s memorabilia, it wants smaller bites. So that’s what we do.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
Woah… there are a couple of things we passed into that function there:
- chunk_size — this lets LangChain know we want to split the text into chunks that big (here, 1000 characters).
- chunk_overlap — tells it to overlap adjacent chunks so we don’t lose context at the boundary between chunks. Don’t worry too much about this.
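If you’re curious what splitting with overlap actually does, here’s a simplified sketch in plain Python. The function name and the tiny example string are made up; the real CharacterTextSplitter is smarter about where it cuts, but the core idea is the same: slide through the text in steps, repeating a bit of the previous chunk each time.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into pieces of at most chunk_size characters,
    where each piece repeats the last chunk_overlap characters
    of the previous one. A toy sketch of the idea, not LangChain's
    actual implementation."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Notice how each chunk starts with the tail of the one before it; that overlap is what keeps a sentence from being cut off mid-thought with no context on either side.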
The last thing we need is a set of numbers to use as an index alongside each of our chunks. This is where we need ChatGPT.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents, embeddings)
What the heck is all that? The first line just creates a way for ChatGPT to find the right set of numbers for each chunk of words. The second one: FAISS is a vector store (see above). This store has a function that creates a vector store from a set of documents and an embedding process (finding those sets of numbers). That’s it. It builds a database (vector store) from our documents and our embedding process.
So now what? We have a vector store of all of our documents, split up and indexed by those cool, meaningful sets of numbers. What do we do with that? A lot!
Since those sets of numbers carry meaning (they capture what our documents are about: the meanings, the context, how the words are used in a sentence), let’s find what our question’s set of numbers might be.
To do this we follow the same logic: embed the question to get its set of numbers. That goes to our vector store, and we get back the documents that make sense, the ones that literally have similar meanings to our question.
docs = db.similarity_search(question)
print(docs[0].page_content)
This just searches the vector store with my actual question, using the same embedding process as before to find similar items.
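Under the hood, “similar” usually means something like cosine similarity between vectors. Here’s a tiny self-contained sketch of that idea: the chunks and their vectors are invented for illustration, and we pretend the question’s vector already came back from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """How aligned two vectors are: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: each document chunk paired with a made-up vector.
chunks = [
    ((0.9, 0.1), "Recipe for sourdough bread"),
    ((0.1, 0.9), "Notes from my databases class"),
]

question_vector = (0.2, 0.8)  # pretend this came from the embedding model

# Pick the chunk whose vector points most in the question's direction.
best = max(chunks, key=lambda pair: cosine_similarity(question_vector, pair[0]))
print(best[1])  # Notes from my databases class
```

That’s the whole trick: a question about databases lands near the database notes, not the bread recipe, because their vectors point in similar directions.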
I hope this helps. Follow me for more interesting and helpful AI content.