![](https://crypto4nerd.com/wp-content/uploads/2023/06/1QCd_gSVSf3N3KCsGcWYL8w.png)
Splitter, Tokeniser, Embedding, and LLM
Once the raw PDF has been loaded into an in-memory list, it is chunked by a splitter and each chunk is embedded into a high-dimensional vector. Finally, the vectors are stored in a vector database.
Here, we are going to try both the OpenAI and Google Vertex AI implementations. The code is:
```python
from langchain.llms import OpenAI, VertexAI
from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings, VertexAIEmbeddings
from langchain.vectorstores import Chroma


def build_qa_chain(platform: str = 'openai', chunk_size: int = 1000, chunk_overlap: int = 50) -> RetrievalQA:
    if platform == 'openai':
        embedding = OpenAIEmbeddings()
        # Measure chunk length in tokens so chunks respect the model's context limit
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        # splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
        llm = OpenAI(model_name="text-davinci-003",
                     temperature=0.9,
                     max_tokens=256)
    elif platform == 'palm':
        embedding = VertexAIEmbeddings()
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        llm = VertexAI(model_name="text-bison@001",
                       project='<your own GCP project_id>',
                       temperature=0.9,
                       top_p=0,
                       top_k=1,
                       max_output_tokens=256)
```
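The snippet above stops after constructing the embedding, splitter, and LLM. Below is a minimal sketch of how the rest of the chain could be assembled from those pieces, assuming `documents` is the in-memory list produced by the PDF loader earlier; the wiring (function name, `chain_type` choice) is illustrative, not necessarily the article's exact code:

```python
def assemble_chain(llm, embedding, splitter, documents) -> RetrievalQA:
    # Chunk, embed, and store the documents in a Chroma vector database
    index = VectorstoreIndexCreator(
        embedding=embedding,
        text_splitter=splitter,
        vectorstore_cls=Chroma,
    ).from_documents(documents)
    # Wrap the vector store's retriever in a RetrievalQA chain
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # retrieved chunks are stuffed into the prompt
        retriever=index.vectorstore.as_retriever(),
    )
```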
Here, we will use the OpenAI text-davinci-003 model and the Google PaLM text model text-bison@001 for comparison. Both models have input token length limits: 4,097 tokens for text-davinci-003 and 8,192 for text-bison@001.
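Since these limits are hard caps on prompt plus completion, it can be worth counting tokens before making a call. Here is a small sketch using OpenAI's tiktoken library; the prompt string is just a placeholder:

```python
import tiktoken

# text-davinci-003 uses the p50k_base encoding
enc = tiktoken.encoding_for_model("text-davinci-003")

prompt = "Context chunk...\n\nQuestion: What is this paper about?"
n_tokens = len(enc.encode(prompt))

# The 4,097-token limit covers prompt *and* completion, so leave
# headroom for the 256 tokens we allow the model to generate.
assert n_tokens + 256 <= 4097, "prompt would overflow the context window"
```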
Because eventually we will need to send the chunks along with the prompt and question to the LLM, we must limit the size of the sliced chunks. We implement this by using:
```python
CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
```
We can specify the chunk size in tokens and how much consecutive chunks should overlap, so that no information is lost at the chunk boundaries. The embedding is done by either OpenAIEmbeddings() or VertexAIEmbeddings(), according to the chosen platform.
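To see what the splitter actually produces, here is a self-contained sketch; the sample text is invented purely to force several chunks:

```python
from langchain.text_splitter import CharacterTextSplitter

# Invented sample text, long enough to be split into multiple chunks
long_document_text = "LangChain splits long documents into smaller chunks.\n\n" * 500

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,   # maximum tokens per chunk
    chunk_overlap=50,  # tokens repeated between consecutive chunks
)

chunks = splitter.split_text(long_document_text)
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:60]!r}")
```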
For the OpenAI environment, the default embedding model is text-embedding-ada-002, which produces a vector of 1536 dimensions. For the GCP Vertex AI environment, the default embedding model is embedding-gecko-001, which produces a vector of 768 dimensions.
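You can verify the dimensionality directly. A quick sketch for the OpenAI side, assuming OPENAI_API_KEY is set in the environment (the Vertex AI call is analogous with VertexAIEmbeddings):

```python
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()                # defaults to text-embedding-ada-002
vector = embedding.embed_query("hello world")
print(len(vector))                            # -> 1536
```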
Please note that embedding models accept much shorter inputs than the completion models. For example, embedding-gecko-001 was optimised for embedding inputs of up to 1,024 tokens, while the limit for text-embedding-ada-002 is 8,191 tokens. This is another important factor to consider when choosing the right chunk size.
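Putting the two limits together: a usable chunk size has to fit within the embedding model's input limit and still leave room in the completion model's context window for the prompt template, the question, and the answer. A rough, purely illustrative budget; the overhead and chunk-count figures below are assumptions, not measurements:

```python
# All figures below are assumptions for illustration only.
llm_context_limit = 4097     # text-davinci-003 (prompt + completion)
embedding_limit = 8191       # text-embedding-ada-002 input limit
prompt_overhead = 200        # assumed tokens for template + question
answer_budget = 256          # max_tokens reserved for the answer
chunks_per_prompt = 3        # assumed number of retrieved chunks per prompt

max_chunk_size = min(
    embedding_limit,
    (llm_context_limit - prompt_overhead - answer_budget) // chunks_per_prompt,
)
print(max_chunk_size)        # 1213 with these assumptions
```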