![](https://crypto4nerd.com/wp-content/uploads/2023/05/0d-iAojaOuqUdM0Z-.jpg)
A list of data and AI newsletters to stay up-to-date
Accuracy is a bit naive since it assigns a value of 1 to correct predictions and zero cost to errors. The F1-score, on the other hand, is more like a black box: you would need to reverse-engineer it to recover its implied value matrix. The author suggests using a custom value matrix tailored to your specific application, with values set according to the actual economic impact of each outcome (see the sketch below).
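As a minimal sketch of the idea, here is a value-weighted score computed from a scikit-learn confusion matrix; the dollar values in the matrix are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# rows = actual class, columns = predicted class (same layout as confusion_matrix)
value_matrix = np.array([
    [  1.0,  -5.0],   # actual negative: TN earns 1, FP costs 5
    [-20.0,  10.0],   # actual positive: FN costs 20, TP earns 10
])

cm = confusion_matrix(y_true, y_pred)
print("Total economic value:", (cm * value_matrix).sum())
```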
Pipeline: easy to get started with transformers
Tasks: sentiment analysis, text generation (with max_length), zero-shot classification (provide candidate labels and let the model choose)
Tokenizer: tokenize (split text into tokens), convert_tokens_to_ids (map tokens to token ids)
Save/load custom model to/from directory: save_pretrained(directory), from_pretrained(directory)
tokenizer(text) returns input_ids including the starting and ending special-token ids (otherwise the same as the convert_tokens_to_ids output), plus an attention_mask
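A quick sketch of the pipeline and tokenizer APIs summarized above:

```python
from transformers import AutoTokenizer, pipeline

# high-level pipelines
classifier = pipeline("sentiment-analysis")
print(classifier("I love this library!"))

zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This is a course about transformers",
                candidate_labels=["education", "politics", "business"]))

# lower-level tokenizer API
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Using a Transformer network is simple")
ids = tokenizer.convert_tokens_to_ids(tokens)
inputs = tokenizer("Using a Transformer network is simple")  # adds special-token ids and attention_mask
print(tokens, ids, inputs["input_ids"], inputs["attention_mask"])
```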
Hugging Face offerings overview (e.g., the transformers library; the Model Hub for NLP, ViT, and speech models; datasets; the hosted inference API; Spaces for showcasing ML apps)
- Fixed-Length Chunks: split the text into fixed-length chunks of equal size. For example, if you have a text of 4000 tokens and you decide to use 500-token chunks, you will end up with 8 chunks. This approach is most straightforward but may result in chunks that break sentences or paragraphs in unnatural places.
- Sentence-based Chunks: split the text at the end of each sentence. By dividing the text based on sentence boundaries, you ensure that the chunks are grammatically coherent and maintain the flow of information. However, this method may result in chunks of varying lengths, and some sentences might be split across multiple chunks.
- Paragraph-based Chunks: Similar to sentence-based chunks, you can split the text at paragraph boundaries. This approach helps maintain the contextual integrity of the text and ensures that each chunk contains complete paragraphs. However, as with sentence-based chunks, the lengths of the chunks may vary.
- In the above strategies, neighboring sentences may end up in different segments, causing a context-fragmentation problem. A straightforward solution is to let the segments overlap (see the sketch after this list).
- Subheading-based Chunks: If the long text has subheadings or section headings, you can use them as natural breakpoints to split the text. This strategy ensures that each chunk corresponds to a specific topic or subtopic within the text, making it easier to maintain coherence and relevance within each chunk. It works well on well-organized documents as long as each section stays under the token limit; oversized sections will still need further splitting.
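Here is a minimal sketch of fixed-length chunking with overlap (the first strategy plus the overlap fix); chunk_size and overlap are measured in whitespace-separated words for simplicity, not model tokens:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # step forward by (chunk_size - overlap) so consecutive chunks share context
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

chunks = chunk_text("some long document ... " * 1000, chunk_size=500, overlap=50)
print(len(chunks), len(chunks[0].split()))
```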
Large Language Models
The author mentions some project ideas for LLMs:
- Cover letter generator, to practice prompt engineering and prompt templates
- Personalized chatbot with your own data
- YouTube or podcast summarizer
- Web scraper/information extractor
- Cognitive search over your own documents
- Question answering over your own documents
- Clustering documents into topics or categories
ChatGPT beats traditional sentiment analysis models and can explain its decisions.
LLMs recast many natural language tasks as text generation, i.e., next-token prediction (shown in code after this list):
- “Identify whether this sentence has a positive or negative sentiment: <sentence>”
- “Translate the following sentence from English into French: <sentence>”
- “Summarize the following article: <article>”
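The same pattern in code, as a hedged sketch using the legacy (pre-1.0) OpenAI completion API; the model choice here is illustrative:

```python
import openai  # pre-1.0 client

for task_prompt in [
    "Identify whether this sentence has a positive or negative sentiment: I loved the movie.",
    "Translate the following sentence from English into French: The weather is nice today.",
]:
    resp = openai.Completion.create(model="text-davinci-003",
                                    prompt=task_prompt, max_tokens=64)
    print(resp["choices"][0]["text"].strip())
```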
However, specialized LLMs are still needed for:
- Alignment (preventing the LLM from being racist; teaching the model to follow and execute human directions; avoiding the generation of factually incorrect output)
- Domain Specialization
Examples
Codex is an LLM specialized for code
LaMDA (Language Model for Dialogue Applications)
OpenAI released a tool to visualize neurons in LLMs for explainability.
Examples of hallucinations in LLMs:
- Factual Inaccuracies: The LLM produces a statement that is factually incorrect.
- Unsupported Claims: The LLM generates a response that has no basis in the input or context.
- Nonsensical Statements: The LLM produces a response that doesn’t make sense or is unrelated to the context.
- Improbable Scenarios: The LLM generates a response that describes an implausible or highly unlikely event.
A reference for evaluation metrics across NLP tasks such as language modeling, text classification and sentiment analysis, machine translation, text summarization, named entity recognition, and question answering.
References for hallucination evaluation (active research area): Fact-checking Evaluation, Groundedness Evaluation, Reference-based Evaluation, Human Evaluation, Adversarial Evaluation, Contrastive Evaluation, Counterfactual Evaluation, Negative Training Examples, Evaluation Metrics that Penalize Hallucination, Fine-grained Evaluation, Safety Evaluation
The article presents the GPT model family tree, covers the three architectures (encoder-only, decoder-only, encoder-decoder), and explains why the decoder-only/GPT style is winning (even though the layers it credits also exist in encoder-decoder models).
Google Research on supporting context windows of up to 64,000 tokens (compared to GPT-4's 32,000 tokens).
The authors reviewed the GPT-4 technical report for contamination in OpenAI's evaluations: e.g., 30% of the LSAT evaluation data appeared in the training data (like a student seeing exam questions before taking the test), and the 39% of questions that were removed may have contained the most difficult ones, so we don't know whether a score of 167 is good or bad on the remaining 61% of the LSAT.
Open-source GPT-3-style models by EleutherAI:
March 2021, GPT-Neo: 2.7B parameters
June 2021, GPT-J: 6B
February 2022, GPT-NeoX: 20B
The table in the article shows GPT-NeoX scoring 3%-10% lower than OpenAI's Davinci (GPT-3, 175B) on NLP benchmarks.
Open-source GPT models and NLP task benchmarks
GPT-J and GPT-NeoX vs. GPT-3 on NLP tasks, e.g., HellaSwag, TriviaQA, OpenBookQA
H2OGPT; you can follow the open-source repo to reproduce it:
- Open-source repository with fully permissive, commercially usable code, data, and models
- Code for preparing large open-source datasets as instruction datasets for fine-tuning large language models (LLMs), including prompt engineering
- Code for fine-tuning large language models (currently up to 20B parameters) on commodity hardware and enterprise GPU servers (single or multi-node)
- Code to run a chatbot on a GPU server, with a shareable end-point with Python client API
- Code to evaluate and compare the performance of fine-tuned LLMs
Some open-source chat models licensed for commercial usage (so no LLaMA-based models), e.g., OpenAssistant, GPT4All-J, Dolly, MPT-7B, RedPajama.
MPT-7B comes in Base/Instruct/StoryWriter/Chat variants; MPT-7B-Instruct is the instruction-following model. This new aggregate dataset, released here, was used to finetune MPT-7B, resulting in MPT-7B-Instruct, which is commercially usable. Anecdotally, we find MPT-7B-Instruct to be an effective instruction-follower. With its extensive training on 1 trillion tokens, MPT-7B-Instruct should be competitive with the larger dolly-v2-12b, whose base model, Pythia-12B, was only trained on 300 billion tokens. The context window can be up to 65k tokens, much larger than other models'.
It uses the same InstructionTextGenerationPipeline as Databricks Dolly, but with one difference: you cannot pass a model_id string directly into it (it will complain that model_id is a str and therefore has no model.config attribute). To use the one from Dolly, you need code like the following:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# InstructionTextGenerationPipeline lives in instruct_pipeline.py in Databricks' Dolly repo
from instruct_pipeline import InstructionTextGenerationPipeline

model_name = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)
generate_text = InstructionTextGenerationPipeline(
    model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, max_new_tokens=50
)
```
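Hypothetical usage of the pipeline built above (the output format follows the Dolly examples):

```python
res = generate_text("Explain the difference between a list and a tuple in Python.")
print(res[0]["generated_text"])
```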
The base model of StarCoder has 15.5 billion parameters and has been trained on a trillion tokens. StarCoder has several fine-tuned models with different purposes. One such model is Starchat-alpha, which is a coding assistant for Python code generation.
Pros: a maximum prompt length of 8,000 tokens; in their evaluations on the HumanEval and MBPP benchmarks, StarCoder outperforms well-known models like PaLM and LLaMA at coding; plugins for VS Code and Jupyter.
Cons: the base model starcoderbase and the Python version starcoder are not instruction models and have not been trained to answer questions the way an instruction model would. Fortunately, the fine-tuned starchat-alpha can handle those instruction tasks with some smart prompt engineering.
Coding instruction prompt format:

```python
system_prompt = "<|system|>\nBelow is a conversation between a human user and a helpful AI coding assistant.<|end|>\n"
user_prompt = f"<|user|>\n{input_prompt}<|end|>\n"
assistant_prompt = "<|assistant|>"
full_prompt = system_prompt + user_prompt + assistant_prompt
```
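A hedged usage sketch for the assembled prompt, assuming input_prompt was set before building full_prompt (starchat-alpha is a large model, so this needs a capable GPU):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/starchat-alpha")
output = generator(full_prompt, max_new_tokens=128)
print(output[0]["generated_text"])
```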
Program-Aided Language Models (PAL) prompt format, providing few-shot examples (question and solution):
```python
prompt = '''
def solution():
    #Ques: For Halloween Debby and her sister combined the candy they received. Debby had 32 pieces of candy while her sister had 42. If they ate 35 pieces the first night, how many pieces do they have left?
    Debby_candies = 32
    sister_candies = 42
    candies_ate = 35
    return ((Debby_candies + sister_candies) - candies_ate)

def solution():
    #Ques: What are roots of the equation x^2 - 2x + 1?
    import math
    a = 1
    b = -2
    c = 1
    root1 = (-b + math.sqrt(b**2 - 4*a*c)) / (2*a)
    root2 = (-b - math.sqrt(b**2 - 4*a*c)) / (2*a)
    return root1, root2

def solution():
    #Ques: A waiter had 22 customers in his section. If 14 of them left and the rest of his tables had 4 people at each table, how many tables did he have?
    customers = 22
    customers_left = 14
    each_table = 4
    total_tables = (customers - customers_left) / each_table
    return total_tables

def solution():
    #Ques: What is the 5th number in Fibonacci sequence?
    n = 5
    a = 0
    b = 1
    if n < 0:
        print("Incorrect input")
    elif n == 0:
        return a
    elif n == 1:
        return b
    else:
        for i in range(2, n):
            c = a + b
            a = b
            b = c
        return b

print(solution())

def solution():
    #Ques: {question}
'''
```
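A hedged sketch of how such a PAL prompt is used: fill in the question, let a code-capable LLM (here the legacy OpenAI completion API) continue the final solution(), then execute the generated program; it assumes the model returns only the indented function body:

```python
import openai  # pre-1.0 client

question = "A class has 18 students and 6 of them leave. How many students remain?"
filled = prompt.replace("{question}", question)
resp = openai.Completion.create(model="text-davinci-003", prompt=filled,
                                temperature=0, max_tokens=256, stop=["\ndef"])
body = resp["choices"][0]["text"]   # assumed to be just the indented function body
exec("def solution():\n" + body)    # reassemble and run the generated program
print(solution())
```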
This blog post covers H2O LLM Studio, a framework and no-code GUI for fine-tuning LLMs from a CSV containing an instruction column and an output column.
LLM-powered applications
AI-powered pandas
Showed Pandas AI prompts for common data science tasks: data selection, sorting, aggregation, reshaping/pivoting, cleaning (filling missing values, removing duplicates), union, transformation/normalization, describe, and time-series analysis.
Under the hood, Pandas AI calls the OpenAI endpoint to generate pandas/matplotlib code from your natural-language query, then runs the generated code with Python's exec (see the sketch below).
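A minimal sketch of that mechanism (not Pandas AI's actual source; it ignores code-fence stripping and sandboxing for brevity, and uses the pre-1.0 OpenAI Python client):

```python
import openai  # pre-1.0 client
import pandas as pd

df = pd.DataFrame({"country": ["US", "UK", "FR"], "gdp": [21.4, 2.8, 2.7]})

def ask(df: pd.DataFrame, query: str) -> None:
    prompt = (
        f"You are given a pandas dataframe `df` with columns {list(df.columns)}.\n"
        f"Write Python code (pandas/matplotlib only) that answers: {query}\n"
        "Return only code, no explanations or markdown fences."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    code = resp["choices"][0]["message"]["content"]
    exec(code, {"df": df, "pd": pd})   # run the generated code, as Pandas AI does

ask(df, "Print the country with the highest gdp.")
```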
AI-powered scikit-learn for text analysis, use cases:
- Use LLM to classify tabular dataset
- Zero-shot classification (no labeled data needed, but the label names themselves must be expressed in natural language and be descriptive and self-explanatory)
- Zero-Shot Multi-Label Text Classification
- Text vectorization
- Text summarization
The main piece is translating the pandas dataframe into a prompt via a prompt template and passing it to GPT (a classifier sketch follows).
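For the zero-shot classification use case, a hedged sketch with Scikit-LLM's ZeroShotGPTClassifier (API as of mid-2023; check the repo for the current interface):

```python
from skllm import ZeroShotGPTClassifier
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<OPENAI_API_KEY>")

X = ["The room was clean and the staff was friendly.",
     "The food was cold and the service was slow."]
labels = ["positive", "negative"]  # labels must be descriptive natural language

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(None, labels)  # zero-shot: no labeled training data needed
print(clf.predict(X))
```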
Part 1 of a 7-part series on MLOps, introducing the importance of a feature store: a fancy database that adds the following capabilities (some overlap with DataOps, but train/validation/test splits, offline/online stores, and ML-specific feature transformations are quite specific to machine learning; see the sketch after this list):
- data versioning and lineage
- data validation
- the ability to create datasets
- the ability to hold train/validation/test splits
- two types of storage: offline (cheap, but high latency) and online (more expensive, but low latency).
- time-travel: easily access data given a time window
- hold feature transformations in addition to the features themselves
- data monitoring, etc.
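A hedged sketch of the offline/online split using Feast, one popular open-source feature store; the feature names and repo layout here are hypothetical:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo is initialized here

# offline store: point-in-time-correct training data ("time travel")
entity_df = pd.DataFrame({
    "user_id": [42, 7],
    "event_timestamp": pd.to_datetime(["2023-05-01", "2023-05-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:avg_purchase", "user_stats:num_logins"],  # hypothetical features
).to_df()

# online store: low-latency lookup at serving time
online = store.get_online_features(
    features=["user_stats:avg_purchase", "user_stats:num_logins"],
    entity_rows=[{"user_id": 42}],
).to_dict()
```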
Generating synthetic relational databases with Gretel Relational. The synthetic data can preserve referential integrity, distributions, and record-count properties you specify.
A voice-enabled chatbot built from speech2text, an LLM, LangChain, text2speech, BentoML, and Gradio; the flow is:
- User’s audio input is converted to text using speech2text (OpenAI’s Whisper, processor, model)
- The converted text is sent to the LLM for a response
- The response text is converted to audio using text2speech (speecht5_tts: processor, model, vocoder)
BentoML is used to define the runners, service, and API; Gradio creates the chatbot UI; langchain.chains manages the ConversationChain and abstracts the interaction with the LLM (in this case, OpenAI GPT), as sketched below.
User audio tensors are generated from the audio file using OpenAI's Whisper processor.
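A hedged sketch of the LangChain piece of this flow (the BentoML runners/service and the Gradio UI are omitted):

```python
from langchain.chains import ConversationChain
from langchain.llms import OpenAI

conversation = ConversationChain(llm=OpenAI(temperature=0.7))
text_in = "Hello, what can you do?"      # would come from the speech2text step
reply = conversation.predict(input=text_in)
print(reply)                             # would be passed on to the text2speech step
```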
Not every transformer model is the same. There are three types (a loading sketch follows these lists):
Encoder-decoder: the encoder (on the left) processes the input sequence and generates a hidden representation that summarizes the input information. The decoder (on the right) uses this hidden representation to generate the desired output sequence. The encoder and decoder are trained end-to-end to maximize the likelihood of the correct output sequence given the input sequence. Example models: T5, BART. Good for:
- Translation
- Text summarization
- Question and answering
Encoder-only: the input sequence is encoded into a fixed-length representation that is then used as input to a classifier or regressor to make a prediction. These models have a pre-trained general-purpose encoder but require fine-tuning of the final classifier or regressor. Example models: BERT, DistilBERT (a distilled, BERT-based model). Good for:
- Text classification
- Sentiment analysis
- Named entity recognition
Decoder-only: has no explicit encoder to summarize the input information. Instead, the information is encoded implicitly in the hidden state of the decoder, which is updated at each step of the generation process. Example models: GPT, Google LaMDA, OPT, BLOOM. Good for:
- Text completion
- Text generation
- Translation
- Question-Answering
- Generating image captions
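A short loading sketch: in the transformers library the three architectures map onto different Auto classes (the model choices here are illustrative):

```python
from transformers import (
    AutoModelForSeq2SeqLM,               # encoder-decoder, e.g. T5/BART
    AutoModelForSequenceClassification,  # encoder-only plus a classification head, e.g. BERT
    AutoModelForCausalLM,                # decoder-only, e.g. GPT-2
)

seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
```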
https://aclanthology.org/2022.bionlp-1.37.pdf
In some NLP tasks, the model is effectively asked: given the input sequence, what is the best and most likely target sequence (the one with maximum probability given the source sentence)? However, an algorithm that is too greedy at each step may miss the best overall choice: Greedy Search selects the single best candidate at each time step, and while that choice may be optimal for the current step, it can be sub-optimal once the full sentence is constructed.
The beam search algorithm instead keeps multiple alternatives for the input sequence at each time step, based on conditional probability. The number of alternatives depends on a parameter called the beam width B: at each time step, beam search keeps the B alternatives with the highest probability as the most likely choices.
Step 1: Find the top three words with the highest probability given the input sentence (three because the beam width here is 3). Step 2: Find the three best pairs of first and second words based on conditional probability. Step 3: Find the three best triples of first, second, and third words based on the input sentence and the chosen first and second words. (A toy implementation follows.)
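A toy beam search implementation over a stand-in next-token distribution; log_probs is a hypothetical stand-in for a real language model:

```python
import math

def beam_search(log_probs, start, beam_width=3, max_len=5, eos="</s>"):
    beams = [([start], 0.0)]          # each beam: (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the top-B candidates at each step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def toy_log_probs(seq):
    # stand-in for a real model: a fixed next-token distribution
    return {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)}

print(beam_search(toy_log_probs, "<s>"))
```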
Other good articles (2, 3, 4) on beam search, plus references to other Transformers articles.
Here is ChatGPT's explanation, which is also intuitive, although I cannot verify where it got this analogy:
Imagine you have a magic wand that can generate sentences for you. Let's say you want to use this wand to write a story. However, the wand can only generate one word at a time. So, you start with an initial word and want to figure out what word to generate next, and so on, until you have a complete sentence.
Now, imagine that you have a few different wands, and each wand can generate a word. This is similar to beam search, where "beam" refers to the number of wands or paths you consider at each step.
At the beginning, you start with one wand and generate the first word. Then, instead of using only one wand, you create a few more wands, maybe three or four. Each wand will generate a different word. Now you have multiple options for the second word of your sentence.
You look at the words generated by all the wands and decide which ones are the best. Maybe one of the wands generated a really interesting word, while the others produced less exciting options. You choose the most interesting word and keep it.
Now, for the third word, you create new wands based on the word you chose. Each of these new wands generates a word that could follow the chosen word. Again, you evaluate all the words generated and select the most interesting one.
You repeat this process for each subsequent word until you have a complete sentence. At each step, you consider multiple options, choose the best ones, and keep building on them. This is called beam search because you start with a small "beam" of options and keep narrowing it down until you reach the end.
The idea behind beam search is to explore different possibilities and choose the most promising ones at each step, which helps in finding better sentences or solutions in natural language processing tasks.
A generative model, but the core is still transformers and tokenization.
OCR-free transformer for document understanding
The author tested English-to-Cypher generation, translation, and simplification of complex medical concepts in the drug-research domain; GPT-3's performance clearly beats GPT-J's.
BERTopic topic modeling
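For reference, a minimal BERTopic usage sketch (the standard fit_transform API; the 20 newsgroups sample is just a convenient demo corpus):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)   # one topic id per document
print(topic_model.get_topic_info().head())        # topic sizes and keywords
```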