![](https://crypto4nerd.com/wp-content/uploads/2023/05/0d-iAojaOuqUdM0Z-.jpg)
A list of data and AI newsletters to stay up-to-date
Accuracy is a bit naive since it assigns a value of 1 to correct predictions and zero cost to errors. The F1-score, on the other hand, is more like a black box: you would need to reverse-engineer it to recover its implied value matrix. The author suggests using a custom value matrix tailored to your specific application, with values set according to the actual economic impact of each outcome (see the sketch below).
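As a minimal sketch of the idea, here is a value-weighted score computed from a scikit-learn confusion matrix; the dollar values in the matrix are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# rows = actual class, columns = predicted class (same layout as confusion_matrix)
value_matrix = np.array([
    [  1.0,  -5.0],   # actual negative: TN earns 1, FP costs 5
    [-20.0,  10.0],   # actual positive: FN costs 20, TP earns 10
])

cm = confusion_matrix(y_true, y_pred)
print("Total economic value:", (cm * value_matrix).sum())
```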
Pipeline: easy to get started with transformers
Tasks: sentiment analysis, text generation (with max_length), zero-shot classification (provide candidate labels and let the model choose)
Tokenizer: tokenize (split text into tokens), convert_tokens_to_ids (map tokens to token ids)
Save/load custom model to/from directory: save_pretrained(directory), from_pretrained(directory)
tokenizer(text) returns input_ids including the starting and ending special-token ids (otherwise the same as the convert_tokens_to_ids output), plus an attention_mask
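A quick sketch of the pipeline and tokenizer APIs summarized above:

```python
from transformers import AutoTokenizer, pipeline

# high-level pipelines
classifier = pipeline("sentiment-analysis")
print(classifier("I love this library!"))

zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This is a course about transformers",
                candidate_labels=["education", "politics", "business"]))

# lower-level tokenizer API
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Using a Transformer network is simple")
ids = tokenizer.convert_tokens_to_ids(tokens)
inputs = tokenizer("Using a Transformer network is simple")  # adds special-token ids and attention_mask
print(tokens, ids, inputs["input_ids"], inputs["attention_mask"])
```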
Hugging Face offerings overview (e.g., the transformers library; the Model Hub for NLP, ViT, and speech models; datasets; the hosted inference API; Spaces for showcasing ML apps)
- Fixed-Length Chunks: split the text into fixed-length chunks of equal size. For example, if you have a text of 4000 tokens and you decide to use 500-token chunks, you will end up with 8 chunks. This approach is most straightforward but may result in chunks that break sentences or paragraphs in unnatural places.
- Sentence-based Chunks: split the text at the end of each sentence. By dividing the text based on sentence boundaries, you ensure that the chunks are grammatically coherent and maintain the flow of information. However, this method may result in chunks of varying lengths, and some sentences might be split across multiple chunks.
- Paragraph-based Chunks: Similar to sentence-based chunks, you can split the text at paragraph boundaries. This approach helps maintain the contextual integrity of the text and ensures that each chunk contains complete paragraphs. However, as with sentence-based chunks, the lengths of the chunks may vary.
- In the above strategies, neighboring sentences may end up in different segments, causing a context-fragmentation problem. A straightforward solution is to let the segments overlap (see the sketch after this list).
- Subheading-based Chunks: If the long text has subheadings or section headings, you can use them as natural breakpoints to split the text. This strategy ensures that each chunk corresponds to a specific topic or subtopic within the text, making it easier to maintain coherence and relevance within each chunk. It works well on well-organized documents as long as each section stays under the token limit; oversized sections will still need further splitting.
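Here is a minimal sketch of fixed-length chunking with overlap (the first strategy plus the overlap fix); chunk_size and overlap are measured in whitespace-separated words for simplicity, not model tokens:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # step forward by (chunk_size - overlap) so consecutive chunks share context
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

chunks = chunk_text("some long document ... " * 1000, chunk_size=500, overlap=50)
print(len(chunks), len(chunks[0].split()))
```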
Large Language Models
The author mentions some project ideas for LLMs:
- Cover letter generator, to practice prompt engineering and prompt templates
- Personalized chatbot with your own data
- YouTube or podcast summarizer
- Web scraper/information extractor
- Cognitive search over your own documents
- Question answering over your own documents
- Clustering documents into topics or categories
ChatGPT beats traditional sentiment analysis models and can explain its decisions.
LLMs recast many natural language tasks as text generation, i.e., next-token prediction (shown in code after this list):
- “Identify whether this sentence has a positive or negative sentiment: <sentence>”
- “Translate the following sentence from English into French: <sentence>”
- “Summarize the following article: <article>”
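The same pattern in code, as a hedged sketch using the legacy (pre-1.0) OpenAI completion API; the model choice here is illustrative:

```python
import openai  # pre-1.0 client

for task_prompt in [
    "Identify whether this sentence has a positive or negative sentiment: I loved the movie.",
    "Translate the following sentence from English into French: The weather is nice today.",
]:
    resp = openai.Completion.create(model="text-davinci-003",
                                    prompt=task_prompt, max_tokens=64)
    print(resp["choices"][0]["text"].strip())
```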
However, specialized LLMs are still needed for:
- Alignment (preventing the LLM from being racist; teaching the model to follow and execute human directions; avoiding the generation of factually incorrect output)
- Domain Specialization
Examples
Codex is an LLM specialized for code
LaMDA (Language Model for Dialogue Applications)
OpenAI released a tool to visualize neurons in LLMs for explainability.
Examples of hallucinations in LLMs:
- Factual Inaccuracies: The LLM produces a statement that is factually incorrect.
- Unsupported Claims: The LLM generates a response that has no basis in the input or context.
- Nonsensical Statements: The LLM produces a response that doesn’t make sense or is unrelated to the context.
- Improbable Scenarios: The LLM generates a response that describes an implausible or highly unlikely event.
A reference for evaluation metrics across NLP tasks such as language modeling, text classification and sentiment analysis, machine translation, text summarization, named entity recognition, and question answering.
References for hallucination evaluation (active research area): Fact-checking Evaluation, Groundedness Evaluation, Reference-based Evaluation, Human Evaluation, Adversarial Evaluation, Contrastive Evaluation, Counterfactual Evaluation, Negative Training Examples, Evaluation Metrics that Penalize Hallucination, Fine-grained Evaluation, Safety Evaluation
The article presents the GPT model family tree, covers the three architectures (encoder-only, decoder-only, encoder-decoder), and explains why the decoder-only/GPT style is winning (even though the layers it credits also exist in encoder-decoder models).
Google Research on supporting context windows of up to 64,000 tokens (compared to GPT-4's 32,000 tokens).
The authors reviewed the GPT-4 technical report for contamination in OpenAI's evaluations: e.g., 30% of the LSAT evaluation data appeared in the training data (like a student seeing exam questions before taking the test), and the 39% of questions that were removed may have contained the most difficult ones, so we don't know whether a score of 167 is good or bad on the remaining 61% of the LSAT.
Open-source GPT-3-style models by EleutherAI:
March 2021, GPT-Neo: 2.7B parameters
June 2021, GPT-J: 6B
February 2022, GPT-NeoX: 20B
The table in the article shows GPT-NeoX scoring 3%-10% lower than OpenAI's Davinci (GPT-3, 175B) on NLP benchmarks.
Open-source GPT models and NLP task benchmarks
GPT-J and GPT-NeoX vs. GPT-3 on NLP tasks, e.g., HellaSwag, TriviaQA, OpenBookQA
H2OGPT; you can follow the open-source repo to reproduce it:
- Open-source repository with fully permissive, commercially usable code, data, and models
- Code for preparing large open-source datasets as instruction datasets for fine-tuning large language models (LLMs), including prompt engineering
- Code for fine-tuning large language models (currently up to 20B parameters) on commodity hardware and enterprise GPU servers (single or multi-node)
- Code to run a chatbot on a GPU server, with a shareable end-point with Python client API
- Code to evaluate and compare the performance of fine-tuned LLMs
Some open-source chat models licensed for commercial usage (so no LLaMA-based models), e.g., OpenAssistant, GPT4All-J, Dolly, MPT-7B, RedPajama.
MPT-7B comes in Base/Instruct/StoryWriter/Chat variants; MPT-7B-Instruct is the instruction-following model. This new aggregate dataset, released here, was used to finetune MPT-7B, resulting in MPT-7B-Instruct, which is commercially usable. Anecdotally, we find MPT-7B-Instruct to be an effective instruction-follower. With its extensive training on 1 trillion tokens, MPT-7B-Instruct should be competitive with the larger dolly-v2-12b, whose base model, Pythia-12B, was only trained on 300 billion tokens. The context window can be up to 65k tokens, much larger than other models'.
It uses the same InstructionTextGenerationPipeline as Databricks Dolly, but with one difference: you cannot pass a model_id string directly into it (it will complain that model_id is a str and therefore has no model.config attribute). To use the one from Dolly, you need code like the following:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# InstructionTextGenerationPipeline lives in instruct_pipeline.py in Databricks' Dolly repo
from instruct_pipeline import InstructionTextGenerationPipeline

model_name = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)
generate_text = InstructionTextGenerationPipeline(
    model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, max_new_tokens=50
)
```
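Hypothetical usage of the pipeline built above (the output format follows the Dolly examples):

```python
res = generate_text("Explain the difference between a list and a tuple in Python.")
print(res[0]["generated_text"])
```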
The base model of StarCoder has 15.5 billion parameters and has been trained on a trillion tokens. StarCoder has several fine-tuned models with different purposes. One such model is Starchat-alpha, which is a coding assistant for Python code generation.
Pros: a maximum prompt length of 8,000 tokens; in their evaluations on the HumanEval and MBPP benchmarks, StarCoder outperforms well-known models like PaLM and LLaMA at coding; plugins for VS Code and Jupyter.
Cons: the base model starcoderbase and the Python version starcoder are not instruction models and have not been trained to answer questions the way an instruction model would. Fortunately, the fine-tuned starchat-alpha can handle those instruction tasks with some smart prompt engineering.
Coding instruction prompt format:

```python
system_prompt = "<|system|>\nBelow is a conversation between a human user and a helpful AI coding assistant.<|end|>\n"
user_prompt = f"<|user|>\n{input_prompt}<|end|>\n"
assistant_prompt = "<|assistant|>"
full_prompt = system_prompt + user_prompt + assistant_prompt
```
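A hedged usage sketch for the assembled prompt, assuming input_prompt was set before building full_prompt (starchat-alpha is a large model, so this needs a capable GPU):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/starchat-alpha")
output = generator(full_prompt, max_new_tokens=128)
print(output[0]["generated_text"])
```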
Program-Aided Language Models (PAL) prompt format, providing few-shot examples (question and solution):
```python
prompt = '''
def solution():
    #Ques: For Halloween Debby and her sister combined the candy they received. Debby had 32 pieces of candy while her sister had 42. If they ate 35 pieces the first night, how many pieces do they have left?
    Debby_candies = 32
    sister_candies = 42
    candies_ate = 35
    return ((Debby_candies + sister_candies) - candies_ate)

def solution():
    #Ques: What are roots of the equation x^2 - 2x + 1?
    import math
    a = 1
    b = -2
    c = 1
    root1 = (-b + math.sqrt(b**2 - 4*a*c)) / (2*a)
    root2 = (-b - math.sqrt(b**2 - 4*a*c)) / (2*a)
    return root1, root2

def solution():
    #Ques: A waiter had 22 customers in his section. If 14 of them left and the rest of his tables had 4 people at each table, how many tables did he have?
    customers = 22
    customers_left = 14
    each_table = 4
    total_tables = (customers - customers_left) / each_table
    return total_tables

def solution():
    #Ques: What is the 5th number in Fibonacci sequence?
    n = 5
    a = 0
    b = 1
    if n < 0:
        print("Incorrect input")
    elif n == 0:
        return a
    elif n == 1:
        return b
    else:
        for i in range(2, n):
            c = a + b
            a = b
            b = c
        return b

print(solution())

def solution():
    #Ques: {question}
'''
```
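A hedged sketch of how such a PAL prompt is used: fill in the question, let a code-capable LLM (here the legacy OpenAI completion API) continue the final solution(), then execute the generated program; it assumes the model returns only the indented function body:

```python
import openai  # pre-1.0 client

question = "A class has 18 students and 6 of them leave. How many students remain?"
filled = prompt.replace("{question}", question)
resp = openai.Completion.create(model="text-davinci-003", prompt=filled,
                                temperature=0, max_tokens=256, stop=["\ndef"])
body = resp["choices"][0]["text"]   # assumed to be just the indented function body
exec("def solution():\n" + body)    # reassemble and run the generated program
print(solution())
```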
This blog post covers H2O LLM Studio, a framework and no-code GUI for fine-tuning LLMs from a CSV containing an instruction column and an output column.
LLM-powered applications
AI-powered pandas
Showed Pandas AI prompts for common data science tasks: data selection, sorting, aggregation, reshaping/pivoting, cleaning (filling missing values, removing duplicates), union, transformation/normalization, describe, and time-series analysis.
Under the hood, Pandas AI calls the OpenAI endpoint to generate pandas/matplotlib code from your natural-language query, then runs the generated code with Python's exec (see the sketch below).
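A minimal sketch of that mechanism (not Pandas AI's actual source; it ignores code-fence stripping and sandboxing for brevity, and uses the pre-1.0 OpenAI Python client):

```python
import openai  # pre-1.0 client
import pandas as pd

df = pd.DataFrame({"country": ["US", "UK", "FR"], "gdp": [21.4, 2.8, 2.7]})

def ask(df: pd.DataFrame, query: str) -> None:
    prompt = (
        f"You are given a pandas dataframe `df` with columns {list(df.columns)}.\n"
        f"Write Python code (pandas/matplotlib only) that answers: {query}\n"
        "Return only code, no explanations or markdown fences."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    code = resp["choices"][0]["message"]["content"]
    exec(code, {"df": df, "pd": pd})   # run the generated code, as Pandas AI does

ask(df, "Print the country with the highest gdp.")
```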
AI-powered scikit-learn for text analysis, use cases:
- Use LLM to classify tabular dataset
- Zero-shot classification (no labeled data needed, but the label names themselves must be expressed in natural language and be descriptive and self-explanatory)
- Zero-Shot Multi-Label Text Classification
- Text vectorization
- Text summarization
The main piece is translating the pandas dataframe into a prompt via a prompt template and passing it to GPT (a classifier sketch follows).
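For the zero-shot classification use case, a hedged sketch with Scikit-LLM's ZeroShotGPTClassifier (API as of mid-2023; check the repo for the current interface):

```python
from skllm import ZeroShotGPTClassifier
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<OPENAI_API_KEY>")

X = ["The room was clean and the staff was friendly.",
     "The food was cold and the service was slow."]
labels = ["positive", "negative"]  # labels must be descriptive natural language

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(None, labels)  # zero-shot: no labeled training data needed
print(clf.predict(X))
```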
Part 1 of a 7-part series on MLOps, introducing the importance of a feature store: a fancy database that adds the following capabilities (some overlap with DataOps, but train/validation/test splits, offline/online stores, and ML-specific feature transformations are quite specific to machine learning; see the sketch after this list):
- data versioning and lineage
- data validation
- the ability to create datasets
- the ability to hold train/validation/test splits
- two types of storage: offline (cheap, but high latency) and online (more expensive, but low latency).
- time-travel: easily access data given a time window
- hold feature transformations in addition to the features themselves
- data monitoring, etc.
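A hedged sketch of the offline/online split using Feast, one popular open-source feature store; the feature names and repo layout here are hypothetical:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo is initialized here

# offline store: point-in-time-correct training data ("time travel")
entity_df = pd.DataFrame({
    "user_id": [42, 7],
    "event_timestamp": pd.to_datetime(["2023-05-01", "2023-05-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:avg_purchase", "user_stats:num_logins"],  # hypothetical features
).to_df()

# online store: low-latency lookup at serving time
online = store.get_online_features(
    features=["user_stats:avg_purchase", "user_stats:num_logins"],
    entity_rows=[{"user_id": 42}],
).to_dict()
```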
Generating synthetic relational databases with Gretel Relational. The synthetic data can preserve referential integrity, distributions, and record-count properties you specify.
A voice-enabled chatbot built from speech2text, an LLM, LangChain, text2speech, BentoML, and Gradio; the flow is:
- User’s audio input is converted to text using speech2text (OpenAI’s Whisper, processor, model)
- The converted text is sent to the LLM for a response
- The response text is converted to audio using text2speech (speecht5_tts: processor, model, vocoder)
BentoML is used to define the runners, service, and API; Gradio creates the chatbot UI; langchain.chains manages the ConversationChain and abstracts the interaction with the LLM (in this case, OpenAI GPT), as sketched below.
User audio tensors are generated from the audio file using OpenAI's Whisper processor.
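A hedged sketch of the LangChain piece of this flow (the BentoML runners/service and the Gradio UI are omitted):

```python
from langchain.chains import ConversationChain
from langchain.llms import OpenAI

conversation = ConversationChain(llm=OpenAI(temperature=0.7))
text_in = "Hello, what can you do?"      # would come from the speech2text step
reply = conversation.predict(input=text_in)
print(reply)                             # would be passed on to the text2speech step
```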
Not every transformer model is the same. There are three types (a loading sketch follows these lists):
Encoder-decoder: the encoder (on the left) processes the input sequence and generates a hidden representation that summarizes the input information. The decoder (on the right) uses this hidden representation to generate the desired output sequence. The encoder and decoder are trained end-to-end to maximize the likelihood of the correct output sequence given the input sequence. Example models: T5, BART. Good for:
- Translation
- Text summarization
- Question and answering
Encoder-only: the input sequence is encoded into a fixed-length representation that is then used as input to a classifier or regressor to make a prediction. These models have a pre-trained general-purpose encoder but require fine-tuning of the final classifier or regressor. Example models: BERT, DistilBERT (a distilled, BERT-based model). Good for:
- Text classification
- Sentiment analysis
- Named entity recognition
Decoder-only: has no explicit encoder to summarize the input information. Instead, the information is encoded implicitly in the hidden state of the decoder, which is updated at each step of the generation process. Example models: GPT, Google LaMDA, OPT, BLOOM. Good for:
- Text completion
- Text generation
- Translation
- Question-Answering
- Generating image captions
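A short loading sketch: in the transformers library the three architectures map onto different Auto classes (the model choices here are illustrative):

```python
from transformers import (
    AutoModelForSeq2SeqLM,               # encoder-decoder, e.g. T5/BART
    AutoModelForSequenceClassification,  # encoder-only plus a classification head, e.g. BERT
    AutoModelForCausalLM,                # decoder-only, e.g. GPT-2
)

seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
```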
https://aclanthology.org/2022.bionlp-1.37.pdf
In some NLP tasks, the model is effectively asked: given the input sequence, what is the best and most likely target sequence (the one with maximum probability given the source sentence)? However, an algorithm that is too greedy at each step may miss the best overall choice: Greedy Search selects the single best candidate at each time step, and while that choice may be optimal for the current step, it can be sub-optimal once the full sentence is constructed.
The beam search algorithm instead keeps multiple alternatives for the input sequence at each time step, based on conditional probability. The number of alternatives depends on a parameter called the beam width B: at each time step, beam search keeps the B alternatives with the highest probability as the most likely choices.
Step 1: Find the top three words with the highest probability given the input sentence (three because the beam width here is 3). Step 2: Find the three best pairs of first and second words based on conditional probability. Step 3: Find the three best triples of first, second, and third words based on the input sentence and the chosen first and second words. (A toy implementation follows.)
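A toy beam search implementation over a stand-in next-token distribution; log_probs is a hypothetical stand-in for a real language model:

```python
import math

def beam_search(log_probs, start, beam_width=3, max_len=5, eos="</s>"):
    beams = [([start], 0.0)]          # each beam: (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the top-B candidates at each step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def toy_log_probs(seq):
    # stand-in for a real model: a fixed next-token distribution
    return {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)}

print(beam_search(toy_log_probs, "<s>"))
```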
Other good articles (2, 3, 4) on beam search, plus references to other Transformers articles.
Here is ChatGPT's explanation, which is also intuitive, although I cannot verify where it got this analogy:
Imagine you have a magic wand that can generate sentences for you. Let's say you want to use this wand to write a story. However, the wand can only generate one word at a time. So, you start with an initial word and want to figure out what word to generate next, and so on, until you have a complete sentence.
Now, imagine that you have a few different wands, and each wand can generate a word. This is similar to beam search, where "beam" refers to the number of wands or paths you consider at each step.
At the beginning, you start with one wand and generate the first word. Then, instead of using only one wand, you create a few more wands, maybe three or four. Each wand will generate a different word. Now you have multiple options for the second word of your sentence.
You look at the words generated by all the wands and decide which ones are the best. Maybe one of the wands generated a really interesting word, while the others produced less exciting options. You choose the most interesting word and keep it.
Now, for the third word, you create new wands based on the word you chose. Each of these new wands generates a word that could follow the chosen word. Again, you evaluate all the words generated and select the most interesting one.
You repeat this process for each subsequent word until you have a complete sentence. At each step, you consider multiple options, choose the best ones, and keep building on them. This is called beam search because you start with a small "beam" of options and keep narrowing it down until you reach the end.
The idea behind beam search is to explore different possibilities and choose the most promising ones at each step, which helps in finding better sentences or solutions in natural language processing tasks.
A generative model, but the core is still transformers and tokenization.
OCR-free transformer for document understanding
The author tested English-to-Cypher generation, translation, and simplification of complex medical concepts in the drug-research domain; GPT-3's performance clearly beats GPT-J's.
BERTopic topic modeling
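For reference, a minimal BERTopic usage sketch (the standard fit_transform API; the 20 newsgroups sample is just a convenient demo corpus):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)   # one topic id per document
print(topic_model.get_topic_info().head())        # topic sizes and keywords
```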