![](https://crypto4nerd.com/wp-content/uploads/2023/07/1lZyKWzdZcLy4EdeBVdFRWA-1024x684.jpeg)
Transformers v4.29.0, building on the concept of tools and agents, provides a natural language API on top of transformers. How do you use it? Let’s dive in, taking language learning as an example!
The “agent” here is a large language model, and we’re prompting it so that it has access to a specific set of tools.
LLMs are good at generating small samples of code, so this API takes advantage of that by prompting the LLM to give a small sample of code performing a task with a set of tools.
Tools are very simple: they’re a single function with a name and a description. We then use these tools’ descriptions to prompt the agent. Through the prompt, we show the agent how it would leverage tools to perform what was requested in the query. [Source]
What can tools do? A tool can answer a question on a given image or summarize a long text in one or a few sentences. Each tool is meant to be focused on one very simple task only.
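To make the mechanism concrete, here is a minimal sketch of how tool names and descriptions might be assembled into the agent's prompt. The names (`build_prompt`, the example tools) are mine for illustration, not the actual `transformers` internals:

```python
# Hypothetical tools: each is just a name plus a one-line description
tools = {
    "image_captioner": "Generates a caption describing a given image.",
    "summarizer": "Summarizes a long text in one or a few sentences.",
}

def build_prompt(query, tools):
    """Assemble a prompt that lists each tool's name and description,
    followed by the user's request."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return f"You have access to these tools:\n{tool_lines}\n\nTask: {query}"

print(build_prompt("Caption this image", tools))
```

The LLM then answers with a small code snippet that calls the listed tools by name, which is exactly the "small samples of code" strength mentioned above.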
Let’s run the first cell of the Google Colab notebook. Here we should choose the latest version of transformers (v4.29.0) and then, when prompted, enter a Hugging Face User Access Token.
```python
#@title Setup
transformers_version = "v4.29.0" #@param ["main", "v4.29.0"] {allow-input: true}

print(f"Setting up everything with transformers version {transformers_version}")

!pip install huggingface_hub>=0.14.1 git+https://github.com/huggingface/transformers@$transformers_version -q diffusers accelerate datasets torch soundfile sentencepiece opencv-python openai

import IPython
import soundfile as sf

def play_audio(audio):
    sf.write("speech_converted.wav", audio.numpy(), samplerate=16000)
    return IPython.display.Audio("speech_converted.wav")

from huggingface_hub import notebook_login
notebook_login()
```
Once the login is successful, we can move forward and initialize the agent, which is a large language model (LLM). OpenAI models can be used for the best results, but fully open-source models such as StarCoder or OpenAssistant are also available. In our demo, a StarCoder agent was used.
```python
#@title Agent init
agent_name = "StarCoder (HF Token)" #@param ["StarCoder (HF Token)", "OpenAssistant (HF Token)", "OpenAI (API Key)"]

import getpass

if agent_name == "StarCoder (HF Token)":
    from transformers.tools import HfAgent
    agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
    print("StarCoder is initialized 💪")
elif agent_name == "OpenAssistant (HF Token)":
    from transformers.tools import HfAgent
    agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
    print("OpenAssistant is initialized 💪")
elif agent_name == "OpenAI (API Key)":
    from transformers.tools import OpenAiAgent
    pswd = getpass.getpass('OpenAI API key:')
    agent = OpenAiAgent(model="text-davinci-003", api_key=pswd)
    print("OpenAI is initialized 💪")
```
The agent is initialized! The world is full of different objects and images, and we can describe everything we see in our native language. But what if we want to learn a foreign language? Let’s use the full power of our agent’s tools and try to describe everything we see in Spanish!
First of all, let’s take a picture and upload it.
```python
# Function to upload an image from the local machine
def upload_files():
    from google.colab import files
    uploaded = files.upload()
    for k, v in uploaded.items():
        open(k, 'wb').write(v)
    return list(uploaded.keys())

uploaded_photos = upload_files()
```
Once the image is uploaded, let’s display it.
```python
from PIL import Image

# Read the uploaded image
image = Image.open(uploaded_photos[0])
# Display it
image.show()
```
In this example, we have some delicious food on the table.
Let’s try to run the agent to generate a caption for the uploaded image (in English). If you’d like to hand objects (or previous results) to the agent, you can do so by passing a variable directly and mentioning the variable’s name between backticks. In our example, we are passing the uploaded image to the agent:
```python
# Run the agent to generate a caption for the image (in English)
what_on_my_photo = agent.run("Generate a caption for the 'image'", image=image)
```
The agent-generated caption is ‘a plate of food with eggs, hams, and toast’. Looks good!
The second step is to translate the generated caption into Spanish. If you wonder what other languages are available, check the full list on the underlying NLLB model page.
However, in practice not all of the languages claimed as supported there were working. You can use the verified list of supported languages below:
```python
list_of_supported_languages = [
    'Afrikaans', 'Akan', 'Amharic', 'Armenian', 'Assamese', 'Asturian', 'Awadhi',
    'Balinese', 'Bambara', 'Bashkir', 'Basque', 'Belarusian', 'Bemba', 'Bengali',
    'Bhojpuri', 'Bosnian', 'Buginese', 'Bulgarian', 'Catalan', 'Cebuano',
    'Central Kurdish', 'Chhattisgarhi', 'Chokwe', 'Crimean Tatar', 'Croatian',
    'Czech', 'Danish', 'Dyula', 'Dzongkha', 'Esperanto', 'Estonian', 'Ewe',
    'Faroese', 'Fijian', 'Finnish', 'Fon', 'French', 'Friulian', 'Galician',
    'Ganda', 'Georgian', 'German', 'Greek', 'Guarani', 'Gujarati', 'Haitian Creole',
    'Hausa', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Igbo', 'Indonesian',
    'Irish', 'Italian', 'Japanese', 'Javanese', 'Kabyle', 'Kamba', 'Kannada',
    'Kazakh', 'Khmer', 'Kikuyu', 'Kimbundu', 'Kinyarwanda', 'Korean', 'Kyrgyz',
    'Lao', 'Latgalian', 'Ligurian', 'Limburgish', 'Lingala', 'Lithuanian',
    'Lombard', 'Luba-Kasai', 'Luxembourgish', 'Mizo', 'North Azerbaijani',
    'Scottish Gaelic', 'South Azerbaijani', 'Spanish', 'Swedish', 'Thai', 'Welsh']
```
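Since the user will type the language name by hand, it can help to guard against unsupported or misspelled names before calling the translator. A minimal sketch (the helper name and the trimmed sample list are mine, not from the original notebook):

```python
# Trimmed sample of the verified language list from above
supported_languages = ['French', 'German', 'Japanese', 'Spanish', 'Swedish', 'Welsh']

def is_supported(language, supported=supported_languages):
    """Case-insensitive check that a typed language name is in the verified list."""
    return language.strip().casefold() in {s.casefold() for s in supported}

print(is_supported("spanish"))   # True
print(is_supported("Klingon"))   # False
```

With a check like this, you can re-prompt the user instead of sending an unsupported language name to the agent.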
To run a translation, we wrapped the agent.run command in a simple function. The target language (in our case, Spanish) should be typed in the input field.
```python
# Run the agent to translate the caption into a language of choice
def translate_captions(target_language, caption):
    return agent.run(f"Can you translate 'caption' to {target_language}?", caption=caption)

target_language = input("Select a language: ")
target_language_caption = translate_captions(target_language=target_language, caption=what_on_my_photo)
```
The translated caption is ‘un plato de comida con huevos, jamones y tostadas’. Correct!
The third step is to generate audio in Spanish. We are prompting our agent accordingly:
```python
# Run the agent to generate audio for the translated caption
audio = agent.run("Read out loud the 'sentence'", sentence=target_language_caption, language=target_language)
play_audio(audio)
```
Unfortunately, the generated audio sounds a bit artificial; it’s unlikely that a native speaker would pronounce it like this. But we have a workaround: building a custom tool for voicing!
We will add a Google Text-to-Speech tool to read the text aloud in a given language. Let’s first install all the required packages.
```python
# Install all required packages
!pip install gtts
!pip install langcodes
!pip install language_data
```
Then, let’s create a simple function to handle the voicing. Note that this function needs a language code to work correctly; we will handle converting the language name to the language code later on.
```python
from gtts import gTTS
from IPython.display import Audio

# Text voicing function
def text_to_speech(text, lang):
    tts = gTTS(text, lang=lang)
    tts.save('output.mp3')
    return Audio('output.mp3', autoplay=True)
```
Now let’s create a tool that our agent can use! All tools depend on the superclass Tool, which holds the main attributes they need. We’ll create a class that inherits from it, specifying the following:
- A name attribute that corresponds to the name of the tool itself. We will name it google_voicing_multiple_languages.
- A description attribute, which will be used to populate the prompt of the agent.
- inputs and outputs attributes. Defining these helps the Python interpreter make educated choices about types. Both are lists of expected values, which can be text, image, or audio.
- A __call__ method that contains the inference code. Here we just call the text_to_speech function written above.
```python
from transformers import Tool

class VoicingInDifferentLanguages(Tool):
    name = "google_voicing_multiple_languages"
    description = ("This is a tool that can voice a word or a phrase in a given "
                   "language. It takes a text and language as input, and returns the audio.")
    inputs = ["text", "text"]
    outputs = ["audio"]

    def __call__(self, text, language):
        return text_to_speech(text, language)
```
Let’s first test the created tool directly, passing the Spanish word ‘comida’ as text and the Spanish language code ‘es’ as language:

```python
tool = VoicingInDifferentLanguages()
tool(text='comida', language='es')
```
To pass the tool to the agent, it is recommended to instantiate the agent with the tools directly. Take note that we are passing our custom tool in a list of additional tools to the agent:
```python
from transformers.tools import HfAgent

agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool])
```
The following block of code converts a language name to its language code (Spanish -> es):

```python
# Convert the language name to a language code
import langcodes

target_language_code = str(langcodes.find(target_language))
```
Finally, let’s try to have the agent use our custom tool:
```python
agent.run("Get voicing of 'text' in 'lang'", text=target_language_caption, lang=target_language_code)
```
It sounds like a native speaker’s voice!
The last thing to do is to push the tool to the Hub so that others can benefit from it.
Note: to be able to push to the Hub, make sure that you are using a write token and that you have created a dedicated Space for pushing.
Save the code for the custom tool in a file and upload the file to Colab Files. Here is the code that should be placed in the file:
```python
from transformers import Tool
from gtts import gTTS
from IPython.display import Audio

# Text voicing function
def text_to_speech(text, lang):
    tts = gTTS(text, lang=lang)
    tts.save('output.mp3')
    return Audio('output.mp3', autoplay=True)

class VoicingInDifferentLanguages(Tool):
    name = "google_voicing_multiple_languages"
    description = ("This is a tool that can voice a word or a phrase in a given "
                   "language. It takes a text and language as input, and returns the audio.")
    inputs = ["text", "text"]
    outputs = ["audio"]

    def __call__(self, text, language):
        return text_to_speech(text, language)
```
Let’s name this file model_downloads.py, so the resulting import code looks like this:
```python
# Optional: pushing the tool to the Hub
from model_downloads import VoicingInDifferentLanguages

tool = VoicingInDifferentLanguages()
# Change the name here to your Space, like owner/space-name
tool.push_to_hub("dashapetr/google-voicing-multiple-languages")
```
We can test whether the pushed tool can be pulled and run successfully:
```python
from transformers import load_tool

loaded_tool = load_tool("dashapetr/google-voicing-multiple-languages")
loaded_tool(text='salut', language='fr')
```
In this blog post, we explored the concept of Transformers agents and tools by learning a foreign language. We tried several curated tools for image captioning, translation, and text-to-speech conversion. Afterward, we created a custom tool for better text voicing. Finally, we uploaded the custom voicing tool to the Hub.