Global metrics are not enough to know if your automatic speech recognition model works well in your use case. There are three steps that can help you get a better feel for how robust your model really is:
- Check which features cause model failures (uni/bivariate feature check)
- Check which data slices (high-dimensional feature combinations) cause model failures
- Check which hidden data slices cause model failures based on embeddings of your raw data
If you want to do the same on your data, consider using our library sliceguard, or visualize your data using our data curation software Spotlight.
Modern automatic speech recognition models like OpenAI’s Whisper promise great performance while generalizing very well to a wide variety of languages, speakers, and recording scenarios. This makes it tempting to just throw those models at your data and hope for the best. However, can we really trust those awesome global performance metrics? In this post, we will find out, and additionally give you a tutorial on how to evaluate automatic speech recognition models for your use case.
The focus here will lie on identifying critical data slices that cause model failures, looking beyond global evaluation metrics like word error rate.
Concretely, we will walk you through this by performing an evaluation of OpenAI’s Whisper (Medium) model on a subset (50k samples) of Mozilla’s Common Voice dataset.
We will do this by leveraging our Open Source Library sliceguard, which is built to automatically detect critical data slices while offering interactive reporting functionality using renumics-spotlight.
It processes Pandas DataFrames and for this tutorial we will use the following format:
| sentence | age | gender | accent | prediction | audio |
|:--------------|:---------|:-------|:------------|:--------------|:------|
| This posit... | twenties | male | Canadian... | This posit... | 1.wav |
| Multiple M... | sixties | female | United S... | Multiple M... | 2.wav |
| The name H... | twenties | male | United S... | The name H... | 3.wav |
| The death ... | twenties | male | United S... | The death ... | 4.wav |
| The study ... | twenties | female | India a... | The study ... | 5.wav |
Convert your data to match the described format if you want to follow along on your own data. Note that age, gender, and accent are just examples of scalar, structured features; in your case these can, of course, vary, and you can easily adapt this tutorial. The most important columns are sentence, containing the ground truth transcription, and prediction, containing the model prediction.
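As a reference, here is how a minimal dataframe in this format could be assembled with pandas. The sentences and file names below are made-up placeholders, not real Common Voice samples:

```python
import pandas as pd

# Toy stand-in for the evaluation dataframe: sentence holds the ground-truth
# transcript, prediction the model output, and audio the path to the recording.
df = pd.DataFrame(
    {
        "sentence": ["the cat sat on the mat", "pack my box with jugs"],
        "age": ["twenties", "sixties"],
        "gender": ["male", "female"],
        "accent": ["Canadian English", "United States English"],
        "prediction": ["the cat sat on the mat", "pack my box with jug"],
        "audio": ["1.wav", "2.wav"],
    }
)
print(df.columns.tolist())
```

For your own data, only the column names and the general shape matter; how you fill the dataframe (from a metadata CSV, a dataset loader, etc.) is up to you.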
The first thing you want to do is check whether certain feature values cause your model to fail. For the univariate and bivariate case, this is easiest done by looking at your data from a feature perspective. If you have not done so yet, install sliceguard, which will do the hard work for us:
```shell
pip install sliceguard
```
To run a feature-based check, just call the find_issues method with a single feature:
```python
from sliceguard import SliceGuard

# wer_metric is any function of the form metric_function(y_true, y_pred),
# e.g. a word error rate computed with jiwer
sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["accent"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min",
    min_support=10,
    min_drop=0.04,
)
```
The method takes the following arguments:
- df is the dataframe containing all data
- accent is the feature column to run the check on
- sentence is the ground truth column
- prediction is the prediction column
- wer_metric is the metric function in the form metric_function(y_true, y_pred), following the scikit-learn convention
- metric_mode determines whether the metric should be minimized ("min", as for word error rate) or maximized ("max")
- min_support is the minimum number of samples a slice must contain to be reported
- min_drop is the minimum metric degradation, relative to the overall dataset, for a slice to be reported
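The tutorial assumes a wer_metric function is already defined. A common choice is to wrap jiwer.wer; if you prefer to avoid the extra dependency, a minimal sketch using a word-level edit distance could look like this (note that, unlike jiwer's corpus-level aggregation, this simply averages the per-utterance error rates):

```python
def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, start=1):
            cost = 0 if rw == hw else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution
            )
        prev = cur
    return prev[len(h)] / max(len(r), 1)


def wer_metric(y_true, y_pred):
    """Average word error rate in the scikit-learn metric_function(y_true, y_pred) form."""
    return sum(word_error_rate(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)
```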
The method will perform all the necessary preprocessing (e.g. one-hot encoding categorical features) and return a dataframe where problematic samples are marked in the following format:
| issue | issue_metric | issue_explanation |
|:------|:-------------|:----------------------|
| 2 | 0.15 | 0.15 -> accent (1.00) |
| 2 | 0.15 | 0.15 -> accent (1.00) |
| -1 | NaN | |
| 2 | 0.15 | 0.15 -> accent (1.00) |
| 1 | 0.19 | 0.19 -> accent (1.00) |
Basically, issue is an identifier for each critical data slice, issue_metric is the metric computed on that slice, and issue_explanation names the features most likely causing the issue. An issue value of -1 means the sample does not belong to any detected slice.
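Assuming the issue dataframe shares the index of the input dataframe (as in the report above), one quick way to rank the detected slices is to join it back and aggregate. The tiny dataframes here are just stand-ins for the real data:

```python
import pandas as pd

# Stand-ins mirroring the report format above
df = pd.DataFrame(
    {
        "sentence": ["a", "b", "c", "d", "e"],
        "accent": ["Scottish", "Scottish", "US", "Scottish", "Indian"],
    }
)
issue_df = pd.DataFrame(
    {
        "issue": [2, 2, -1, 2, 1],
        "issue_metric": [0.15, 0.15, None, 0.15, 0.19],
    }
)

merged = pd.concat([df, issue_df], axis=1)
# Keep only rows assigned to a detected slice (issue == -1 means "no issue")
slices = merged[merged["issue"] != -1]
# Rank slices by their metric value, worst first
worst = (
    slices.groupby("issue")["issue_metric"]
    .agg(["mean", "size"])
    .sort_values("mean", ascending=False)
)
print(worst)
```

This gives a compact overview of slice severity and size before diving into the interactive report.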
Note that there are also certain parameters that significantly influence which kinds of issues the library shows you. Low values for min_support and high values for min_drop will tend to surface outliers and significantly underrepresented samples with really bad model performance. High values for min_support and lower values for min_drop will tend to surface fairness issues such as unwanted bias. For a complete analysis, incorporate both!
An interactive report can be viewed by calling sg.report().
This will open Renumics Spotlight for exploring the detected issues.
By repeating the analysis for the three features age, gender, and accent, we get the following main insights:
- There are certain English accents like Scottish and Liverpool English that are transcribed significantly worse.
- There is a large portion of the dataset spoken by people with an Indian/Pakistani accent where the model performs worse.
- There are certain outliers like Israeli English that are very underrepresented and are thus also recognized badly.
- There are certain age groups towards the edge of the range that work worse, such as the age classes “teens” and “eighties”.
Our univariate analysis may already have been helpful; in practice, however, there are often groups that are characterized by a combination of multiple features and cannot be detected that easily. With sliceguard this is not a problem, as we can simply supply it with a list of multiple features. Let’s say we run our analysis with age, gender, and accent all at once:
```python
issue_df = sg.find_issues(
    df,
    ["age", "gender", "accent"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min",
    feature_types={"age": "ordinal"},
    feature_orders={
        "age": [
            "teens",
            "twenties",
            "thirties",
            "fourties",  # sic — this is the label used in Common Voice
            "fifties",
            "sixties",
            "seventies",
            "eighties",
            "nineties",
        ]
    },
    min_support=10,
    min_drop=0.04,
)
sg.report()
```
By analyzing the detected issues in Spotlight and using the generated explanations as a first hint, we suddenly get a much more detailed view of the data. Some insights we gain are:
- The problems that can be seen with Scottish accents are mostly caused by one Scottish female speaker in her forties. For other speakers, the accuracy drop is not as high.
- There is a cluster of female speakers with an Indian/Pakistani accent that is recognized far worse than their male counterparts.
- There is a fairly large cluster of teenage speakers with an Australian accent that performs quite badly compared to other Australian speakers. Similar issues also exist for other accents, such as Singaporean English.
As you can see, while univariate analysis is an important part, without a higher-dimensional analysis we are prone to drawing wrong conclusions and missing the big picture. Thus, you should conduct both!
Now it gets even more interesting. While it is relatively easy to conduct these checks on structured data with explicitly given feature values, finding hidden data slices poses an even harder challenge. This mostly applies to unstructured data such as images, audio, or text, which cannot be sufficiently characterized by single feature values or metadata. But embeddings to the rescue! In sliceguard you can simply run checks on columns containing raw data as well:
```python
issue_df = sg.find_issues(
    df,
    ["sentence"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min",
    min_support=10,
    min_drop=0.04,
)
sg.report()
```
sentence here contains the raw text, but the column could just as well contain paths to images or audio files. Sliceguard will simply compute embeddings automatically and run its checks on them. We can also run the same check on the audio column.
This gives us the following insights:
- There are certain names of locations or persons that are misspelled by Whisper.
- Quotes are not correctly put in quotation marks by Whisper which causes a drop in the evaluation metric.
- Especially the problem of misspelling certain difficult names of locations/people increases if the audio recording is of bad quality or the speaker has a strong accent. There are also some extreme outliers where the audio is simply really bad, or there are other speakers in the room that cause the model to fail.
- Some speakers apparently decided it would be good to whisper certain recordings instead of speaking in a normal voice. This causes a significant accuracy drop.
As you can see, it can definitely make sense to check whether the capabilities of the model really match your expected production setting.
If you feel like this tutorial can speed up your model debugging, just give the library a try by visiting our GitHub Repo. Also, let us know if you find any issues or have any feature requests that would make it applicable to your use case.
We look forward to making the library more mature and will soon also publish a public roadmap as well as more example applications. Let us know if you have any ideas for that.