![](https://crypto4nerd.com/wp-content/uploads/2023/11/1i57DPqrRTmga0nhvI1BOKQ.jpeg)
In the era where audio content is as ubiquitous as text, analyzing and understanding it has become crucial. Today, I’m excited to share a Python application that showcases the power of Generative AI in processing and analyzing audio files. This application is not just a technical feat but a testament to how far we’ve come in the realm of machine learning and audio processing.
Our application leverages state-of-the-art AI models to perform two primary tasks on an audio file:
- Transcription: Converts spoken words into written text.
- Speaker Diarization: Identifies different speakers in the audio.
The application is particularly useful for processing interviews, podcasts, or any audio with multiple speakers, providing both the transcription and the segmentation of who said what.
First, we upload the audio file and install the necessary libraries: `whisper` for transcription and `pyannote.audio` for speaker diarization.
```python
# Upload the audio file
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

num_speakers = 2  # Number of speakers to identify

# Install necessary libraries
!pip install -q git+https://github.com/openai/whisper.git > /dev/null
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null
```
Using OpenAI’s Whisper model, we transcribe the audio file into text.
```python
import whisper

model_size = "medium"  # choose a Whisper model size, e.g. "tiny", "base", "small", "medium", "large"
model = whisper.load_model(model_size)
result = model.transcribe(path)
segments = result["segments"]
```
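Whisper returns each segment as a dictionary carrying the start/end timestamps (in seconds) and the recognized text; the diarization steps below key off these fields. An illustrative example (the values here are invented):

```python
# Hypothetical shape of Whisper's output segments (values invented for illustration)
example_segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the show."},
    {"start": 4.2, "end": 9.8, "text": " Thanks for having me."},
]

# Each segment's duration is simply end - start
durations = [round(s["end"] - s["start"], 1) for s in example_segments]
```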
We use `pyannote.audio` for speaker diarization. This involves extracting embeddings (numerical representations) for each audio segment.
```python
import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda"))
```
Each segment of the audio is processed to extract embeddings. This is crucial for identifying different speakers.
```python
import numpy as np

# Define a function to get embeddings for a segment
def segment_embedding(segment):
    ...  # [Full function here as previously discussed]

# Calculate embeddings for each segment
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)
embeddings = np.nan_to_num(embeddings)
```
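The elided `segment_embedding` is typically implemented by cropping the segment's time span out of the audio and running that slice through the embedding model (notebooks of this kind usually use pyannote's `Audio` helper for the cropping). A minimal sketch over a mono waveform array, with the model passed in as a callable; the clamping guards against Whisper timestamps that overshoot the end of the file:

```python
import numpy as np

def segment_embedding(segment, waveform, sample_rate, embedding_model):
    """Crop [start, end] out of a 1-D mono waveform and embed the slice.

    `embedding_model` is any callable that maps an audio slice to a
    fixed-size embedding (for the pyannote model, wrap the slice in a
    torch tensor of shape (1, 1, time) before calling it).
    """
    duration = len(waveform) / sample_rate
    start = int(segment["start"] * sample_rate)
    # Whisper can report an end time past the file's actual duration; clamp it
    end = int(min(segment["end"], duration) * sample_rate)
    return embedding_model(waveform[start:end])
```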
The embeddings are then used to cluster segments by speakers.
```python
from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
```
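`AgglomerativeClustering` needs no training: it repeatedly merges the closest embeddings until `num_speakers` clusters remain, and `labels_` assigns each segment a cluster index. A toy run on synthetic "embeddings" (two well-separated blobs standing in for two speakers):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two fake "speakers": tight blobs around distant centers in embedding space
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(5, 192))
speaker_b = rng.normal(loc=5.0, scale=0.1, size=(5, 192))
embeddings = np.vstack([speaker_a, speaker_b])

labels = AgglomerativeClustering(n_clusters=2).fit(embeddings).labels_
# Segments from the same blob land in the same cluster
```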
Finally, we compile the transcript, annotating each segment with the identified speaker.
```python
# Annotate segments with speaker labels
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

# Write the transcript to a file
with open("transcript.txt", "w") as f:
    for i, segment in enumerate(segments):
        ...  # [Writing logic here]
```
This application stands as a remarkable example of how Generative AI can streamline complex tasks like audio processing and speaker diarization. As we continue to push the boundaries of what’s possible with AI, applications like these open new doors for content creators, researchers, and anyone interested in audio data analysis.