The OpenAI Whisper API gives access to Whisper, an automatic speech recognition (ASR) system developed by OpenAI. Unlike OpenAI’s well-known products such as ChatGPT, Whisper is not a chatbot. It is a model that converts spoken audio into text in the original language and can also translate it into English. Have you ever wished you could easily convert spoken words from lectures or meetings into text? Whisper is well suited to that task.
What is Whisper?
Whisper, developed by OpenAI, is an automatic speech recognition model trained on 680,000 hours of multilingual audio from the web. Its primary function is to transcribe audio files, turning spoken words into written text. One of its key strengths is its ability to handle a variety of audio conditions, including different accents, background noise, and technical jargon. The model supports transcription in 99 languages, though its accuracy varies across languages and dialects. Through the API, you can submit audio files under 25 MB in the following formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
How to access and use Whisper?
This article uses Whisper through OpenAI’s Application Programming Interface (API). The model is also available as open-source software that can be run locally, but that route is not covered here. To use Whisper via the API, you must first obtain an API key from OpenAI. You then write Python scripts that make requests to the API using this key. The process consists of importing the necessary libraries, setting the API key for authorization, and sending audio files to the Whisper API for transcription. While the API provides a structured way to access Whisper’s capabilities, it requires a basic understanding of programming and API usage. In the following steps, we explain how to use Whisper in Python. This part is partly based on an article on Whisper published by our colleagues from Tilburg Science Hub.
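- The first step is to install the Python packages used in this article with pip: openai, which talks to the API, and pydub, which we use later to split audio files. The code below uses the openai.Audio and openai.ChatCompletion interfaces, which were removed in version 1.0 of the openai package, so we install an earlier version. pydub also requires the command-line tool ffmpeg, which you can install with the matching line from the second code block.
pip install "openai<1.0" pydub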
#for Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
#for Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
#for Mac using Homebrew (https://brew.sh/)
brew install ffmpeg
#for Ubuntu
sudo apt update && sudo apt install ffmpeg
#for Arch Linux
sudo pacman -S ffmpeg
- Import the necessary libraries:
import openai
import os
from pydub import AudioSegment
This step imports the Python libraries the script needs. openai is used to interact with OpenAI’s API, os provides access to operating-system functionality such as file paths and directories, and AudioSegment from pydub is used for audio file manipulation (which we will use later on to split the audio file).
- Set OpenAI API key:
openai.api_key = "[YOUR_OPENAI_API_KEY]"
Replace [YOUR_OPENAI_API_KEY] with your actual OpenAI API key, which is used to authenticate requests to the OpenAI API.
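Pasting the key directly into a script makes it easy to share the key by accident. A common alternative is to read it from an environment variable. The following is a minimal sketch of that idea; the helper name load_api_key is hypothetical, and the variable name OPENAI_API_KEY is assumed here as a convention rather than something this article prescribes.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    Both the helper and the default variable name are illustrative
    assumptions; use whatever name you set on your own machine.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key
```

The result can then be assigned in place of the hard-coded string, for example openai.api_key = load_api_key().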
- Open an audio file and transcribe it:
audio_file = open("[PATH_TO_YOUR_AUDIOFILE]", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
Replace [PATH_TO_YOUR_AUDIOFILE] with the actual path to the audio file you want to transcribe. The OpenAI API is then called to transcribe the audio file using the "whisper-1" model.
- Save the transcript to a text file:
transcript_text = transcript['text']
# Write as UTF-8 so that non-English characters are saved correctly
with open("[PATH_TO_DIRECTORY]/transcript.txt", 'w', encoding='utf-8') as text_file:
    text_file.write(transcript_text)
This step takes the transcribed text from the transcript object, which is accessed via transcript['text'], and writes it to a text file named transcript.txt. You need to replace [PATH_TO_DIRECTORY] with the actual directory path where you want to save the transcript.
Transcribing files larger than 25 MB
Whisper currently has a maximum file size limit of 25 MB for audio files, which can be a significant constraint when you have longer recordings. However, a practical workaround exists: by dividing the audio file into smaller segments, you can stay within this size restriction. This can be achieved using the AudioSegment class from the pydub library. Splitting the audio into parts lets you process longer recordings in chunks that each stay under the limit.
# Getting only the first ten minutes of the audio file to transcribe
audio = AudioSegment.from_file("[PATH_TO_YOUR_AUDIOFILE]")
ten_minutes = 10 * 60 * 1000
first_10_minutes = audio[:ten_minutes]
# Saving the shorter audio file as mp4
first_10_minutes.export("[PATH_TO_DIRECTORY]/first_ten_minutes.mp4", format="mp4")
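Before splitting anything, it can help to check whether a file actually exceeds the limit. The sketch below does this with the standard library only. The helper name needs_splitting is hypothetical, and reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption; the exact threshold applied by the API may differ slightly.

```python
import os

# Assumed interpretation of the 25 MB limit mentioned above
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def needs_splitting(path, limit_bytes=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is not strictly under the limit."""
    return os.path.getsize(path) >= limit_bytes
```

Files for which this returns False can be sent to the API directly, and only the larger ones need to be cut into segments.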
Manually splitting the file into ten-minute segments and transcribing each segment is repetitive and time-consuming. We can automate this process in Python with the following code:
def transcribe_audio(path_to_audiofile, output_dir):
    # Load the audio from your file
    audio = AudioSegment.from_file(path_to_audiofile)
    # Define the duration of each segment in milliseconds (10 minutes)
    segment_duration = 10 * 60 * 1000
    # Initialize variables for tracking the current position and the list of segments
    current_position = 0
    audio_segments = []
    # Create the specified output directory for storing temporary audio segments if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    # Continue splitting as long as there is enough audio left
    while current_position + segment_duration <= len(audio):
        # Extract the current 10-minute segment
        segment = audio[current_position:current_position + segment_duration]
        # Define the file path for the current segment
        segment_path = os.path.join(output_dir, f"temp_segment_{len(audio_segments)}.mp4")
        # Export the segment to the temporary file
        segment.export(segment_path, format="mp4")
        # Add the segment to the list
        audio_segments.append(segment)
        # Update the current position for the next iteration
        current_position += segment_duration
    # If there is any remaining audio less than 10 minutes, add it as the last segment
    if current_position < len(audio):
        remaining_segment = audio[current_position:]
        # Define the file path for the remaining segment
        remaining_segment_path = os.path.join(output_dir, f"temp_segment_{len(audio_segments)}.mp4")
        # Export the remaining segment to the temporary file
        remaining_segment.export(remaining_segment_path, format="mp4")
        audio_segments.append(remaining_segment)
    # Transcribe each audio segment and append the results to a list
    transcribed_texts = []
    for i, segment in enumerate(audio_segments):
        # Get the file path for the current segment
        segment_path = os.path.join(output_dir, f"temp_segment_{i}.mp4")
        # Open the audio file as a binary file
        with open(segment_path, 'rb') as audio_file:
            # Transcribe the segment using OpenAI
            transcript = openai.Audio.transcribe("whisper-1", file=audio_file)
        transcript_text = transcript['text']
        transcribed_texts.append(transcript_text)
        # Print a "done" statement for the current segment
        print(f"Segment {i + 1} transcribed!")
    # Combine all transcribed texts into one document
    combined_transcript = "\n".join(transcribed_texts)
    # Write the combined transcript to a file (UTF-8 keeps non-English characters intact)
    combined_transcript_path = os.path.join(output_dir, "combined_transcript.txt")
    with open(combined_transcript_path, 'w', encoding='utf-8') as text_file:
        text_file.write(combined_transcript)

# Call the function with user-specified audio file path and output directory
user_audio_path = "[PATH_TO_YOUR_AUDIOFILE]"
user_output_dir = "[PATH_TO_DIRECTORY_FOR_OUTPUT]"
transcribe_audio(user_audio_path, user_output_dir)
The process is now automated in a function with two parameters, path_to_audiofile and output_dir. It first breaks the audio file into 10-minute segments and saves each segment as a separate file in the output directory output_dir. Each of these segments is then transcribed into text using OpenAI’s audio transcription service, and a message is printed as each one completes. Finally, the transcriptions of all segments are combined into a single document, which is saved as a text file in the same output directory. The two placeholders in the function call, "[PATH_TO_YOUR_AUDIOFILE]" and "[PATH_TO_DIRECTORY_FOR_OUTPUT]", are meant to be replaced with the actual file paths relevant to your use case.
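The splitting logic described above comes down to computing start and end positions in milliseconds. As a rough illustration, the sketch below does that arithmetic on its own, without loading any audio. The function name segment_bounds is hypothetical and not part of pydub or the OpenAI library.

```python
def segment_bounds(total_ms, segment_ms=10 * 60 * 1000):
    """Return (start, end) pairs in milliseconds that cover the whole recording.

    Every segment is `segment_ms` long except possibly the last one,
    which holds whatever audio remains.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# A 25-minute recording yields two full segments and one 5-minute remainder
print(segment_bounds(25 * 60 * 1000))
```

Each pair could then be used to slice a pydub AudioSegment, as in audio[start:end], which handles the full segments and the remainder in a single loop.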
Improving the quality of the transcription
If parts of the transcript fall short of what you need, you can try to improve its quality, either during transcription or afterwards. During transcription, this is done with prompting: the model tries to follow the style of the prompt, so if the prompt uses punctuation and capitalization, the transcript is more likely to do the same. However, prompting Whisper is far more constrained than prompting OpenAI’s other language models and offers only limited control over the text that is produced.
# You may want to include specific words or acronyms in the prompt that the model may misrecognize in the audio
transcript = openai.Audio.transcribe("whisper-1", audio_file, prompt="[BRANDS_NAMES_EXAMPLE-TRANSCRIPT]")
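One way to fill in the prompt is to assemble it from a list of names and acronyms that tend to be misrecognized. The sketch below is only an illustration: the helper name build_prompt is hypothetical, and the max_words budget is an arbitrary assumption chosen to keep the prompt short, not a documented limit.

```python
def build_prompt(terms, max_words=150):
    """Join unique terms into a comma-separated prompt string.

    `max_words` is an illustrative budget (an assumption), used only
    to stop adding terms once the prompt would become long.
    """
    seen = set()
    kept = []
    words_used = 0
    for term in terms:
        cleaned = term.strip()
        # Skip empty entries and case-insensitive duplicates
        if not cleaned or cleaned.lower() in seen:
            continue
        n_words = len(cleaned.split())
        if words_used + n_words > max_words:
            break
        seen.add(cleaned.lower())
        kept.append(cleaned)
        words_used += n_words
    return ", ".join(kept)
```

The returned string can be passed as the prompt argument in the call shown above.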
A second way to improve the quality of your transcription is to post-process the transcript with GPT-4 or GPT-3.5. This can be done directly in Python (or you can copy and paste the transcript into ChatGPT and prompt it there).
# Post-processing the text
# transcript_text is the transcribed text obtained in the earlier steps
system_prompt_use = f"You are a helpful assistant for the company [YOUR_COMPANY_NAME]. Your task is to correct any spelling discrepancies in the transcribed text given below. Make sure that the names of the following persons and products are spelled correctly: [LIST_OF_WORDS]. Only add necessary punctuation such as periods, commas, and capitalization, and use only the context provided. \n\n {transcript_text}"

def generate_corrected_transcript(system_prompt):
    # Send the instructions and the transcript to the chat model in a single message
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": system_prompt
            }
        ]
    )
    # Return only the text of the model's reply
    return response.choices[0].message.content

corrected_text = generate_corrected_transcript(system_prompt_use)
print(corrected_text)
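The code above sends the instructions and the transcript together in one user message. A possible variation, sketched below, keeps the instructions in a system message and the transcript in a separate user message, so the transcript itself is less likely to be read as an instruction. The function build_messages and its parameters are hypothetical names introduced for illustration.

```python
def build_messages(transcript_text, company, terms):
    """Build a chat message list with separate system and user roles.

    `company` and `terms` stand in for the [YOUR_COMPANY_NAME] and
    [LIST_OF_WORDS] placeholders used above.
    """
    instructions = (
        f"You are a helpful assistant for the company {company}. "
        "Your task is to correct any spelling discrepancies in the transcribed text. "
        "Make sure that the names of the following persons and products are "
        f"spelled correctly: {', '.join(terms)}. "
        "Only add necessary punctuation such as periods, commas, and capitalization, "
        "and use only the context provided."
    )
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": transcript_text},
    ]
```

The returned list can be passed as the messages argument of the chat completion call shown above.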
Benefits for teachers and students
Using Whisper for transcription can be beneficial for both students and teachers. By automating the transcription process, Whisper saves time and eliminates the need for manually transcribing lengthy audio recordings. Here are some practical ways you can use Whisper:
Note-Taking During Lectures
Whisper can be used to transcribe online or in-person lectures, allowing students to focus on understanding the content rather than taking extensive notes. This feature is equally useful for teachers, as they can easily transcribe recorded lectures or presentations for their students.
Improved Documentation and Follow-Up in Meetings
Transcribing meetings with Whisper means a detailed and accurate written record of discussions. This makes it easier to track action items, decisions, and key points covered during the meeting. Accurate transcripts facilitate efficient follow-up, ensuring tasks and responsibilities are clearly documented and assigned. Additionally, if you miss a meeting, you can quickly catch up by reading the transcript.
Limitations
While Whisper is a great tool for transcribing audio files, it does have some limitations:
- Quality: Despite efforts to improve transcription quality, Whisper does not achieve perfect accuracy. Factors such as background noise, speaker accents, complex terminology, or rapid speech can still pose challenges. From our experience, even after trying to improve the quality with the practices explained in this article, results vary. That said, transcripts with some mistakes can still provide valuable information.
- No identification of individual speakers: You should be aware that while Whisper can generate a plain text file of transcribed content, it typically doesn’t identify individual speakers. This limitation means that in multi-speaker scenarios, the transcribed text won’t attribute specific statements to particular participants, potentially requiring manual annotation or additional context to discern who said what during a conversation or meeting. This function is offered by AssemblyAI, which in this regard outperforms Whisper.