So far, when using the OpenAI API, we have sent a text or coding prompt to the model and received the corresponding output. As we have seen, this opens up a lot of opportunities. However, the OpenAI Python package includes more models. In this article, we will discuss Whisper, which can transform audio recordings or videos into text. While we have a more advanced article on our website, here we will cover the basics.
What are the benefits of Whisper?
Using Whisper for transcription can be beneficial for students and teachers alike. Instead of manually transcribing lengthy audio recordings, Whisper saves time by automating the transcription process. In practice, it can be used in the following ways:
- Note-taking during lectures: Transcribe online lectures, allowing you to focus on understanding the content instead of taking extensive notes. Teachers can also benefit from this feature when recording their lectures or presentations.
- Improved documentation and follow-up during meetings: Transcribing meetings gives you a detailed and accurate written record of what was said. This makes it easier to track action items, decisions, and key points discussed during the meeting, and leads to efficient follow-up, ensuring that tasks and responsibilities are properly documented and assigned. It is also helpful if you missed a meeting, since you can easily read back the transcript.
How to use Whisper: Step by Step Explanation
Step 1: Opening the Audio File
To transcribe an audio file using the Whisper model, the first step is to open the file in a way that the model can understand. This is done using a simple Python code snippet:
audio_file = open("PATH/TO/Your/FILE.mp3", "rb")
Here’s what this code does:
- audio_file is a variable that will store the file for further use.
- open("PATH/TO/Your/FILE.mp3", "rb") is the function used to open the file:
  - "PATH/TO/Your/FILE.mp3" should be replaced with the actual location of your audio file.
  - "rb" stands for "read binary", which means the file is read in a format suitable for non-text files like audio files.
Using the "rb" mode is crucial because it ensures the file is read correctly as a stream of bytes, making it compatible with the Whisper model for transcription.
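If you want to see what "read binary" means in practice, you can optionally read a few bytes and check their type. This is just a small sketch; remember to rewind the file afterwards so the full file is still available for the transcription request:
# Optional check: the file is read as raw bytes
first_bytes = audio_file.read(4)
print(type(first_bytes))  # <class 'bytes'>
# Rewind to the start so the full file can still be sent to the API
audio_file.seek(0)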
File uploads are limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
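Because of the 25 MB limit, it can be useful to check your file before sending it. The snippet below is a minimal sketch using Python's standard library; the file path is just a placeholder that you should replace with your own:
import os

path = "PATH/TO/Your/FILE.mp3"

# Check the size against the 25 MB upload limit
size_mb = os.path.getsize(path) / (1024 * 1024)
print(f"File size: {size_mb:.1f} MB")

# Check that the extension is one of the supported input types
supported = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
extension = os.path.splitext(path)[1].lower()
print("Supported format:", extension in supported)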
Step 2: Creating a Transcription Request
Next, we need to transcribe the audio file into text using the Whisper model. This involves sending a request to the API with the audio file. Here’s the relevant Python code snippet:
# Create a transcript from the audio file
response = client.audio.transcriptions.create(
    model = "whisper-1",
    file = audio_file)
Here’s a breakdown of what this code does:
- response is a variable that will store the result from the API.
- client.audio.transcriptions.create() is the function call that sends the audio file to Whisper for transcription:
  - model = "whisper-1" specifies the model to use for transcription.
  - file = audio_file sends the opened audio file.
The transcription result is returned and stored in the response variable. To save the actual text of the transcription, use the following code:
transcript = response.text
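If you want to keep the transcript for later, you can, for example, write it to a plain text file. This is an optional sketch; the file name transcript.txt is just an assumption:
# Save the transcript to a text file for later use
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(transcript)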
Great, you have made real progress in becoming a prompt engineer by using a new model: Whisper! You can use it to transcribe online lectures or meetings. For a more in-depth look, check out this earlier article on the website or download the template code underneath:
Model Chaining
We can extend the use of Whisper with the text models by performing model chaining. Chaining is when models are combined by feeding the output from one model directly into another model as input. We can chain multiple calls to the same model together or use different models.
If we chain two text models together, we can ask the model to perform a task in one call to the API and then send the result back with an additional instruction, as sketched below. We can also combine two different types of models, like the Whisper model and a text model, to perform tasks like summarizing lectures and videos.
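To make the idea of chaining two text models concrete, here is a minimal sketch: the first call summarizes a piece of text, and the second call receives that summary as input together with an additional instruction. The example text and instructions are placeholders, and the sketch assumes the same client object we use elsewhere in this article:
# First call: ask the model to perform a task (summarize some text)
first_response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [{"role": "user",
                 "content": "Summarize the following text in three sentences: YOUR TEXT HERE"}],
    temperature = 0)
summary = first_response.choices[0].message.content

# Second call: feed the output of the first call back in with an extra instruction
second_response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [{"role": "user",
                 "content": f"Turn this summary into three bullet points for revision: {summary}"}],
    temperature = 0)
print(second_response.choices[0].message.content)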
Let’s look at whether we can use the transcript we have created in a language model!
Inserting the Transcript and Multi-Step Prompting
For this example, we will construct a multi-step prompt that outlines the steps you want the model to follow. Here, we specify three steps:
prompt = f"""Transform the uploaded transcript, between parentheses with the following three steps:
Step 1 - Proofread it without changing its structure
Step 2 - If it is not in English, translate it to English
Step 3 - Summarize it at a depth level that is appropriate for a university exam
```{transcript}```"""
Great, let’s look at this specific prompt. It uses an action verb, it is specific, it clearly separates the different parts of the prompt (the instructions and the transcript), and it is also a conditional prompt, which adds an extra translation element. Implicitly, our multi-step prompt is doing more than one thing at a time. In terms of API usage, we also use an f-string to insert the transcript. You are becoming proficient in prompt engineering!
Now we have circled back to the things we have done throughout our series! Let’s finish our example by showcasing how to combine the different steps. Here’s the full example, where we used a lecture video from YouTube!
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
# Open the audio file
audio_file = open("SPECIFY YOUR PATH/Brand equity in the hearts and minds of consumers.mp4", "rb")
# Create a transcription request using audio_file
audio_response = client.audio.transcriptions.create(
    model = "whisper-1",
    file = audio_file
)
# Save the transcript
transcript = audio_response.text
# Chain your prompt with the transcript
## Here we have engineered our prompt: Multi-step Prompting
prompt = f"""Transform the uploaded transcript, between parentheses with the following three steps:
Step 1 - Proofread it without changing its structure
Step 2 - If it is not in English, translate it to English
Step 3 - Summarize it at a depth level that is appropriate for a university exam
```{transcript}```"""
# Create a request to the API to process the transcript with the multi-step prompt
chat_response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [
        {"role": "system",
         "content": "Act as a helpful assistant"},
        {"role": "user",
         "content": prompt}
    ],
    temperature = 0)
print(chat_response.choices[0].message.content)
By following these steps, you have learned how to transcribe an audio or video file and process the transcript using multi-step prompting to achieve the desired output. Those manual transcription days are finally over ;)