In a previous article, we discussed how to access the OpenAI API in R. This article extends that application: it shows how to extract the text from a PDF document and then retrieve the precise information you want without reading the whole document. It contains R code that extracts the text and sends it to the OpenAI API, which generates answers based on user-defined prompts. This approach allows for direct interaction with academic papers: through targeted prompts, you can extract specific snippets of information. This article’s code is based on the following source.
To illustrate the capabilities of the OpenAI API in R, we will use a PDF document from the Nobel Prize committee regarding the 2023 Prize in Economic Sciences awarded to Claudia Goldin. The code aims to gain insight into her contributions to our understanding of women’s labor market outcomes, all without directly reading the document.
This template can be used flexibly. It lets you extract information from papers without reading them in full, reducing the time these tasks traditionally require. It can also help you understand a paper more thoroughly: the API can answer questions about sections you don’t understand, the specific methodology, or the limitations. The template code can be changed any way you like, and we highly recommend customizing it according to your specific research needs.
Requirements
- R and RStudio Installation: Your system should have R and preferably RStudio installed. For an installation guide, see Tilburg Science Hub.
- OpenAI API Key Configuration: Your R environment must be configured with your OpenAI API Key. For instructions on setting up your API key, check out this article.
Setting Up the OpenAI API in R
Install the OpenAI R Package
Start by opening RStudio and creating a new R file. Install the OpenAI R package by running install.packages("openai").
Import the Required Library
After installation, load the OpenAI package using library(openai) to enable API interactions.
Initialize your API Key in R
For convenience, set your OpenAI API Key as an environment variable within R using Sys.setenv(OPENAI_API_KEY = 'YOUR_API_KEY'). Replace 'YOUR_API_KEY' with the actual key you obtained from OpenAI. This step makes your API key easily accessible throughout the R script. To keep the key secure, avoid sharing or committing a script that contains it.
# OpenAI API Setup
# Install and load the OpenAI R package
install.packages("openai")
library(openai)
# It's recommended to set your API Key in an environment file or variable for security reasons
# Set your OpenAI API Key (replace 'your_api_key_here' with your actual key)
Sys.setenv(OPENAI_API_KEY = 'your_api_key_here')
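Hard-coding the key in a script makes it easy to leak when the script is shared. A minimal sketch of one alternative, assuming the key is stored in a .Renviron file (which R reads at startup) as a line of the form OPENAI_API_KEY=your_key. The helper name get_api_key() is hypothetical and not part of any package:

```r
# Hypothetical helper: read the key from the environment and fail early
# with a clear message if it has not been set.
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set. ",
         "Add it to your .Renviron file and restart R.", call. = FALSE)
  }
  key
}
```

Calling api_key <- get_api_key() at the top of a script then stops immediately when the key is missing, instead of failing later during the API request.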
Extracting Text from PDFs
This section of the article walks you through loading your PDF document into R, extracting the text, and preparing it for analysis.
1. Install the pdftools and tidyverse packages, which are needed for handling PDF files inside R. After installing both packages, load them with library().
2. Extract the PDF Text: This takes two steps. First, specify the location of your PDF file; second, use the pdf_text() function from the pdftools package to extract the text. The result is a character vector with one element per page.
3. Store and Reload the PDF Text: Use the write_rds() function from the readr package to save your extracted text. This creates an RDS file containing the text, which can be accessed later. When you want to work with the extracted text again, load it back into your R environment using the read_rds() function.
# PDF Text Extraction
# Add required libraries for handling PDF files and data
install.packages("pdftools")
install.packages("tidyverse")
library(tidyverse)
library(pdftools)
# Define the path to the PDF document
pdf_path <- "path_to_your_pdf/filename.pdf"
# Extract text from the PDF
extracted_text <- pdf_text(pdf_path)
# Optional: save and reload the extracted text
extracted_text %>% write_rds("extracted_text.rds")
extracted_text <- read_rds("extracted_text.rds")
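Because pdf_text() returns one string per page, you can inspect or subset pages before sending anything to the API, which also keeps the request smaller. The sketch below uses a small made-up character vector in place of a real PDF so that it runs on its own. The helper name summarise_pages() is hypothetical.

```r
# Stand-in for the output of pdf_text(): one string per page (made-up text)
extracted_text <- c(
  "Title page\n\nThe Prize in Economic Sciences 2023",
  "Page two   with   irregular   spacing.",
  "References"
)

# Hypothetical helper: report the number of characters on each page
summarise_pages <- function(pages) {
  data.frame(page = seq_along(pages), n_chars = nchar(pages))
}

page_summary <- summarise_pages(extracted_text)
print(page_summary)

# Keep only the pages of interest, e.g. drop the reference list on page 3
selected_text <- extracted_text[1:2]

# Collapse runs of whitespace (including line breaks) into single spaces
selected_text <- trimws(gsub("\\s+", " ", selected_text))
```

Collapsing whitespace discards the page layout, so skip that last step if line breaks or column alignment matter for your question.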
Analyzing Text with OpenAI API
Once you have extracted the text from the PDF, the next step is to send this data to the OpenAI API. The API can function as your personal digital assistant or chatbot for the extracted text: it can answer specific questions, translate, or summarize, depending on your prompts.
1. Install and Load the HTTP Package
First, install and load the httr package, which is used for making HTTP requests, including sending data to and receiving data from the OpenAI API:
install.packages("httr")
library(httr)
2. Define the API Endpoint
Next, store the OpenAI API’s URL in the api_endpoint variable. The API endpoint is the URL where your request will be sent. For interacting with the OpenAI API, define the endpoint as follows:
api_endpoint <- "https://api.openai.com/v1/chat/completions"
3. Prepare the Prompt and Document Text
Specify your prompt for the analysis, in this case, related to Claudia Goldin’s research. In addition, format the document text for the API.
# Prompt for analysis
analysis_prompt <- "In what ways has Goldin's research contributed to our understanding of the dynamics behind the gender gap in earnings and employment?"
# Collapse the pages into a single string, separated by newlines
formatted_text <- str_c(extracted_text, collapse = "\n")
4. Construct the Request Body
This step is important: here we construct the request body. It contains the model chosen to answer the prompt, as well as the messages exchanged between the user and the system. The system message assigns the assistant a role, whereas the user message contains both the prompt and the text from which the answer should be extracted.
- model = "gpt-3.5-turbo": Specifies the model to use for the text generation. In this case, “gpt-3.5-turbo” is chosen. For alternative models, you can check this overview.
- messages: A list of messages that simulate a conversation between a user and the assistant. Each message is a list with two elements: role and content. The role can be “system”, “user”, or “assistant”, indicating the sender of the message.
# Construct the API request Body
request_body <- list(
  model = "gpt-3.5-turbo",
  messages = list(
    list(role = "system", content = "You are a smart and eager to help assistant."),
    list(role = "user", content = str_c(analysis_prompt, formatted_text))
  )
)
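To see what this body looks like once encoded, you can convert it to JSON yourself. This is a sketch, assuming the jsonlite package is installed (httr depends on it). The placeholder prompt and text are made up, and the output only approximates what httr sends when encode = "json" is used.

```r
library(jsonlite)

# Placeholder inputs (made up for illustration)
analysis_prompt <- "Summarize the text below."
formatted_text  <- "Page one text.\nPage two text."

request_body <- list(
  model = "gpt-3.5-turbo",
  messages = list(
    list(role = "system", content = "You are a smart and eager to help assistant."),
    list(role = "user", content = paste(analysis_prompt, formatted_text, sep = "\n\n"))
  )
)

# auto_unbox = TRUE writes length-one vectors as plain JSON values, not arrays
body_json <- toJSON(request_body, auto_unbox = TRUE, pretty = TRUE)
cat(body_json, "\n")
```

The sep = "\n\n" argument puts a blank line between the prompt and the document text. That is a choice made for this sketch, not something the template code does.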
5. Send the Request
Use the POST() function from the httr package to send the request to the API. Specify the following arguments:
- url = api_endpoint: The URL of the OpenAI API, where the request is sent.
- body = request_body: The core of the request, holding the data for the API to process: the chosen model, the prompt, and the messages between the user and the system. This is the structure built in step 4.
- encode = "json": Tells httr to encode the body as JSON, so that the API correctly interprets the structure and content of the request.
- add_headers(...): Specifies the request metadata needed for proper handling and authentication. This includes:
  - Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY")): Authenticates the request by including the previously obtained API key. The “Bearer” prefix is the standard way to present this key.
  - Content-Type = "application/json": Specifies the type of data being sent, stating that the request is encoded in JSON format.
# Execute the POST request to the OpenAI API
api_response <- POST(
  url = api_endpoint,
  body = request_body,
  encode = "json",
  add_headers(
    `Authorization` = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
    `Content-Type` = "application/json"
  )
)
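The request can fail, for example because of an invalid key or a rate limit, in which case the parsed response will not contain an answer. A minimal sketch of a wrapper that checks the HTTP status before continuing. The name send_chat_request() is a hypothetical helper, and the 60-second timeout is an arbitrary choice:

```r
# Hypothetical wrapper around the POST call, with basic error checks
send_chat_request <- function(request_body,
                              api_key = Sys.getenv("OPENAI_API_KEY"),
                              api_endpoint = "https://api.openai.com/v1/chat/completions") {
  # Stop before any network traffic if no key is available
  if (!nzchar(api_key)) {
    stop("No API key found; set OPENAI_API_KEY first.", call. = FALSE)
  }

  api_response <- httr::POST(
    url = api_endpoint,
    body = request_body,
    encode = "json",
    httr::add_headers(Authorization = paste("Bearer", api_key)),
    httr::timeout(60)  # give up after 60 seconds (arbitrary choice)
  )

  # http_error() is TRUE for HTTP status codes of 400 and above
  if (httr::http_error(api_response)) {
    stop("Request failed with HTTP status ",
         httr::status_code(api_response), call. = FALSE)
  }

  api_response
}
```

With this in place, api_response <- send_chat_request(request_body) either returns a successful response or stops with a message naming the status code.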
6. Interpreting the Response
Once the request is processed by the OpenAI API, the response is stored in the api_response variable. However, for R to work with the response, we need to transform it using content(api_response, "parsed"). Parsing, in this context, means that the JSON data returned by the API is converted into an R list that you can access and manipulate directly.
# Process the response from the API
response_data <- content(api_response, "parsed")
# Save and optionally reload the response data for review
dir.create("answer", showWarnings = FALSE) # write_rds() does not create missing folders
response_data %>% write_rds("answer/analysis_results.rds")
response_data <- read_rds("answer/analysis_results.rds")
7. Reviewing the Insights
The last step is to extract the specific part of the API’s response that contains the answer to your prompt and save it to a text file.
# Extract the answer from the API's response and save it to a text file
api_summary <- response_data$choices[[1]]$message$content
cat(api_summary, file = "filename.txt")
Claudia Goldin's research has made significant contributions to our understanding of the dynamics behind the gender gap in earnings and employment. Here are some key ways in which her work has advanced our knowledge: 1. Historical Perspective: Goldin's research provides a historical perspective on the evolution of women's labor market outcomes, particularly in the United States. By looking at long-term trends and analyzing historical data, she has uncovered the drivers of change over time and how social and economic factors have influenced women's participation rates and earnings. 2. Unifying Economic Framework: Goldin developed a coherent framework for studying the labor market outcomes of women, connecting education, fertility, productivity, aspirations, and institutional change. She emphasized the importance of understanding how supply and demand factors shape women's employment and wages, highlighting constraints that impact female labor supply decisions. ...
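If the API returns an error instead of an answer, response_data$choices is missing and the extraction step fails with an unhelpful message. The sketch below checks for that case. It assumes the error is reported in an error element with a message field, and it uses made-up response lists so that it can run without calling the API. The name extract_answer() is a hypothetical helper.

```r
# Hypothetical helper: return the answer text or stop with a readable error
extract_answer <- function(response_data) {
  if (!is.null(response_data$error)) {
    stop("API error: ", response_data$error$message, call. = FALSE)
  }
  if (length(response_data$choices) == 0) {
    stop("The response contains no choices.", call. = FALSE)
  }
  response_data$choices[[1]]$message$content
}

# Made-up parsed responses, shaped like the output of content(api_response, "parsed")
ok_response <- list(
  choices = list(
    list(message = list(role = "assistant", content = "Example answer."))
  )
)
error_response <- list(error = list(message = "Example error message."))

api_summary <- extract_answer(ok_response)
cat(api_summary, "\n")
```

Passing error_response to extract_answer() stops with the API's own message, which is usually enough to tell a bad key from a request that was too long.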
Conclusion
In summary, this article outlines the steps to use the OpenAI API within R for extracting information from PDF files. This way you can, for example, summarize your PDFs without a paid ChatGPT subscription (API usage is billed separately, based on usage). We highlighted an example in which the API answers questions about the research of Claudia Goldin, showing how to extract information from papers without reading them in full. You can steer your prompts towards more precise questions about the paper under study, which in a sense lets you ask your PDF questions. We highly recommend customizing the script according to your specific research needs.
Below, you will find the template code that you can use and adjust.
# Scraping PDF with OpenAI API in R
# 1. OpenAI API Setup
# Install and load the OpenAI R package
install.packages("openai")
library(openai)
# We recommend setting your API Key in an environment file or variable for security reasons
# replace 'your_api_key_here' with your actual key
Sys.setenv(OPENAI_API_KEY = 'your_api_key_here')
# 2. PDF Text Extraction
# Add required libraries for handling PDF files and data
install.packages("pdftools")
install.packages("tidyverse")
library(tidyverse)
library(pdftools)
# Define the path to the PDF document
pdf_path <- "path_to_your_pdf/filename.pdf"
# Extract text from the PDF
extracted_text <- pdf_text(pdf_path)
# Optionally, save and reload the extracted text
extracted_text %>% write_rds("extracted_text.rds")
extracted_text <- read_rds("extracted_text.rds")
# Example: View text from specific pages
length(extracted_text) # Total number of pages extracted
extracted_text[1] # Text from the first page
extracted_text[6] # Text from the sixth page
# 3. Analyzing the PDF Document with OpenAI API
# Load the httr package for HTTP requests
install.packages("httr")
library(httr)
# Set the API endpoint for chat completions
api_endpoint <- "https://api.openai.com/v1/chat/completions"
# Prompt for analysis
analysis_prompt <- "In what ways has Goldin's research contributed to our understanding of the dynamics behind the gender gap in earnings and employment?"
# Collapse the pages into a single string, separated by newlines
formatted_text <- str_c(extracted_text, collapse = "\n")
# Construct the API request Body
request_body <- list(
  model = "gpt-3.5-turbo",
  messages = list(
    list(role = "system", content = "You are a smart, eager to help and precise assistant."),
    list(role = "user", content = str_c(analysis_prompt, formatted_text))
  )
)
# Execute the POST request to the OpenAI API
api_response <- POST(
  url = api_endpoint,
  body = request_body,
  encode = "json",
  add_headers(
    `Authorization` = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
    `Content-Type` = "application/json"
  )
)
# Process the response from the API
response_data <- content(api_response, "parsed")
# Save and optionally reload the response data for review
dir.create("answer", showWarnings = FALSE) # write_rds() does not create missing folders
response_data %>% write_rds("answer/analysis_results.rds")
response_data <- read_rds("answer/analysis_results.rds")
# Extract the answer from the API's response and save it to a text file
api_summary <- response_data$choices[[1]]$message$content
cat(api_summary, file = "output.txt")
Source
This article builds on the idea previously published by Business Science. The original source code is available on this page.