Shared content may not reflect the policies of Tilburg University on the use of AI. 

Web Data Extraction in 3 Steps: From Scraping to Visualization using ChatGPT

Ever come across interesting data on the web, only to find you could not get it into a usable format?

Data is often not readily available in CSV format, and web scraping the data is a complex art. That’s why we show how ChatGPT can be employed to directly harvest and analyze web data using R. We will guide you through extracting and transforming data and creating insightful visualizations with ggplot. By the end of this guide, you will be familiar with accessing and turning various datasets from the web into meaningful narratives.

Let’s explore how to extract the “All-time Olympic Games medal table” from Wikipedia as an illustrative example, but remember that the techniques we discuss can apply to virtually any data on the web. 

From Scraping to Data Visualization with R and ggplot: A Step-by-Step Guide

Step 1: Fetching Data from your Data Source

Start by locating your desired data on the web. Simply copying the information to your clipboard sets the stage for bringing it into R. The image below shows the data we selected.

With the data in your clipboard, you are now ready to dive into ChatGPT.

Tip: Leveraging ChatGPT’s Image Analysis. If you use ChatGPT-4, you can take advantage of its image analysis capability: instead of pasting text, upload a screenshot of your data table and let ChatGPT extract the information.

Step 2: The Art of Prompt Crafting

The quality of your prompts directly influences ChatGPT’s output. Provide detailed context to ensure accuracy and avoid the common trap of ‘garbage in, garbage out.’ For tips on crafting effective prompts, see our related post on the art of prompting.

Step 3: ChatGPT and R Integration

Use this example prompt to convert your data into an R data frame and then create a ggplot visualization for analysis:

Prompt Example

Here is ChatGPT’s response:

To convert the provided table into an R dataframe using the data.frame() function and create a visualization using ggplot, you can follow these steps:

Step 1: Create the dataframe using the data.frame() function:

# Create a dataframe from the provided data 
df <- data.frame(
  Nation = c(
    "United States (USA)",
    "Russia (RUS)",
    "Germany (GER)",
    "Great Britain (GBR)",
    "China (CHN)",
    "France (FRA)",
    "Italy (ITA)",
    "Hungary (HUN)",
    "Japan (JPN)",
    "Australia (AUS)"
  ),
  Gold = c(1061, 608, 438, 284, 263, 223, 217, 181, 169, 164),
  Silver = c(830, 515, 457, 318, 199, 251, 188, 154, 150, 173),
  Bronze = c(738, 502, 491, 314, 174, 277, 213, 176, 178, 210),
  Total = c(2629, 1625, 1386, 916, 636, 751, 618, 511, 497, 547)
)

Step 2: Load the ggplot2 library and create a visualization. Here, I’ll create a bar plot to visualize the total medals won by each country:

# Load the ggplot2 library 
library(ggplot2) 

# Create a bar plot 
ggplot(df, aes(x = reorder(Nation, -Total), y = Total, fill = Nation)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Olympic Medal-Winning Countries", x = "Nation", y = "Total Medals") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_discrete(guide = "none") # Remove legend (guide = FALSE is deprecated in recent ggplot2)

This code will create a bar plot showing the total medals won by each of the top 10 Olympic medal-winning countries, with the countries sorted by the total number of medals in descending order. Adjust the plot aesthetics and labels to suit your analysis and presentation preferences.
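As one way of adjusting the plot, the sketch below stacks gold, silver, and bronze counts per nation instead of showing only the total. It is an illustration rather than part of ChatGPT's response: the names df_long, medal_types, and p, the colour choices, and the plot title are assumptions made here, and the figures are repeated from the data frame above so the block runs on its own.

```r
library(ggplot2)

# Same figures as the data frame above, repeated so this sketch is self-contained
df <- data.frame(
  Nation = c("United States (USA)", "Russia (RUS)", "Germany (GER)",
             "Great Britain (GBR)", "China (CHN)", "France (FRA)",
             "Italy (ITA)", "Hungary (HUN)", "Japan (JPN)", "Australia (AUS)"),
  Gold   = c(1061, 608, 438, 284, 263, 223, 217, 181, 169, 164),
  Silver = c(830, 515, 457, 318, 199, 251, 188, 154, 150, 173),
  Bronze = c(738, 502, 491, 314, 174, 277, 213, 176, 178, 210)
)

# Reshape to long format (one row per nation and medal type) using base R only
medal_types <- c("Gold", "Silver", "Bronze")
df_long <- data.frame(
  Nation = rep(df$Nation, times = length(medal_types)),
  Medal  = factor(rep(medal_types, each = nrow(df)), levels = medal_types),
  Count  = c(df$Gold, df$Silver, df$Bronze)
)

# Stacked bars, with nations ordered by their summed medal count (largest first)
p <- ggplot(df_long,
            aes(x = reorder(Nation, -Count, FUN = sum), y = Count, fill = Medal)) +
  geom_col() +
  scale_fill_manual(values = c(Gold = "gold", Silver = "grey70", Bronze = "darkorange3")) +
  labs(title = "Medals by Type for the Top 10 Nations", x = "Nation", y = "Medals") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

The long format is what lets ggplot map the medal type to the fill colour; the wide data frame from the response would need one layer per column instead.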

Are you working with Python? Just specify the Python language in your prompt to ChatGPT, and that’s it!

Paste the prompt into ChatGPT, copy its response into R, and run it. You now have a neatly formatted data frame and a visualization built from it.
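Before building on the result, it can be worth checking that the transcribed numbers are at least internally consistent. The sketch below is one possible check, not something ChatGPT produced: it recomputes each total from the three medal columns and lists any rows that disagree. The variable names recomputed and mismatches are chosen here for illustration.

```r
# Figures copied from the data frame in ChatGPT's response
df <- data.frame(
  Nation = c("United States (USA)", "Russia (RUS)", "Germany (GER)",
             "Great Britain (GBR)", "China (CHN)", "France (FRA)",
             "Italy (ITA)", "Hungary (HUN)", "Japan (JPN)", "Australia (AUS)"),
  Gold   = c(1061, 608, 438, 284, 263, 223, 217, 181, 169, 164),
  Silver = c(830, 515, 457, 318, 199, 251, 188, 154, 150, 173),
  Bronze = c(738, 502, 491, 314, 174, 277, 213, 176, 178, 210),
  Total  = c(2629, 1625, 1386, 916, 636, 751, 618, 511, 497, 547)
)

# Recompute the totals from the three medal columns
recomputed <- df$Gold + df$Silver + df$Bronze

# Keep only the rows where the stated total disagrees with the recomputed one
mismatches <- df[df$Total != recomputed, ]

if (nrow(mismatches) == 0) {
  message("All totals match the sum of gold, silver, and bronze.")
} else {
  print(mismatches)
}

# Two further cheap checks: no missing values and no duplicated nations
stopifnot(!anyNA(df), anyDuplicated(df$Nation) == 0)
```

A check like this only catches figures that contradict each other; a value copied wrongly in both a medal column and the total would still pass, so comparing a few rows against the source page remains worthwhile.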

Considerations

  • This approach is best suited for smaller datasets. For larger ones, consider web scraping techniques.
  • Always verify the accuracy of the data and the code before proceeding.
  • ChatGPT limits how long a prompt can be, so lengthy datasets may need to be split across several prompts.
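For the larger datasets mentioned in the first point, a scraping package can read an HTML table directly. The following is a minimal sketch that assumes the rvest package, which this guide does not otherwise use. To stay self-contained it parses a small inline HTML table holding two rows from the example data; in practice read_html() can be given a page URL instead, and picking the right table on a real page usually takes some inspection of its HTML.

```r
library(rvest)

# A tiny stand-in for a web page; read_html() also accepts a URL
page <- read_html('
  <html><body>
    <table>
      <tr><th>Nation</th><th>Gold</th><th>Silver</th><th>Bronze</th></tr>
      <tr><td>United States (USA)</td><td>1061</td><td>830</td><td>738</td></tr>
      <tr><td>Russia (RUS)</td><td>608</td><td>515</td><td>502</td></tr>
    </table>
  </body></html>
')

# Select the first <table> element and convert it to a data frame (tibble)
medals <- html_table(html_element(page, "table"))

# Derive the total from the scraped columns rather than transcribing it
medals$Total <- medals$Gold + medals$Silver + medals$Bronze

print(medals)
```

The resulting table can be passed to ggplot in the same way as the data frame built by hand.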

To wrap up, we’ve showcased how AI and R can work together in your data analysis endeavors. Adding ggplot to your workflow allows for quick visualizations that make your insights stand out.

Let the speed of ChatGPT set the pace for your insightful analytical ventures!