How to Generate High-Quality Summaries from PDFs Using Python and OpenAI

Introduction

When using AI, like ChatGPT, to summarize documents, you may want to maintain control over the summary produced. For starters, consider the desired length and format, and whether the summary should be presented as bullet points or in continuous prose. Beyond that, a good summary builds progressively on the document’s content: its later sections should avoid reiterating points made earlier.

In this article, we’ll explore how to maintain control over AI-generated summaries using Python and the OpenAI API. We’ll demonstrate how to adjust the length and level of detail in your summaries by splitting documents into chunks and processing each individually. This approach ensures that the final summary is proportionate to the original document’s length and tailored to your preferences.

Check out the associated GitHub repository to download the code yourself!

Why Not Just Ask ChatGPT to Summarize?

If you ask ChatGPT to summarize a document, you’ll tend to get back a relatively short summary that isn’t proportional to the length of the document. For instance, a summary of a 20-page document will not be twice as long as a summary of a 10-page document.

The solution we will implement here is to split the document into pieces, called chunks. For each chunk, we will produce a summary. Once the AI has processed all the chunks, we can compile the full summary. By adjusting the number and size of the text chunks, we can control the level of detail in the output.

Prerequisites: Accessing the OpenAI API

Before you begin this tutorial, you will need to have access to the OpenAI API. If you are not familiar with what this is, then this is the article to get you started!

Summarizing Documents with the OpenAI API in Python

In this section, we’ll walk through Python code that uses the OpenAI API along with some custom functions to create a script that generates summaries. In particular, we will work through each step so you know how to control AI-generated summaries by splitting a document into chunks and summarizing each one individually.

Setting Up Your Python Environment

First, we import the necessary libraries:

Python
import os
from typing import List, Tuple, Optional
from openai import OpenAI
import tiktoken
from tqdm import tqdm
import fitz  # PyMuPDF

Custom Functions

Converting PDF Documents to Text Using Python

Next, we define a custom function to load a specified file. This could be, for example, a paper you want to summarize or a long-form article from the internet. We’ll use the fitz library from PyMuPDF to extract text from a PDF file.

Python
def convert_pdf_to_text(path):
    """Extracts text from a PDF file using fitz."""
    doc = fitz.open(path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
    
# Path to your PDF file
pdf_path = '/path/to/your/file.pdf'

# Extract text from the PDF
text = convert_pdf_to_text(pdf_path)

For this tutorial, we will use the Nobel Committee’s explanation of Goldin’s research from 2023. We do this by specifying the PDF file path as pdf_path = 'files/advanced-economicsciencesprize2023.pdf'.
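
In code, assuming the PDF is stored in a local files/ folder as described above, this looks like:

Python
# Path to the example PDF from this tutorial (adjust if your file lives elsewhere)
pdf_path = 'files/advanced-economicsciencesprize2023.pdf'

# Extract the full text from the PDF
text = convert_pdf_to_text(pdf_path)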

Before proceeding, we need to explain some terminology. We will measure the length of the file not by the total number of words it contains but by the number of tokens. For those not familiar with tokens, here’s a simple explanation:

What are tokens?

In natural language processing (NLP), tokens are the basic units of text that models, such as language models (e.g., GPT-4), use to process and generate language. Tokens are smaller parts of the text, and they can represent whole words, parts of words, or even punctuation marks.

Think of tokens as the building blocks for any text the model interacts with. The process of splitting text into tokens is called tokenization.

Example of Tokenization:

Let’s say we want to tokenize the word “fantastic”:

  • The word “fantastic” might be broken down into several tokens, such as "fan", "tas", and "tic". This shows that a single word can be broken down into smaller, meaningful pieces called tokens. This is because some parts of a word can be reused across different words. For example, “tic” could also appear in “tick”.

Similarly, a sentence like “Hello, world!” might be split into the following tokens:

  • "Hello"",""world""!" Here, we see that punctuation marks like commas and exclamation points also count as tokens.

General Token Rules:

  • In English, on average, one token equals about four characters or 0.75 words. This can vary, especially with other languages or technical terms. Rough guidelines:
    • 1,000 tokens ≈ 750 words
    • 1 token ≈ 4 characters

How Does Tokenization Work?

Tokenization is the process by which text is divided into these tokens. While for humans it’s easy to split words based on spaces, models need to account for:

  • Subword tokenization: The model doesn’t always split by spaces; it looks for parts of words or common word segments. This is why “fantastic” might be split into "fan", "tas", and "tic".
  • Punctuation handling: Special characters, punctuation, and spaces are also treated as individual tokens.

Why is Tokenization Important?

Tokens are critical because they dictate how the AI processes the text. The model doesn’t “see” the full text but rather interacts with the tokens. Each token has a unique meaning for the model, and it uses these tokens to generate responses or understand the input.

Tokenization in AI Models

In GPT-3.5, GPT-4, and similar models, tokenization helps the AI process large chunks of text efficiently. However, tokenization can vary across models:

  • Newer models like GPT-4 use more sophisticated tokenization techniques compared to earlier versions.
  • Different languages and writing systems may be tokenized differently, especially when dealing with non-Latin alphabets (like Chinese characters or Arabic script).

Visual Token Example:

Look at the sentence:
“This text contains a number x amount of tokens.”

  • In this case, the model splits the sentence into 11 tokens, even though it contains 48 characters. This happens because some tokens are very short (like “a”) and single punctuation marks (like periods) also count as tokens.

In another sentence:
“Many words map to one token, but some don’t: indivisible.”

Here, 14 tokens are produced from 57 characters because the tokenization process breaks down longer, complex words like “indivisible” into smaller segments.
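
If you want to inspect these splits yourself, here is a minimal sketch using the tiktoken library we imported earlier; the exact pieces depend on the encoding, so treat the printed output as illustrative:

Python
# Load the encoding used by gpt-4-turbo (the same model used later in this tutorial)
encoding = tiktoken.encoding_for_model('gpt-4-turbo')

sentence = "Many words map to one token, but some don't: indivisible."
token_ids = encoding.encode(sentence)

# Decode each token ID individually to see which text piece it represents
pieces = [encoding.decode([token_id]) for token_id in token_ids]
print(f"{len(token_ids)} tokens from {len(sentence)} characters")
print(pieces)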

The text you’ve processed can be divided into manageable chunks based on the number of tokens it contains. Different models have specific token limits that include both the input (prompt) and output (completion). For instance, some models, like GPT-4 Turbo, support up to 128,000 tokens. This token limit dictates how much text the model can handle in a single request.

Using Python, you can calculate the number of tokens in your text by selecting the appropriate token encoding for the model you’re using. In this case, the code snippet below shows how you can tokenize text for GPT-4 Turbo:

Python
# Select the encoding that matches the model you're using
encoding = tiktoken.encoding_for_model('gpt-4-turbo')

# Encode your text into tokens
tokens = encoding.encode(text)

# Get the total number of tokens
token_count = len(tokens)
print(f"The text contains {token_count} tokens.")

In our example, the text contains 23,158 tokens.

Implementing OpenAI API Call

Next, we create a custom function to simplify the process of sending messages to the AI model and receiving responses. This function takes a list of messages and an optional model name (we have set the default here to gpt-4-turbo), sends the messages to the specified OpenAI model, and then returns the AI’s reply.

This abstraction allows us to easily reuse this function throughout our script without repeatedly writing the API call logic.

Python
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def get_chat_completion(messages, model='gpt-4-turbo'):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
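
As a quick, hypothetical sanity check (it assumes the OPENAI_API_KEY environment variable is set and will make a real, billed API call):

Python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello in one short sentence."},
]
print(get_chat_completion(messages))  # prints the model's reply as a string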

Perfect. Now we will set up three more custom functions that work together to process the text: the Tokenization Function converts the input text into tokens according to the encoding scheme of the specified model (encodings differ across models). The Text Chunking Function splits the text into smaller pieces based on a specified delimiter. Finally, the Chunk Combination Function takes the individual pieces generated in the previous step and recombines them into larger chunks that stay within a defined maximum token count.

Building Custom Tokenization Function for PDFs

Python
def tokenize(text: str) -> List[int]:
    """Converts text into a list of token IDs using the gpt-4-turbo encoding."""
    encoding = tiktoken.encoding_for_model('gpt-4-turbo')
    return encoding.encode(text)

As we already explained earlier in the article, this tokenize function converts a given piece of text into a list of tokens. Tokens are the basic units that language models like GPT-4 use to process and generate text. They can be words, subwords, or even single characters, depending on the language and the model’s encoding scheme.
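
As a quick check, you can call it on a short string; the exact IDs and count depend on the encoding, so the values below are only indicative:

Python
# tokenize returns a list of integer token IDs, not strings
sample_tokens = tokenize("Hello, world!")
print(sample_tokens)       # a short list of integers
print(len(sample_tokens))  # e.g. 4 tokens for this string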

Building Chunking Function

This function is a bit more complex, so let’s break down what it does. The chunk_on_delimiter function splits a large text into smaller chunks based on a specified delimiter (like a period . or newline \n) and then combines these chunks without exceeding a maximum token count per chunk.

First, we define three input variables:

  • input_string: The text to be split.
  • max_tokens: The maximum number of tokens allowed per chunk.
  • delimiter: The character or string on which to split the text.

The function first splits the input text into smaller pieces wherever the delimiter occurs. It then calls another function, combine_chunks_with_no_minimum, to recombine these pieces into larger chunks that don’t exceed the max_tokens limit you specify. If a piece exceeds the limit even on its own, it may be dropped and a warning is issued. Finally, the function adds the delimiter back to the end of each combined chunk to maintain the structure of the text.

We are now left with a list of text chunks, each ending with the delimiter and staying within the token limit.

By splitting and recombining the text on delimiters rather than at arbitrary points, we maintain the logical flow of the text (keeping sentences intact).

Python
def chunk_on_delimiter(input_string: str,
                       max_tokens: int, 
                       delimiter: str) -> List[str]:
    chunks = input_string.split(delimiter)
    combined_chunks, _, dropped_chunk_count = combine_chunks_with_no_minimum(
        chunks, max_tokens, chunk_delimiter=delimiter, add_ellipsis_for_overflow=True
    )
    if dropped_chunk_count > 0:
        print(f"warning: {dropped_chunk_count} chunks were dropped due to overflow")
    combined_chunks = [f"{chunk}{delimiter}" for chunk in combined_chunks]
    return combined_chunks

Building a Combining Chunks Function

Now that the text has been split into individual pieces, we need to combine the smaller chunks into larger ones without exceeding a specified maximum token count.

The function receives 5 input variables:

  • chunks: A list of text pieces to combine.
  • max_tokens: The maximum token count per combined chunk.
  • chunk_delimiter: The string used to join chunks together.
  • header: An optional string to prepend to each chunk.
  • add_ellipsis_for_overflow: A boolean indicating whether to add ‘…’ when chunks are dropped due to size.

Combining Chunks Without Compromising Quality

The function first initializes variables to keep track of the combined chunks and their indices. It then goes through each chunk and checks whether it can be added to the current candidate chunk without exceeding the token limit. If an individual chunk is too large even on its own, it is skipped; if add_ellipsis_for_overflow is True, ‘…’ is added in its place to indicate the skipped content.

If adding the chunk keeps the token count within the limit, it appends it to the candidate. If it exceeds the limit, it finalizes the current candidate and starts a new one.

We end up with a tuple containing:

  • output: A list of combined text chunks within the token limit.
  • output_indices: For each combined chunk, the indices of the original chunks it contains.
  • dropped_chunk_count: The number of chunks that were too large and had to be dropped.
Python
def combine_chunks_with_no_minimum(
        chunks: List[str],
        max_tokens: int,
        chunk_delimiter="\n\n",
        header: Optional[str] = None,
        add_ellipsis_for_overflow=False,
) -> Tuple[List[str], List[List[int]], int]:
    dropped_chunk_count = 0
    output = []  # list to hold the final combined chunks
    output_indices = []  # list to hold the indices of the final combined chunks
    candidate = (
        [] if header is None else [header]
    )  # list to hold the current combined chunk candidate
    candidate_indices = []
    for chunk_i, chunk in enumerate(chunks):
        chunk_with_header = [chunk] if header is None else [header, chunk]
        if len(tokenize(chunk_delimiter.join(chunk_with_header))) > max_tokens:
            print(f"warning: chunk overflow")
            if (
                    add_ellipsis_for_overflow
                    and len(tokenize(chunk_delimiter.join(candidate + ["..."]))) <= max_tokens
            ):
                candidate.append("...")
                dropped_chunk_count += 1
            continue  # this case would break downstream assumptions
        # estimate token count with the current chunk added
        extended_candidate_token_count = len(tokenize(chunk_delimiter.join(candidate + [chunk])))
        # If the token count exceeds max_tokens, add the current candidate to output and start a new candidate
        if extended_candidate_token_count > max_tokens:
            output.append(chunk_delimiter.join(candidate))
            output_indices.append(candidate_indices)
            candidate = chunk_with_header  # re-initialize candidate
            candidate_indices = [chunk_i]
        # otherwise keep extending the candidate
        else:
            candidate.append(chunk)
            candidate_indices.append(chunk_i)
    # add the remaining candidate to output if it's not empty
    if (header is not None and len(candidate) > 1) or (header is None and len(candidate) > 0):
        output.append(chunk_delimiter.join(candidate))
        output_indices.append(candidate_indices)
    return output, output_indices, dropped_chunk_count

To get a grasp of the function, let’s walk through a hypothetical example:

Stylized Example

Let’s say you have the following text and you want to process it without exceeding 50 tokens per chunk:

“Hello there! This is a test of the chunking functions. We need to ensure that our text is split and combined properly. Token limits are important when working with language models. Let’s see how this works.”

Step 1: Splitting the Text

Using the chunk_on_delimiter function with a period . as the delimiter:

  • Splits into:
    1. “Hello there!”
    2. ” This is a test of the chunking functions”
    3. ” We need to ensure that our text is split and combined properly”
    4. ” Token limits are important when working with language models”
    5. ” Let’s see how this works”

Step 2: Combining Chunks

  • First Combined Chunk:
    • Starts with chunk 1.
    • Adds chunk 2; total tokens are within limit.
    • Adds chunk 3; total tokens still within limit.
  • Second Combined Chunk:
    • Tries to add chunk 4; would exceed token limit.
    • Finalizes first combined chunk with chunks 1-3.
    • Starts new combined chunk with chunk 4.
    • Adds chunk 5; total tokens within limit.

Step 3: Final Output

  • Combined Chunks:
    1. “Hello there! This is a test of the chunking functions. We need to ensure that our text is split and combined properly.”
    2. “Token limits are important when working with language models. Let’s see how this works.”

Each combined chunk respects the maximum token limit and ends at logical points in the text.
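
Here is a runnable version of this stylized example, once the chunking functions above are defined. Actual token counts and chunk boundaries depend on the tokenizer, so your output may differ slightly from the walkthrough:

Python
example_text = (
    "Hello there! This is a test of the chunking functions. "
    "We need to ensure that our text is split and combined properly. "
    "Token limits are important when working with language models. "
    "Let's see how this works."
)

# Split on periods and recombine into chunks of at most 50 tokens
example_chunks = chunk_on_delimiter(example_text, max_tokens=50, delimiter=".")
for i, chunk in enumerate(example_chunks, start=1):
    print(f"Chunk {i} ({len(tokenize(chunk))} tokens): {chunk}")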

Customizing the Summarization Function

Our last custom function generates summaries of a given text with an adjustable level of detail controlled by the detail parameter, which ranges from 0 to 1. A lower detail value results in a concise summary by summarizing the entire text as a whole, while a higher value produces a more detailed summary by splitting the text into more chunks and summarizing each one individually. This parameter allows users to fine-tune the granularity of the summary to suit their specific needs.

The function works by dynamically determining the number of chunks to split the text into based on the desired level of detail. It adjusts the chunk size accordingly and then summarizes each chunk separately. Optional features like recursive summarization (summarize_recursively) enable the function to consider previous summaries when processing new chunks, improving coherence and context. Additional customization is possible through the additional_instructions parameter, allowing you to provide specific guidelines to the language model.

Python
def summarize(text: str,
              detail: float = 0,
              model: str = 'gpt-4-turbo',
              additional_instructions: Optional[str] = None,
              minimum_chunk_size: Optional[int] = 500,
              chunk_delimiter: str = ".",
              summarize_recursively=False,
              verbose=False):


    # check detail is set correctly
    assert 0 <= detail <= 1

    # interpolate the number of chunks based to get specified level of detail
    max_chunks = len(chunk_on_delimiter(text, minimum_chunk_size, chunk_delimiter))
    min_chunks = 1
    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))

    # adjust chunk_size based on interpolated number of chunks
    document_length = len(tokenize(text))
    chunk_size = max(minimum_chunk_size, document_length // num_chunks)
    text_chunks = chunk_on_delimiter(text, chunk_size, chunk_delimiter)
    if verbose:
        print(f"Splitting the text into {len(text_chunks)} chunks to be summarized.")
        print(f"Chunk lengths are {[len(tokenize(x)) for x in text_chunks]}")

    # set system message
    system_message_content = "Rewrite this text in summarized form."
    if additional_instructions is not None:
        system_message_content += f"\n\n{additional_instructions}"

    accumulated_summaries = []
    for chunk in tqdm(text_chunks):
        if summarize_recursively and accumulated_summaries:
            # Creating a structured prompt for recursive summarization
            accumulated_summaries_string = '\n\n'.join(accumulated_summaries)
            user_message_content = f"Previous
            summaries:\n\n{accumulated_summaries_string}\n\nText to summarize next:\n\n{chunk}"
        else:
            # Directly passing the chunk for summarization without recursive context
            user_message_content = chunk

        # Constructing messages based on whether recursive summarization is applied
        messages = [
            {"role": "system", "content": system_message_content},
            {"role": "user", "content": user_message_content}
        ]

        # Assuming this function gets the completion and works as expected
        response = get_chat_completion(messages, model=model)
        accumulated_summaries.append(response)

    # Compile final summary from partial summaries
    final_summary = '\n\n'.join(accumulated_summaries)

    return final_summary

Practical Use Cases

Now that we’ve looked extensively at the code, we understand how the summarization function works. Let’s focus on some practical use cases. In this section, we’ll demonstrate how to apply the summarize function to generate summaries with varying levels of detail and customization. We’ll see how adjusting the detail parameter influences the length and depth of the summaries, and we’ll give an example of how additional instructions can tailor the output to specific formats or focus.

You can set the summarize_recursively parameter to True for recursive summarization, where each summary is based on the previous summaries, adding more context to the summarization process. This is more computationally expensive but can increase the consistency and coherence of the combined summary.
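
For example, a hypothetical call with recursion enabled (parameter values chosen purely for illustration) looks like this:

Python
# Each chunk summary can now take the previously generated summaries into account
recursive_summary = summarize(text, detail=0.25, summarize_recursively=True, verbose=True)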

Using the Detail Parameter in Action

By increasing detail from 0 to 1, we get progressively longer summaries of the underlying document. A higher value for the detail parameter results in a more detailed summary because the utility first splits the document into a greater number of chunks. Each chunk is then summarized, and the final summary is a concatenation of all the chunk summaries.

Python
summary_with_detail_0 = summarize(text, detail=0, verbose=True)
Splitting the text into 1 chunks to be summarized.
Chunk lengths are [23159]
Output Summary with Detail 0

The 2023 Sveriges Riksbank Prize in Economic Sciences was awarded to Claudia Goldin for her significant contributions to understanding women’s labor market outcomes. Goldin’s research has provided a comprehensive analysis of the economic history of women, revealing the multifaceted nature of gender differences in the labor market, including the persistent gender gaps in participation, earnings, and advancement opportunities. Her work has highlighted the historical shifts in women’s work from home to marketplace and the impact of various factors such as education, technological changes, and social norms on these shifts. Goldin’s framework connects education, fertility, and productivity to the evolution of women’s roles in the economy, emphasizing the constraints women face due to social norms and institutional barriers. Her findings suggest that reducing gender disparities in employment could significantly boost global GDP, making gender equality not only a matter of fairness but also economic efficiency.

Python
summary_with_detail_pt25 = summarize(text, detail=0.25, verbose=True)
Splitting the text into 14 chunks to be summarized.
Chunk lengths are [1746, 1736, 1748, 1770, 1742, 1775, 1747, 1753, 1751, 1762, 1777, 1778, 1778, 299]
Output Summary with Detail 0.25

The Royal Swedish Academy of Sciences awarded the 2023 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel to Claudia Goldin for her contributions to understanding women’s labor market outcomes. Goldin’s research has provided a comprehensive historical and economic analysis of gender differences in the labor market, highlighting the persistent underrepresentation and wage gaps women face globally. Despite higher participation rates in high-income countries and legislative advances, significant disparities remain. Goldin’s work utilizes a framework that connects education, fertility, and productivity to the evolution of women’s roles in the economy, offering insights into the economic and social constraints that influence women’s labor decisions. Her findings challenge the notion that female labor participation directly correlates with economic development, instead suggesting a more complex interplay of demand and supply factors shaped by societal norms and institutional barriers.

Claudia Goldin’s research in 1990 revealed a U-shaped trend in the long-term evolution of female labor market participation in the U.S., challenging previous data by showing that the upward trend began later than previously thought. This U-shaped pattern, initially hypothesized from cross-country data, was first documented within a single country’s development through Goldin’s work. She identified the primary drivers of these trends as changes in the employment of married women, influenced by the expansion of white-collar jobs, technological advancements, and increased access to education. However, social stigmas and institutional barriers like marriage bars limited the full impact of these opportunities.

Goldin also provided the first evidence of how gender earnings gaps evolved, showing that structural shifts in labor demand historically benefited women even before wage equality movements. Despite a narrowing earnings gap during the Industrial Revolution and the rise of clerical work, wage discrimination increased during this period. Post-1930s, the earnings gap stabilized despite significant economic growth and increased female participation.

Goldin’s further studies highlighted the slow change in labor market outcomes, influenced by cohort effects where each generation’s opportunities and decisions impact overall trends. She noted significant shifts in the 1970s with increased female education and the introduction of the birth control pill, which allowed women to delay marriage and invest in careers.

In contemporary times, despite higher education and participation rates among women, earnings gaps persist, primarily due to parenthood and workplace inflexibility. Goldin’s research underscores the importance of understanding historical and economic contexts to address contemporary gender gaps effectively, providing valuable insights for policymakers in designing interventions to reduce these disparities.

The document discusses the evolution of gender gaps in labor markets, focusing on Claudia Goldin’s contributions to understanding female labor force participation and wage rates over the past 250 years. Goldin’s research extended the timeline of female labor market participation back to the late 1700s, revealing a U-shaped trend over time. She also highlighted the significant role of married women re-entering the labor market post-child-rearing years and the persistence of wage discrimination despite economic and social progress. Goldin’s findings suggest that the gender earnings gap is now more influenced by within-occupation differences rather than between different occupations. The document further outlines the structure of the remaining sections, which delve deeper into these topics and discuss implications for contemporary policy debates.

Goldin’s research highlights the historical trends in female labor force participation in the U.S., showing a U-shaped curve over time. Initially, female participation, especially among young and unmarried women, increased during the early stages of industrialization, as they found employment in manufacturing. However, in the late 19th century, participation rates began to decline due to societal pushback and protective legislation, which introduced barriers to women’s employment. Goldin also explored the role of married women, revealing their significant but often unrecorded contribution to the labor market through “hidden market work” in family businesses. Over the 20th century, the participation of married women in the labor market increased significantly, particularly among white married women, contrasting with the already high participation rates among black married women dating back to 1900. Goldin’s work has inspired further research into women’s economic roles and their impact on long-term economic development.

The labor force participation rate of married women was under 6% in 1790 but reached around 50% by 1970, showing a significant increase over the century. This rise was primarily due to higher rates of reentry into the workforce by married women later in life, rather than a change in the pattern of women leaving jobs upon marriage. The participation rates of white married women born between 1866 and 1965 illustrate this trend, with significant increases observed from the 1950s onward. For instance, participation rates for women aged 50 doubled every decade from 1940 to 1960.

Claudia Goldin’s research in 1990 provided a detailed analysis of the gender earnings gap, revealing that while the gap narrowed during the Industrial Revolution and the rise of white-collar jobs, it remained relatively stable from 1880 to the 1960s. The nature of wage discrimination evolved, with significant disparities emerging in white-collar jobs by 1940 due to the difficulty in monitoring productivity and the rise of long-term employment contracts and internal promotion systems that favored men. This structural change in labor markets led to increased wage discrimination in modern labor markets, where gender often influenced promotion decisions and pay scales.

Over the past fifty years, there has been a significant convergence in the gender earnings gap, yet women continue to earn less than men, with a 13% average gap across OECD countries in 2020. This gap varies by country, with Sweden at 7% and the US at 18%. Historically, differences in human capital and occupational choices between genders contributed to this gap, but over time, as women’s education and career choices have aligned more closely with men’s, the focus has shifted. Claudia Goldin’s research highlights that the majority of the current earnings gap stems from differences within occupations rather than between them. Despite women increasingly surpassing men in educational attainment, particularly in non-STEM fields, earnings disparities within the same occupations persist and have become a more significant factor in the gender earnings gap. Goldin’s findings suggest that addressing within-occupation disparities could have a greater impact on closing the gender earnings gap than equalizing gender distribution across different fields.

The evolution of labor market gender gaps has been significantly influenced by structural changes such as the shift from agriculture to manufacturing, the rise of clerical work, and the expansion of the service sector. These changes have affected female employment and earnings differently across various stages of economic development. Claudia Goldin’s research highlights how these shifts, along with technological innovations and educational reforms, have played a crucial role in altering women’s labor market outcomes from the 19th century onwards.

In the early 19th century, the Industrial Revolution led to increased female labor force participation, particularly among young unmarried women in manufacturing, which helped narrow the gender earnings gap. However, social norms and the separation of home and work reduced married women’s participation in urban areas. The rise of white-collar jobs from 1890 to 1930 further narrowed the earnings gap, although participation rates changed little. Technological advances in office equipment and the growth of secondary education, particularly the high school movement, transformed clerical work and educational attainment, leading to a predominantly female clerical workforce by the early 20th century.

Overall, these structural and societal changes have shaped women’s decisions regarding education, employment, and family life, significantly impacting the gender dynamics in the labor market.

The clerical sector became more attractive to women due to better working conditions and higher wages compared to manufacturing jobs, leading to a significant shift of women from manufacturing and domestic service into clerical roles. This shift, rather than an increase in overall female labor participation, contributed to the “feminization” of the office. Despite the growth of higher-paying clerical jobs, female participation rates changed marginally from 1890 to 1930, largely because societal norms and regulations, including marriage bars, limited women’s continued employment upon marriage.

Marriage bars, which were prevalent especially in teaching and clerical jobs, prevented the hiring or continued employment of married women. These bars were mostly abolished by the 1940s due to economic pressures and a shortage of young female workers, coupled with a rise in demand for clerical workers. This abolition, along with technological advancements in home production, enabled more married women to re-enter the workforce. The labor market adapted by offering more part-time work, allowing women to balance work and home responsibilities.

From 1970 onwards, a significant change occurred as women began to invest more in education, leading to higher college attendance and graduation rates than men. This investment contributed to a narrowing of the gender wage gap starting around 1980. Factors driving this change included shifting expectations about future employment and the introduction of oral contraceptives, which allowed women to plan their careers and educations more effectively.

The employment predictions for cohorts born in 1947/1948 and 1958/1959 varied significantly, with the latter aligning more closely with the higher employment rates observed in later years. As women’s expectations for their careers evolved, they increasingly invested in higher education, particularly from the 1980s onward, as the returns on college education grew. This period, termed the “quiet revolution” by Claudia Goldin, saw a significant rise in women entering college and professional programs, which contributed to narrowing the gender earnings gap.

The introduction of the birth control pill in the 1960s played a crucial role in this transformation by enabling women to delay marriage and childbirth, thus investing more in their education and careers. State-specific changes in laws during the early 1970s further facilitated access to the pill for young unmarried women, amplifying its impact on their educational and professional choices.

Despite these advances, a gender earnings gap persists, largely due to the “parenthood effect,” where parenthood impacts women’s earnings negatively while often boosting men’s earnings. Studies, such as those conducted by Bertrand, Goldin, and Katz, reveal that even among highly educated professionals like MBA graduates, significant gender earnings disparities emerge over time, exacerbated by career breaks associated with motherhood.

The gender earnings gap, initially observed at 11 log-points at graduation, widens to 31 and 60 log-points after 5 and 10 years, respectively. Research indicates that 84% of this gap can be attributed to factors such as MBA courses and performance, post-MBA experience, time out of the labor market, and hours worked. The primary driver of these differences, particularly in labor supply and career interruptions, is parenthood, with women facing significant employment and earnings reductions post-childbirth, unlike men whose earnings may increase.

Studies across various countries confirm that the parenthood effect is a major contributor to the gender earnings gap, with the impact varying by country. The lack of workplace flexibility is highlighted as a significant factor, where women often face a wage penalty for needing flexible work arrangements to manage child-rearing responsibilities. This issue is exacerbated in jobs that require constant availability and have low substitutability of workers. Some occupations like pharmacy, where worker substitutability is high, show a smaller gender earnings gap.

While workplace flexibility is a key factor, other potential explanations include entrenched gender stereotypes and societal expectations, which may influence women’s career and family decisions. The debate continues on the exact mechanisms driving the gender earnings gap, with ongoing research exploring various dimensions of this complex issue.

Goldin’s research on the U.S. labor market highlights how female labor market outcomes have evolved due to various factors such as industrialization, technological changes, shifts in gender norms, educational opportunities, and institutional changes. These factors have contributed to understanding gender gaps in labor markets, particularly in developing countries. Goldin’s analysis, using data from around 130 countries, shows a correlation between economic development and female labor force participation, suggesting that the patterns observed in the U.S. are applicable globally. Her work indicates that as economies develop, factors like education and social norms play crucial roles in determining the extent and pace of increase in female employment rates. Goldin’s findings have significant policy implications, emphasizing the need to understand the root causes of gender gaps and the interaction of various factors to effectively address these disparities. Her research also underscores the importance of considering women’s expectations about the future in labor supply decisions, highlighting the intertemporal nature of these decisions.

Kleven et al. (2022) and Andersen and Nix (2022) studied the impact of family-friendly policies in Austria and Norway, respectively, finding limited effects on social norms and preferences, with childcare in Norway reducing the parenthood effect by almost a quarter. Research suggests that government policies may influence future expectations and decisions of young women regarding their careers, highlighting the importance of understanding how these expectations are formed, including the role of female role models in challenging stereotypes.

Claudia Goldin’s extensive research over 40 years has significantly contributed to understanding gender convergence in the labor market, emphasizing the role of family, children, and work organization. Her work has inspired further studies on historical labor market outcomes, natural experiments to understand gender gaps, and the impact of parenthood and workplace structures on these gaps. Goldin’s contributions have helped establish the economics of gender as a vital area of economic research, integrating economic history with applied economics.

Claudia Goldin has extensively researched and written on the transformation of women’s roles in the workforce, education, and family life, highlighting significant shifts such as the impact of oral contraceptives on women’s career and marriage decisions, and the evolution of gender equality in professions like pharmacy. Alongside Lawrence F. Katz, Goldin has explored topics ranging from the economic implications of education and technology to workplace flexibility for high-powered professionals. Their work also includes studies on the wage structure in the United States during the mid-20th century and the impact of “blind” auditions on female musicians. Other scholars like Henrik Kleven and Claudia Olivetti have contributed to understanding gender inequality and labor force participation through various international and historical perspectives, examining family policies and structural transformations over time. This body of research collectively provides a deep insight into the dynamics of gender roles and economic development from the early industrial era to the present.

This text appears to be a list of references from various academic works focusing on women’s participation in the labor force, wage disparities, and the impact of family life on women’s careers. The sources span from historical analyses of women’s work from the late 19th century to contemporary studies on topics such as the gender pay gap among highly paid professionals and the specific challenges faced by women with children. The references include journal articles, books, and data from the World Bank, covering a broad time frame and a variety of economic and social perspectives.

Python
summary_with_detail_pt5 = summarize(text, detail=0.5, verbose=True)
Splitting the text into 26 chunks to be summarized.
Chunk lengths are [901, 904, 907, 914, 923, 889, 916, 898, 925, 917, 903, 890, 925, 908, 917, 864, 922, 910, 920, 902, 903, 913, 919, 904, 907, 458]
Output Summary with Detail 0.5
Python
summary_with_detail_1 = summarize(text, detail=1, verbose=True)
Splitting the text into 49 chunks to be summarized.
Chunk lengths are [427, 474, 483, 456, 501, 482, 489, 488, 458, 457, 477, 470, 473, 486, 490, 478, 407, 477, 501, 466, 484, 436, 470, 498, 487, 486, 493, 491, 493, 453, 470, 489, 484, 499, 499, 489, 490, 466, 491, 484, 479, 500, 490, 501, 492, 501, 491, 496, 122]
Output Summary with Detail 1

Remember that the original document is 23,158 tokens long:

  • summary_with_detail_0: 156 tokens
  • summary_with_detail_pt25: 3182 tokens
  • summary_with_detail_pt5: 5668 tokens
  • summary_with_detail_1: 8837 tokens

Notice how large the gap is between the length of summary_with_detail_0 and summary_with_detail_1: the latter is roughly 56 times longer!
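
If you want to reproduce these comparisons yourself, a small sketch reusing the tokenize helper defined earlier (exact counts depend on the model’s encoding) is:

Python
# Compare the token length of each summary to the original document
for name, summary in [("detail=0", summary_with_detail_0),
                      ("detail=0.25", summary_with_detail_pt25),
                      ("detail=0.5", summary_with_detail_pt5),
                      ("detail=1", summary_with_detail_1)]:
    print(f"{name}: {len(tokenize(summary))} tokens")
print(f"Original document: {len(tokenize(text))} tokens")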

Customizing Summaries with Additional Instructions

We can not only specify the length of the summary; we can also adjust the format and focus of our summary:

Python
summary = summarize(text, detail=0.2, verbose=True, additional_instructions="Focus on explaining the evolution of the gender gaps in employment and earnings and format in bullet points")

Conclusion

We have walked you through an extensive script that allows you to control the length and detail of AI-generated summaries. Splitting documents into manageable chunks and summarizing each one allows you to produce summaries that are proportionate to the original text’s length. Adjusting parameters like the detail level gives you fine-grained control over the summary’s depth, ranging from concise overviews to detailed analyses, while the additional_instructions argument lets you customize the focus and format of the summary. Lastly, you can set the summarize_recursively parameter to True for recursive summarization, where each summary builds on the previous summaries, adding more context to the summarization process. However, keep in mind that this is computationally more expensive.

This method overcomes the limitations of standard AI summarization tools, which often provide summaries that are too brief for longer documents. By using custom functions for tokenization, chunking, and combining text, you create coherent, logically flowing summaries.

To download the code, check out the following GitHub repository:

Github: AI-Document-Summarizer