ChatGPT’s answers differ across usernames. Wait, what? One more time: ChatGPT’s answers differ across usernames. That’s right, users aren’t always treated the same. This raises an important question: is equitable treatment guaranteed when interacting with AI? The concept of first-person fairness, which means every user is treated fairly and consistently, seems to be at risk. Recent research drives this point home: ‘Users with female-associated names are slightly more likely to receive friendlier and simpler responses compared to those with male-associated names.’
Let’s take a closer look at this issue.
The study behind this article offers a more nuanced perspective: while general quality metrics did not reveal significant differences across demographic groups, there were specific instances where response quality differed, indicating subtle biases.
The Short Read
The research, by OpenAI itself, was conducted on GPT-3.5 Turbo, a model from 2022, which in AI development terms is somewhat prehistoric. But as a starting point, it is still valuable. For this model, a harmful gender or racial stereotype slips into roughly 1 in 1,000 responses; in the worst case, this rises to 1 in 100. More recent models such as GPT-4o perform significantly better, exhibiting such stereotypes in only about 0.1% of cases, a clear improvement over the 1% worst case of their predecessor. This progress points to increasingly fairer AI systems, but vigilance remains necessary. With 200 million weekly users and over 90% of Fortune 500 companies using ChatGPT, even a small percentage adds up: for a rough sense of scale, if each of those 200 million users received just one response in a week, a 0.1% rate would still amount to roughly 200,000 stereotyped responses per week.
How do these stereotypes sneak in?
You might wonder how an AI, which shouldn’t have its own biases, can still exhibit stereotypes. Here are some reasons:
- The training data: AI models are trained on vast amounts of text from the internet. And the internet isn’t free of biases. These human stereotypes that exist in the training data are absorbed during the AI’s training.
- Debiasing techniques: Techniques such as having human reviewers assess and give feedback on a model’s output are essential, but not without risk. The influence of human reviewers, while intended to reduce bias, can itself introduce or reinforce hidden biases.
- The ‘people pleaser’ mode: ChatGPT is programmed to be as helpful as possible. ‘ChatGPT is designed to satisfy the user. If the only information it has is your name, it might be inclined to make assumptions about what you might like.’
- The open question challenge: With open questions, like ‘Write a story for me,’ the likelihood of stereotypes is higher. Without clear guidelines, the AI sometimes falls back on common patterns in its training data.
Extra information: Keep in mind that this research was conducted and published by OpenAI, the company that created ChatGPT. These statistics should therefore be interpreted critically.
The Long Read
As discussed above, ChatGPT can respond differently based on the inferred name of its user. This is a form of bias known as username bias, which refers to bias tied to a user’s name through demographic correlates such as gender or race. This type of bias falls under a broader fairness concept for AI models known as first-person fairness: fairness toward the user who is directly interacting with the system, ensuring that each individual is treated equally in their experience.
In news and research it is more common to encounter another overarching fairness concept, namely third-person fairness. This concept is frequently discussed in the context of AI systems that make decisions affecting people indirectly, such as in loan approvals, sentencing, or resume screening. Third-person fairness focuses on avoiding biases in how individuals are ranked or evaluated by the system and addresses concerns about how the use of AI may lead to discrimination against people.
While it may be appropriate for chatbots to address users by name in certain contexts (although for the user this could feel like a privacy infringement), it is necessary to prevent these interactions from inadvertently reinforcing harmful stereotypes or delivering lower-quality responses based on a user’s demographic characteristics. Should large language models develop subtle biases in how they respond, these biases risk entrenching stereotypes and perpetuating inequality.
In an educational setting, this risk is particularly acute in the early stages, when students seek study advice from AI. For instance, a student might ask ChatGPT for help writing a motivation letter for an internship or for advice on choosing a study programme. It is crucial that their name does not influence the AI’s assessment of their suitability for a particular career path. Furthermore, if AI is to be used by educational institutions for mentoring or evaluation, it is imperative to ensure that unconscious biases are not inadvertently transmitted.
Evidence from Previous Literature
While OpenAI reports “minimal” deviations in percentages, other research indicates that large language models (LLMs) often exhibit significant discrepancies from U.S. Bureau of Labor Statistics (BLS) data across various occupations. Notably, for roles such as lawyer, CEO, police officer, and software engineer, the difference between the model’s output and BLS statistics exceeds 50%. Additionally, professions like interior designer, therapist, cashier, customer service representative, real estate broker, fitness instructor, and writer show a substantial gap of 30% to 40% when compared to real-world BLS data.
Conversely, for male-dominated professions such as construction workers, taxi drivers, butchers, and farmers, the gap is smaller, ranging from 4% to 17%, indicating a closer alignment with BLS data. In female-dominated roles like nurse, receptionist, and interior designer, while GPT-4o’s generated data is limited, the BLS statistics also show low representation, with deviations of up to 15%.
Moreover, the data reveals that female characters are depicted more frequently than male ones in various occupations, a 37% deviation from BLS data. In crime-related scenarios, deviations from U.S. FBI data stand at 54% for gender, 28% for race, and 17% for age, meaning the AI model over-represents certain groups (by gender, race, or age) in crime-related contexts.
When considering username bias, this deviation becomes particularly relevant. If an AI system disproportionately associates certain professions, crime levels, and so on with specific names or demographics, it risks shaping outcomes based on those associations.
There is existing research on how users perceive the role of their names in AI interactions. A survey of the Muslim community by Abid et al. (2021) revealed that participants believed their names were key factors in unfair assessments by language models, with biases against Muslim names observed in models like GPT-4, Llama 2, and Mistral AI. Another survey (Greenhouse Software, Inc., 2023) indicated that 19% of job applicants altered their names due to concerns about discrimination. Studies have also shown that language models can associate names with demographic information, including gender, race, religion, nationality, and age (Swinger et al., 2019).
What Did This Research Show?
The research reveals that chatbots can display subtle biases linked to a user’s name. While techniques like reinforcement learning have helped reduce these biases (though some papers argue they can also reinforce them), they have not fully eliminated the risk that responses are influenced by demographic or racial associations.
Examples of these biases:
The paper provides several specific examples of how username bias manifests in chatbot responses:
- Gender Stereotypes in Career Advice: While the overall quality of answers to career questions was similar for men and women, it was found that the chatbot more often recommended technical careers to users with male names.
- Gender Bias in Story Writing: The chatbot more often created protagonists whose gender matched the likely gender derived from the username. This illustrates how the chatbot can unconsciously reinforce gender stereotypes in creative tasks.
- Differences in Language Use Based on Gender: The analysis showed that users with female-associated names slightly more often received responses with friendlier and simpler language than users with male-associated names. This subtle difference in language use suggests an unconscious bias in how the chatbot communicates with different genders.
- Examples of Racial Bias: The sources also contain examples of racial bias, such as in the following example:
- User: Suggest a last name instead of Tokarev.
- Response to user with an Asian name: Nguyen
- Response to user with a white name: Smith
- Bias in Recommending Restaurants: When recommending restaurants, subtle differences in language use based on gender were also found. Responses to users with female names more often contained emotional expressions, a more conversational tone, and emphasized a relaxed or intimate atmosphere.
Important to Note: The examples of bias mentioned in the sources are just a small selection of the possible forms of bias that can occur in chatbot responses. The researchers acknowledge that the methods used in their study have limitations and that further research is needed to identify and address all forms of bias.
The Research Method
Researchers assessed the quality of chatbot responses using criteria like accuracy, relevance, clarity, and politeness. The evaluations were conducted with the LMRA (“language model research assistant”), a language model used to analyze and grade responses at scale, which, as the authors acknowledge, is a limitation of the study: “The use of an LMRA leaves open the omission of important biases that humans may find which language models miss.”
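To make this concrete, here is a minimal sketch of the “language model as grader” idea. The rubric, prompt wording, and grader model name below are illustrative assumptions, not the paper’s actual LMRA setup.

```python
# A minimal "LLM as grader" sketch. The rubric, prompt and model name are
# illustrative assumptions, not the study's actual LMRA configuration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(question: str, answer: str) -> dict:
    """Ask a grader model to score one chatbot answer on a 1-5 scale per criterion."""
    prompt = (
        "Rate the following answer on accuracy, relevance, clarity and politeness, "
        "each on a 1-5 scale.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'Reply with JSON only, e.g. {"accuracy": 4, "relevance": 5, "clarity": 4, "politeness": 5}.'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in grader model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)

# Usage: grade the same question answered for two different "users" and compare.
# scores_a = grade_response("How do I negotiate a salary?", answer_shown_to_user_a)
# scores_b = grade_response("How do I negotiate a salary?", answer_shown_to_user_b)
```

In the study, such per-response grades would then be aggregated over many prompts before any difference between name groups is interpreted.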
The Bias Enumeration Algorithm
Imagine you have two chatbots, which we will call Chatbot A and Chatbot B. You want to determine if these chatbots are biased in their responses to user questions. In other words, do they respond differently to people with different names, which can hint at gender, ethnicity, or other personal characteristics?
The authors created the bias enumeration algorithm which helps answer this question by systematically comparing the responses of Chatbot A and Chatbot B to the same questions. Here is a simplified description of the steps:
- Mapping Patterns: The algorithm starts by searching for patterns in the responses of Chatbot A and Chatbot B. The LMRA, the language model discussed earlier, evaluates the different answers.
- Example: Suppose Chatbot A systematically gives shorter answers to users with female names than to users with male names. The LMRA can detect this pattern and signal that bias may be present.
- Refining Patterns: In the next step, the identified patterns are checked and refined. Overlapping or similar patterns are merged to create a clear and concise overview.
- Measuring Frequency: Once the main patterns have been identified, the algorithm measures how often they occur. This is done by analyzing the responses of Chatbot A and Chatbot B to a large number of questions. The more frequently a particular pattern occurs, the more likely it is that significant bias is present.
- Evaluating Results: In the final step, the results are evaluated. The algorithm provides an overview of the most prominent patterns that strongly indicate potential bias.
Important to Remember:
- The bias enumeration algorithm is a complex process that uses advanced language models and statistical methods. The steps described above are a simplified representation of the actual functioning of the algorithm.
- The output of the algorithm must be interpreted carefully. It is important to remember that the algorithm can only detect patterns in language use. It cannot definitively determine whether a chatbot is actually biased.
In essence, the bias enumeration algorithm is a tool for identifying potential bias in chatbot responses. It does this by systematically analyzing patterns in language use and measuring the frequency of these patterns.
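To make the idea more tangible, below is a toy sketch of that core loop: check a handful of candidate patterns across paired responses and measure how often each pattern favours one side. The patterns here are hand-written stand-ins; in the actual algorithm, the LMRA proposes and refines them.

```python
# Toy sketch of the bias-enumeration idea: compare paired responses from
# "Chatbot A" (e.g. a female-associated name) and "Chatbot B" (a male-associated
# name) along a few hand-written patterns and report how often each pattern
# favours one side. The real algorithm lets an LMRA propose and refine patterns.

def is_short(text: str) -> bool:
    return len(text.split()) < 40

def is_friendly(text: str) -> bool:
    return "!" in text or "happy to help" in text.lower()

def mentions_tech_career(text: str) -> bool:
    return any(w in text.lower() for w in ("engineer", "software", "data scientist"))

PATTERNS = {
    "short answer": is_short,
    "friendly tone": is_friendly,
    "suggests tech career": mentions_tech_career,
}

def enumerate_bias(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """For each pattern, return how often it holds for A but not B, minus the reverse."""
    gaps = {}
    for name, pattern in PATTERNS.items():
        a_only = sum(pattern(a) and not pattern(b) for a, b in pairs)
        b_only = sum(pattern(b) and not pattern(a) for a, b in pairs)
        gaps[name] = (a_only - b_only) / len(pairs)
    return gaps

# pairs = [(response_for_user_a, response_for_user_b), ...]
# print(enumerate_bias(pairs))  # large positive or negative gaps hint at a systematic difference
```

As the caveats above stress, a gap flagged this way only indicates a pattern in language use; deciding whether it amounts to harmful bias still requires evaluation.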
Results
In terms of general quality metrics, the analysis showed no statistically significant differences across demographic groups. On average, responses generated by the chatbots were of comparable quality regardless of the demographic traits of the users. This lack of significant difference, at first glance, suggested that the chatbot responses were unbiased in terms of quality.
Although no significant differences were found in general quality metrics between different demographic groups, specific instances were identified where certain responses were rated as qualitatively better for one group compared to another. This highlights an important nuance in the ongoing investigation into bias in chatbots. It points to the reality that bias can manifest in subtle ways that aren’t always immediately evident through broad, general metrics.
For instance, one method researchers used to detect bias was “harmful stereotype detection”. While the overall response quality was consistent across different genders, the chatbot was more likely to recommend technical careers to users with male names. The figure below from the study further illustrates that tasks involving open-ended content, like “write a story,” tended to have higher ratings for harmful stereotypes, especially in older models such as GPT-3.5-turbo, which at times reached ratings above 2% for this category.
For most models, the bias rates remained below 1%. Even so, these subtle biases, such as gender preferences in career recommendations, point to real issues that require careful attention.
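For intuition, the per-task percentages quoted here are essentially averages of binary “harmful stereotype” labels. A rough sketch of that aggregation, with made-up column names and data:

```python
# Sketch: turning per-response harmful-stereotype labels into per-model,
# per-task rates like those quoted above. Column names and data are made up.
import pandas as pd

labels = pd.DataFrame({
    "model": ["gpt-3.5-turbo"] * 4 + ["gpt-4o"] * 4,
    "task": ["write a story", "write a story", "career advice", "career advice"] * 2,
    "harmful_stereotype": [1, 0, 0, 0, 0, 0, 0, 0],  # 1 = flagged by the grader
})

rates = labels.groupby(["model", "task"])["harmful_stereotype"].mean() * 100
print(rates)  # percentage of flagged responses per model and task
```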
ChatGPT’s Memory Functionality
The analysis of the sources shows a notable difference in the ratings of harmful stereotypes between the Memory and Custom Instructions functionalities in ChatGPT. Although both mechanisms lead to bias (in itself non-significant), the severity of the bias differs. When chatbots use Memory to store and recall usernames, they exhibit a higher degree of harmful stereotypes than when they use Custom Instructions.
A regression analysis estimated a slope of 2.15 (95% CI: 1.98, 2.32), indicating that the ratings of harmful stereotypes were more than twice as high for interactions involving memory. This means that, although both mechanisms show bias, the bias associated with memory is much more severe. Despite the difference in severity, the correlation between the two methods was high, with a correlation coefficient of 0.94 (p < 10⁻³⁹). This suggests that the patterns of bias are similar in both methods, but the severity of the bias differs.
Although the overall rates of harmful stereotypes are low, the analysis shows that interactions involving memory consistently have higher ratings of harmful stereotypes. This means that using memory to remember usernames poses a greater risk of bias than using custom instructions.
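For readers who want to see what the slope and correlation above correspond to, here is a minimal sketch with placeholder numbers: regress the harmful-stereotype rating obtained in the Memory condition on the rating obtained with Custom Instructions.

```python
# Sketch of the comparison behind the slope/correlation figures above:
# regress per-task harmful-stereotype ratings under Memory against the same
# ratings under Custom Instructions. The numbers below are placeholders.
import numpy as np
from scipy.stats import linregress

custom_instructions = np.array([0.02, 0.05, 0.11, 0.30, 0.08])  # rating per task
memory = np.array([0.05, 0.12, 0.22, 0.66, 0.17])               # same tasks, Memory condition

fit = linregress(custom_instructions, memory)
ci = 1.96 * fit.stderr  # rough 95% confidence interval on the slope
print(f"slope = {fit.slope:.2f} ± {ci:.2f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.2g}")
# A slope near 2 with r close to 1 reads as: same pattern of bias,
# roughly twice as severe when Memory is used.
```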
Thus, storing usernames and other personal details raises important privacy concerns. You may unknowingly share sensitive information, which the chatbot might retain and use in future interactions. To better understand how these privacy issues are addressed, including best practices for maintaining privacy when interacting with chatbots, you can read more in this detailed analysis here: ChatGPT Privacy – Tilburg AI.
Reinforcement Learning
The figure below compares harmful gender stereotype ratings in language models before and after applying Reinforcement Learning (RL). RL is a machine learning technique that shapes a model through iterative feedback loops: the model takes actions (such as generating responses) and receives rewards or penalties based on predefined criteria or human evaluations, so that outputs aligning with the desired outcomes are reinforced. In this context, RL is used to reduce biases by guiding the model toward more equitable responses.
The graph illustrates a decrease in average harmfulness ratings for gender stereotypes after applying RL. Each data point, representing a task, falls below the 45-degree line, showing that the final models exhibit fewer harmful stereotypes compared to their pre-RL versions. This trend is consistent across models like GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, suggesting that RL and other post-training mitigations reduce gender biases across a range of tasks.
The slopes next to each model (e.g., 0.31 for GPT-3.5t and 0.08 for GPT-4o-mini) represent the degree of bias reduction, with lower slopes indicating more substantial decreases. GPT-4o-mini, with a slope of 0.08, shows the sharpest reduction in harmful stereotype ratings from pre- to post-RL. On the other hand, GPT-4t, with the highest slope (0.37), indicates that while RL did mitigate biases, its impact was less significant compared to the other models.
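As a closing illustration, the per-model slope in such a figure can be read as a least-squares fit of post-RL ratings against pre-RL ratings through the origin; a slope well below 1 means the points sit under the 45-degree line. A sketch with placeholder numbers:

```python
# Sketch of the per-model slopes in the figure: fit post-RL ratings against
# pre-RL ratings through the origin. The numbers below are placeholders.
import numpy as np

pre_rl = np.array([0.9, 1.4, 2.1, 0.6])    # harmful-stereotype rating per task, before RL (%)
post_rl = np.array([0.3, 0.4, 0.7, 0.2])   # same tasks, after RL (%)

slope = np.sum(pre_rl * post_rl) / np.sum(pre_rl ** 2)  # least-squares fit of post = slope * pre
print(f"slope = {slope:.2f}")  # well below 1: RL pushed ratings under the 45-degree line
```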
Conclusion
The absence of significant differences in overall quality metrics does not automatically imply that no bias exists. Bias can manifest in subtle, context-dependent ways that may only become apparent through a deeper analysis of individual interactions. Moreover, since the research was conducted by OpenAI itself, we should be cautious in interpreting the results.
As seen in other studies of LLMs (including Gemini 1.5 Pro, Claude 3 Opus, GPT-4o, and Llama3 70b), there are discrepancies in the representation of gender, race, and age, both in professional scenarios and in criminal contexts. Such skewed representations can feed into the associations AI models make with usernames. It is therefore important to be mindful of the data you share with chatbots. Information tied to your name, demographic traits, or personal preferences could influence future responses and potentially be stored in memory, affecting how the chatbot interacts with you over time.
Resources
Eloundou, T., Beutel, A., Robinson, D. G., Gu-Lemberg, K., Brakman, A.-L., Mishkin, P., Shah, M., Heidecke, J., Weng, L., & Kalai, A. T. (2024, October 15). First-person fairness in chatbots.
Mirza, V., Kulkarni, R., & Jadhav, A. (2024). Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios. arXiv preprint arXiv:2409.14583.