
Shared content may not reflect the policies of Tilburg University on the use of AI. 

Want to Understand AI? Turns Out, Attention Is All You Need!

Introduction

In 2017, researchers at Google published a paper titled “Attention is All You Need.” This paper introduced the Transformer, a new neural network architecture that has profoundly changed the field of artificial intelligence, especially Natural Language Processing (NLP). In this article, we explain, step by step and in understandable language, what the essence of this paper is and why it has had such a significant impact.

Prior to 2017, most language models processed text one word at a time, but then a team of researchers at Google introduced a new model known as the Transformer. Transformers don’t read text from start to finish; they take it all in at once, in parallel. The very first step inside a Transformer, and most other language models for that matter, is to associate each word with a long list of numbers. The reason for this is that the training process only works with continuous values, so language has to be encoded as numbers somehow, and each of these lists of numbers is meant to encode the meaning of the corresponding word.

What makes Transformers unique is their reliance on a special operation known as attention. This operation gives all of these lists of numbers a chance to talk to one another and refine the meanings they encode based on the surrounding context, all done in parallel. For example, the numbers encoding the word bank might be changed, based on the context surrounding it, to encode the more specific notion of a river bank.

To Grasp the Main Idea

In 2017, Google researchers published the paper “Attention is All You Need,” where they introduced the Transformer.

Before the Transformer, AI models often relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These earlier models helped computers interpret language to some extent, but they struggled with longer sequences of words or sentences. The issue was that when these models dealt with long sentences, they often lost critical information, leading to inaccurate or incomplete responses. Think of a model trying to read a sentence but remembering only the beginning and end, while losing track of the important middle section, which is essential for fully grasping the sentence’s meaning.

The Transformer architecture changed this by introducing the concept of self-attention. Unlike previous models that processed language word by word in a sequence, the Transformer can analyze all words in a sentence simultaneously and assess how each word relates to the others. This ability allows the Transformer to prioritize words based on their importance within the context of the sentence.

Consider a simple example: when you’re typing a message on your phone and you start the sentence with “I’m going to the…”, your phone might suggest “store” or “cinema” as the next word, based on the broader context. Models like the Transformer can predict such outcomes because they evaluate the entire sentence’s structure, not just the last few words. The self-attention mechanism allows the model to see the relationships between all the words, making predictions and language generation much more accurate and coherent.

Before the Transformer, many AI systems worked similarly to basic autocomplete: they would try to predict the next word in a sentence one step at a time, without fully understanding the entire sentence’s context. This led to awkward, often disjointed results because the model wasn’t able to hold onto key information from earlier parts of the sentence. With the Transformer, this limitation has been overcome. It can now more effectively capture the key components of a sentence and generate logical, fluent responses.

For example, if you start a sentence with: “The dog ate his food because…” a Transformer-based AI can logically continue the sentence with something like “he was hungry,” because the model understands the meaning of the entire sentence and identifies the relevant relationships between the words.

In short, the Transformer has significantly advanced the ability of AI models to process and generate language. It’s far more than an improved version of autocomplete: it equips AI models like ChatGPT with the capability to generate contextually appropriate, natural-sounding sentences, even in complex conversations. This was much harder to achieve before the Transformer architecture was introduced. Now, thanks to this technology, AI can better grasp the meaning and context of human language, without being restricted to predicting words in a linear, word-by-word manner.

To Look More In-Depth

The Problem with Traditional Models

Before the introduction of the Transformer, Recurrent Neural Networks (RNNs) were considered best practice for processing language. These models read text word by word, from start to finish. While this approach seems logical, RNNs struggle with long sentences because they can ‘forget’ important information from the beginning by the time they reach the end. This issue is known as the “vanishing gradient problem.”

The Solution: The Transformer

The Transformer changed this by introducing a new mechanism called self-attention. Instead of processing words one by one, the Transformer looks at all the words in a sentence simultaneously and understands how they are connected. This makes the model faster and better at understanding long texts.

From Input to Output: The Complete Transformer Process

Tokens: The Building Blocks of Text

Before a computer can process text, it must be broken down into smaller units that the machine can understand. These units are called tokens. A token can be a single word, a part of a word (subword), or even a single character, depending on the tokenization method used (the way a text is split into tokens). In natural language processing (NLP), tokens are the basic units of text that models, such as language models (e.g., GPT-4), use to process and generate language. Tokens dictate how the model processes the text: the model doesn’t “see” the full text but rather interacts with the tokens. Each token has a unique meaning for the model, and it uses these tokens to generate responses or understand the input.

Think of tokens as the building blocks for any text the model interacts with. The process of splitting text into tokens is called tokenization.

Example of Tokenization:

In this case, the model splits the sentence into 11 tokens, even though it contains 48 characters. This happens because some tokens are short (like “a”), and single punctuation marks (like spaces or periods) also count as tokens. It is also important to note that a token is not necessarily a word; a commonly used rule of thumb is that one token corresponds to approximately 0.75 words.

In another sentence: “Many words map to one token, but some don’t: indivisible.”

Here, 14 tokens are produced from 57 characters because the tokenization process breaks down longer, complex words like “indivisible” into smaller segments.
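
To make this concrete, here is a minimal sketch using OpenAI’s tiktoken library as one possible tokenizer (the article does not name the tokenizer behind these counts, so the exact number of tokens may differ slightly):

    # Minimal tokenization sketch; tiktoken is one possible tokenizer,
    # so the token count may differ slightly from the article's example.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
    text = "Many words map to one token, but some don't: indivisible."

    token_ids = enc.encode(text)
    print(len(text), "characters")                 # 57 characters
    print(len(token_ids), "tokens")                # token count depends on the tokenizer
    print([enc.decode([t]) for t in token_ids])    # the individual token strings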

Embeddings: Converting Words into Numbers with Meaning

After the text is divided into tokens, they need to be converted into a numerical form that the Transformer can process. This is done using embeddings. An embedding is a vector (a list of numbers), such as [0,0,1,1,0] or [66, 53, 43, 0, 12], which represents the meaning of a token in a numerical form.

Imagine a dictionary where each word is linked to a unique code. This code, the embedding, represents the meaning of the word. Words with similar meanings have similar codes.

Thus, words like “dog” and “cat” will have embeddings that are close to each other because they are both pets. On the other hand, words like “dog” and “table” will be further apart in the embedding space, as they have less in common with each other.

However, this way of representing words can also lead to bias in the model. For example, the term “top scorer” might be closer to “man” than to “woman” in the embedding space. As a result, the model might be inclined to give a man as an answer to the question “Who is the top scorer of all time?”, even though there are also female top scorers.

In the table example below, embeddings are used to represent different roles (King, Queen, Princess, Boy) in vector form, based on their properties (such as “Royal”, “Male”, “Female”, and “Age”).

By using this kind of representation, a machine can learn that “King” and “Queen” share similar properties but differ in gender, while a “Boy” is less likely to be associated with a royal title.
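
A small, illustrative Python sketch of this idea follows. The feature values below are made up for illustration; they are not taken from the article’s table or from a trained model:

    # Toy embeddings with made-up values for the features [Royal, Male, Female, Age].
    import numpy as np

    embeddings = {
        "King":     np.array([1.0, 1.0, 0.0, 0.8]),
        "Queen":    np.array([1.0, 0.0, 1.0, 0.8]),
        "Princess": np.array([1.0, 0.0, 1.0, 0.3]),
        "Boy":      np.array([0.0, 1.0, 0.0, 0.2]),
    }

    def cosine_similarity(a, b):
        # 1.0 means the vectors point in the same direction; lower means less related.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["King"], embeddings["Queen"]))      # fairly high: both royal adults
    print(cosine_similarity(embeddings["Queen"], embeddings["Princess"]))  # very high: both royal and female
    print(cosine_similarity(embeddings["Queen"], embeddings["Boy"]))       # low: little in common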

Encoder: Meaning Extraction and Encoding

Now that we have a sequence of numbers for each word in the sentence, we need to analyze them, and this is where the encoder comes into play. The encoder is responsible for analyzing the sequence of tokens (the input sentence or text) and extracting its meaning. Simply put, an encoder is an algorithm that converts an input sequence of our embeddings into another representation, called an encoding. This encoding summarizes the essence of the input in a form that a computer can understand and use for various tasks.

Think of an encoder as a translator that distills a complex story into a concise summary without losing the essence.

Encoders Before the “Attention Is All You Need” Paper

Before the advent of the Transformer architecture, described in the paper “Attention Is All You Need,” Recurrent Neural Networks (RNNs) were the standard for encoders in NLP tasks. RNNs process text sequentially, word by word. Although they can learn context from previous words, they struggle with long sentences. This is due to the “vanishing gradient problem,” where the information from earlier words fades as the sentence gets longer.

Disadvantages of pre-transformer encoders:

  • Sequential processing: Slow processing speed, especially with long sentences.
  • Limited context: Difficulty in capturing long-distance relationships between words.
  • Vanishing gradient problem: Loss of information from earlier words in long sentences.

The Paper “Attention Is All You Need” and its Implications

The paper “Attention Is All You Need” introduced a revolutionary architecture: the Transformer. The Transformer introduced a new mechanism called self-attention, which radically changed the way encoders work.

Multi-Head Attention: Different Perspectives on the Input

Instead of using a single attention mechanism, Multi-Head Attention splits the input data into a number of smaller parts, or in more technical jargon, projects it into several more manageable subspaces.

Think of a team of translators working together on a complex text. Each translator specializes in a particular field or writing style. By combining their expertise, they produce a more accurate and nuanced translation than a single translator could.

Advantages of Multi-Head Attention:

  • Richer representations: By analyzing the input from different perspectives, Multi-Head Attention uncovers complex relationships and nuances that might be overlooked with a single attention mechanism.
  • Parallel processing: The attention heads work simultaneously, significantly increasing processing speed.
  • Robustness: Even if one attention head captures less relevant information, the other heads can compensate for this, making the model more robust.
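
The “splitting into subspaces” idea can be sketched as follows. The sizes (a 512-dimensional representation split over 8 heads of 64 dimensions each) are the ones used in the original paper; the random matrix below merely stands in for a learned projection:

    # Sketch of how Multi-Head Attention splits one representation into subspaces.
    import numpy as np

    d_model, num_heads = 512, 8           # sizes from the original paper
    d_head = d_model // num_heads         # 64 dimensions per head

    seq_len = 10
    x = np.random.randn(seq_len, d_model)        # token representations (placeholder)
    W_q = np.random.randn(d_model, d_model)      # stands in for a learned projection matrix

    queries = x @ W_q                                          # (10, 512)
    queries_per_head = queries.reshape(seq_len, num_heads, d_head)
    print(queries_per_head.shape)                              # (10, 8, 64): 8 "perspectives" per token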

Scaled Dot-Product Attention: Measuring similarity, weighing relevance

Now that we have a number of subspaces, each is analyzed by a separate Scaled Dot-Product Attention mechanism, an “attention head,” which focuses on specific aspects or patterns in the data.

Imagine you are looking for a specific book in a library. You have a vague description (the query) and need to compare it with the titles of all books (the keys) to find the right book. Scaled Dot-Product Attention works in a similar way.

This mechanism calculates the relevance between words (or, in more technical jargon, between tokens in a sequence) using dot products. The dot products measure the “match” between a query vector and a series of key vectors.

  • The query vector represents the question that the model asks about a specific token. Imagine that the model is trying to determine how relevant a particular word is for understanding the sentence as a whole. The query vector reflects this question.
  • The key vectors, one for each token in the sequence, serve as labels or reference points for the words. The key vector of a word contains information about the meaning and context of that word.

The Scaled Dot-Product Attention mechanism calculates the similarity between the query vector and each of the key vectors by computing their dot product. The larger the dot product, the more similar the query and the corresponding key are. In other words, the more relevant that token (a word or part of a word) is within the sentence.

Think of a search engine that ranks a list of relevant web pages based on your search query. The higher the match between your search query and the content of a page, the higher it appears in the results.

Full Operation:

  1. Calculate dot products: For each word, the similarity with every other word in the sequence is determined by computing the dot products between the corresponding query and key vectors. This step helps in assessing how much focus each word should receive in relation to the current word being processed.
  2. Scaling: To ensure that the dot products do not become excessively large, especially in higher-dimensional spaces, they are divided by the square root of the dimension of the keys. This scaling process is crucial and is the reason behind the term “Scaled” in Scaled Dot-Product Attention.
  3. Calculate attention weights: The scaled dot products are then normalized (using the softmax function), which converts them into values between 0 and 1. These normalized values, known as attention weights, reflect the importance or relevance of each word with respect to the current query.
  4. Construct context vector: The attention weights are used to weigh the corresponding value vectors, and these weighted vectors are summed to form a context vector. This context vector encapsulates the model’s “attention,” concentrating on the most pertinent parts of the input sequence.
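
Written out in code, the four steps above look roughly like this. It is a minimal NumPy sketch; Q, K, and V are random placeholders for the learned query, key, and value projections of a real model:

    # Scaled Dot-Product Attention, following the four steps above.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # steps 1-2: dot products, then scaling by sqrt(d_k)
        weights = softmax(scores)          # step 3: attention weights between 0 and 1
        context = weights @ V              # step 4: weighted sum of the value vectors
        return context, weights

    seq_len, d_k = 5, 64
    Q = np.random.randn(seq_len, d_k)      # query vectors (placeholders)
    K = np.random.randn(seq_len, d_k)      # key vectors (placeholders)
    V = np.random.randn(seq_len, d_k)      # value vectors (placeholders)

    context, weights = scaled_dot_product_attention(Q, K, V)
    print(context.shape, weights.shape)    # (5, 64) and (5, 5)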

Positional Encodings: Order in a Parallel World

The Transformer processes all words in a sentence simultaneously. This parallel nature increases efficiency but sacrifices the sequential information inherent in language. To address this, the Transformer introduces positional encodings.
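
One concrete scheme, proposed in the original paper, is the sinusoidal positional encoding: each position is turned into a vector of sines and cosines at different frequencies, which is added to the token embedding. A minimal sketch:

    # Sinusoidal positional encodings as defined in the original paper:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    import numpy as np

    def positional_encoding(max_len, d_model):
        positions = np.arange(max_len)[:, None]          # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]         # even embedding dimensions
        angles = positions / np.power(10000, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                     # sines on even dimensions
        pe[:, 1::2] = np.cos(angles)                     # cosines on odd dimensions
        return pe

    pe = positional_encoding(max_len=50, d_model=512)
    print(pe.shape)    # (50, 512): one position vector to add to each token embedding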

The combination of Multi-Head Attention, Scaled Dot-Product Attention, and Positional Encodings forms the core of the Transformer. By combining these mechanisms, the Transformer can handle sequential data in a parallel, context-sensitive, and position-aware manner. This has resulted in groundbreaking outcomes in various NLP tasks and has opened the door to new possibilities in the world of artificial intelligence.

Self-attention in the Transformer encoder analyzes all tokens in the input simultaneously, rather than sequentially. This means that the encoder can understand the relationships between all words in a sentence at once, regardless of their position. This parallel processing significantly increases processing speed, especially with long prompts. Moreover, the Transformer is more resistant to the “vanishing gradient problem” due to its ability to directly establish relationships between distant words.

Decoder: From Code to Coherent Output

The decoder takes the encoded representation from the encoder as input and uses it to generate the final output (for example, a translation, summary, or answer to a question). The decoder also uses self-attention and feedforward networks, as well as a special type of attention called “encoder-decoder attention,” to extract relevant information from the encoded representation.

Step-by-Step Operation of the Decoder

  1. The decoder receives the encoded representation from the encoder as input.
  2. Using self-attention, the decoder analyzes the relationships between the tokens that have already been generated in the output.
  3. Encoder-decoder attention is used to extract relevant information from the encoded representation of the encoder.
  4. Feedforward networks further refine the representations of the tokens.
  5. This process is repeated until the decoder generates a special “end-of-sequence” token, indicating that the output is complete.
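
In code, the generation loop of step 5 looks roughly as follows. This is a toy sketch: toy_decoder_step is a made-up stand-in for the real decoder layers (self-attention, encoder-decoder attention, and feedforward networks), which would normally predict the most likely next token:

    # Toy sketch of the decoder's generation loop (step 5 above).
    END_OF_SEQUENCE = "<eos>"

    def toy_decoder_step(encoded_input, generated_so_far):
        # Stand-in for the real decoder: returns a canned continuation of
        # "The dog ate his food because..." instead of predicting tokens.
        canned = ["he", "was", "hungry", END_OF_SEQUENCE]
        return canned[len(generated_so_far)]

    def generate(encoded_input, max_tokens=50):
        output = []
        for _ in range(max_tokens):
            next_token = toy_decoder_step(encoded_input, output)
            if next_token == END_OF_SEQUENCE:      # the special token that marks completion
                break
            output.append(next_token)
        return " ".join(output)

    print(generate(encoded_input=None))            # -> "he was hungry"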

Conclusion

The introduction of the Transformer and the attention mechanism has led to advancements in NLP applications. Models like GPT-3 and GPT-4 (on which ChatGPT is based) have been made possible by this architecture. They are capable of understanding and generating long texts, making translations, writing summaries, and answering questions with a level of coherence and context understanding that was previously not possible.

The paper “Attention is All You Need” has changed how natural language is processed by computers. By using self-attention and the Transformer architecture, models can handle complex language patterns more efficiently and effectively. This has opened the door to advanced applications in artificial intelligence and has improved how we communicate with technology.

Resources

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf