Transformers:
- Transformers are a deep learning model architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). A transformer turns an input sequence into an output sequence by learning context and tracking relationships between the components of the sequence.
- They are designed to handle sequential data, such as text, but unlike previous models (like RNNs or LSTMs), transformers don’t process the input one token at a time. Instead, they use a mechanism called self-attention to weigh the importance of different words in a sentence regardless of their position.
- The self-attention mechanism allows each token (word or subword) in the input sequence to attend to every other token in the sequence, which helps the model understand context more effectively.
- Autoregressive generation: the model produces output one token at a time, feeding each generated token back in as input for the next step (a sampling sketch appears at the end of these notes).
- Feed-forward and self-attention: in a transformer block, self-attention lets the model focus on different parts of the input sequence by computing relationships between its elements, while the feed-forward layer is a fully connected network that further processes the output of the self-attention layer, adding non-linearity so the model can learn more complex patterns. In short, self-attention provides context-aware representations, and the feed-forward layer refines those representations with non-linear transformations.
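Below is a minimal NumPy sketch of a single self-attention step followed by a feed-forward layer, illustrating the two components just described. The weights are random and the shapes (4 tokens, 8-dimensional embeddings) are made up for clarity; residual connections, layer normalization, and multi-head attention are omitted.

```python
# Illustrative single-head self-attention + feed-forward block in NumPy.
# Weights are random; shapes are chosen for clarity, not from any real model.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32   # 4 tokens, 8-dim embeddings, 32-dim hidden layer

x = rng.normal(size=(seq_len, d_model))          # token embeddings for one sequence

# Self-attention: project inputs to queries, keys, values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # how much each token attends to every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
attended = weights @ V                           # context-aware representations

# Position-wise feed-forward: two linear layers with a ReLU non-linearity.
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
out = np.maximum(0, attended @ W1) @ W2

print(weights.round(2))   # each row sums to 1: attention over the 4 tokens
print(out.shape)          # (4, 8): one refined vector per input token
```

Each row of `weights` says how strongly that token attends to every token in the sequence; the feed-forward layer then refines each attended vector independently.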
Before Transformers:
- Early deep learning models for natural language processing (NLP) aimed to get computers to understand and respond to natural human language. They guessed the next word in a sequence based on the previous word.
- To understand better, consider the autocomplete feature in your smartphone. It makes suggestions based on the frequency of word pairs that you type. For example, if you frequently type "I am fine," your phone suggests "fine" after you type "am."
- Early machine learning (ML) models applied similar technology on a broader scale. They mapped the relationship frequency between different word pairs or word groups in their training data set and tried to guess the next word. However, early technology couldn’t retain context beyond a certain input length. For example, an early ML model couldn’t generate a meaningful paragraph because it couldn’t retain context between the first and last sentence in a paragraph.
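As a rough illustration of the word-pair idea above, here is a toy bigram counter with a made-up corpus. Real models of that era were more sophisticated, but the core idea of predicting the next word from observed pair frequencies is similar.

```python
# Toy bigram ("word pair") model: count which word follows which, then
# predict the most frequent follower. Corpus and example are made up.
from collections import Counter, defaultdict

corpus = "i am fine . i am happy . i am fine today".split()

follower_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follower_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word` in the corpus."""
    followers = follower_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("am"))   # 'fine' -- seen twice after 'am', vs. 'happy' once
```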
Tokenization:
- Tokenization is the process of breaking down text into smaller chunks, called tokens. These tokens can be words, subwords, or characters, depending on the granularity chosen.
- In modern NLP models like GPT or BERT, subword tokenization (e.g., using methods like Byte Pair Encoding (BPE) or SentencePiece) is commonly used because it balances word-level and character-level granularity, capturing a wide range of linguistic patterns.
- For example, the word "unhappiness" might be tokenized into ["un", "happiness"] or further into smaller subword units like ["un", "happi", "ness"] (see the toy tokenizer sketch after this list).
- Tokenization is crucial for transforming human-readable text into a format that a machine learning model can process.
- The token is the unit of operation for an LLM, and LLM usage is typically metered and billed per token.
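Here is a toy sketch of greedy longest-match subword tokenization, using a hand-picked vocabulary purely for illustration; real tokenizers such as BPE or SentencePiece learn their vocabularies from large corpora and use different merge rules.

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is hand-picked
# for illustration; real BPE/SentencePiece vocabularies are learned from data.
VOCAB = {"un", "happi", "ness", "happiness", "h", "a", "p", "i", "n", "e", "s", "u"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

print(tokenize("unhappiness"))   # ['un', 'happiness']
```

Dropping "happiness" from this toy vocabulary would instead yield ["un", "happi", "ness"], the finer split mentioned above.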
Embedding:
- Embeddings are dense, relatively low-dimensional vector representations of tokens that capture semantic relationships between them (a small lookup sketch follows this list).
- In traditional NLP, each word might be represented by a unique one-hot vector, but embeddings allow words with similar meanings to have similar vector representations. For example, "king" and "queen" would be close in the embedding space.
- The transformer model typically uses positional embeddings in addition to token embeddings. Since transformers don’t inherently process data sequentially, positional embeddings provide information about the order of tokens in the input sequence.
- The embeddings are learned during the training process, and they evolve to capture semantic, syntactic, and contextual relationships between tokens.
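A minimal sketch of how token and positional embeddings combine, assuming made-up table sizes and random values; in a real model both tables are learned parameters.

```python
# Illustrative embedding lookup in NumPy: token IDs index an embedding table,
# and positional embeddings are added so the model knows token order.
# Vocabulary size, dimensions, and values are made up for clarity.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8

token_embeddings = rng.normal(size=(vocab_size, d_model))      # one row per token ID
position_embeddings = rng.normal(size=(max_len, d_model))      # one row per position

token_ids = np.array([42, 7, 99])                              # e.g. output of a tokenizer
positions = np.arange(len(token_ids))                          # 0, 1, 2

x = token_embeddings[token_ids] + position_embeddings[positions]
print(x.shape)   # (3, 8): one d_model-dimensional vector per input token
```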
How They Work Together:
- When text is fed into a transformer model, it first undergoes tokenization (breaking the text into tokens).
- These tokens are then mapped to their respective embeddings (vectors).
- The transformer model processes these embeddings using the self-attention mechanism to capture the relationships and contextual meaning of tokens in the sequence.
- The output embeddings can be used for tasks like text generation, classification, or translation.
- The output of an LLM at each step is a probability distribution over its vocabulary; the next token is sampled or selected from that distribution (see the sketch below).
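The sketch below shows the autoregressive loop and the softmax step that turns logits into a probability distribution. `fake_model` is a hypothetical stand-in for a real transformer and just returns random logits; in practice the logits depend on the tokens seen so far.

```python
# Sketch of autoregressive decoding: at each step the model returns logits
# over the vocabulary, softmax turns them into a probability distribution,
# and the sampled token is appended to the input.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50

def fake_model(token_ids):
    """Hypothetical stand-in for a transformer: returns logits for the next token."""
    return rng.normal(size=vocab_size)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = [1, 2, 3]                       # prompt, already tokenized
for _ in range(5):                       # generate 5 tokens, one at a time
    probs = softmax(fake_model(tokens))  # probability distribution over the vocab
    next_token = rng.choice(vocab_size, p=probs)
    tokens.append(int(next_token))

print(tokens)   # prompt followed by 5 sampled token IDs
```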