Sunday, April 5, 2026

2.2 Intro to Transformers and NLP

2.2 Gen AI: AI Foundations

Overview

  • Course: AI LITERACY — covering fundamentals of AI, differences between ML and Deep Learning, and applications of Neural Networks

  • Objectives:

    1. Understand fundamentals of AI

    2. Differentiate ML vs. Deep Learning

    3. Apply knowledge of Neural Networks and Deep Learning


1 Overview

What is AI?

  • Branch of Computer Science — performs tasks that usually require human intelligence

  • Tasks include: reasoning, learning, problem solving, perception

  • Hierarchy: AI > ML > Deep Learning

Evolution of AI

  • 1950s–1980s: Rule-based AI (if-then logic)

  • 1990s–2010s: Machine Learning era — learn from data, statistical models

  • 2010s–now: Deep Learning — neural networks, self-learning

  • 2020s–now: Generative AI revolution — AI creates text, images, and code

Types of AI

  • Narrow AI (Weak AI): chatbots and recommendation systems for specific tasks

  • General AI (Strong AI): hypothetical AI with human-like intelligence and reasoning

  • Super Intelligent AI: theoretical AI surpassing human intelligence

Key Components of AI

  • Computer Vision

  • Robotics and automation

  • Edge AI and IoT integration

  • Ethics and responsible AI

  • AI frameworks and tools

Real-World Applications

  • Virtual assistants and chatbots

  • AI in education

  • Finance and fraud detection

  • Healthcare and medical imaging

  • Recommendation systems

  • Smart home and IoT

  • Autonomous vehicles

Benefits of AI

  • Automation and efficiency

  • Enhanced decision-making

  • Personalization

  • Improved accuracy

  • 24/7 availability

  • Cost savings

  • Enhanced security

Challenges of AI

  • Data privacy and security

  • High implementation costs

  • Job displacement

  • Bias and fairness

  • Lack of transparency

  • Dependence on quality data

Why AI is Powerful

  • Enables innovation and new business models

  • Scalability and global reach

  • Agility, faster decision-making, and competitive advantage


Rise of Machine Learning

  • Shift from rule-based to data-driven approaches

  • From expert systems to statistical learning

  • Advancement in algorithms

  • Neural network revival

Types of Machine Learning

  • Supervised Learning

    • Uses labeled data for training — each item has a predefined label or tag

    • Model learns from examples: given inputs paired with expected outputs, it learns to predict the output for new inputs that have no label

    • Example: classifying emails as spam or not spam using labeled email data

  • Semi-Supervised Learning

    • Combines a small amount of labeled data with a large amount of unlabeled data

    • Improves model efficiency without requiring fully labeled datasets

    • Example: sorting large image libraries into landscape and portrait using a few labeled images

  • Unsupervised Learning

    • Analyzes and clusters unlabeled data to uncover hidden patterns and groupings

    • No predefined labels — model figures out its own clusters

    • Example: grouping customers into segments based on purchasing behavior (customer segmentation)

  • Reinforcement Learning (ref: GeeksforGeeks)

    • Learns through trial and error — receives rewards (positive value) or penalties (negative value) for specific actions

    • Goal of the algorithm: maximize cumulative reward

    • Example: training a self-driving car — penalize running a red light (negative value), reward stopping (positive value)

    • Example: robot vacuum cleaner navigating a room by avoiding obstacles

    • Key components:

      • Agent — decision-maker that performs actions

      • Environment — world or system in which the agent operates

      • State — current situation or condition of the agent

      • Action — moves the agent can make

      • Reward — feedback or result from the environment based on the agent’s action

    • Class Q&A — can you reverse the reward system (reward bad actions)?

      • Technically yes — it’s a mathematical function, can be designed either way

      • Practical use case: cybersecurity bad actors, reverse-engineering normal system behavior

      • No practical reason to do this in standard applications
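The reinforcement learning components above (agent, environment, state, action, reward) can be sketched with tabular Q-learning, one common trial-and-error algorithm. This is a minimal illustration, not something shown in class: the five-cell corridor, the reward values, and the hyperparameters are all assumptions made for the example.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells with the goal at the right end.
N_STATES = 5            # states 0..4
GOAL = N_STATES - 1     # reaching state 4 ends the episode
ACTIONS = [-1, +1]      # action index 0 = move left, 1 = move right

def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    # Positive reward for reaching the goal, small penalty for every other move.
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Agent: explore with probability epsilon, otherwise take the best-known action.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:GOAL]]
print(policy)
```

With these settings the learned policy moves right in every cell, which is the behaviour that maximizes cumulative reward here. Flipping the sign of the rewards in `step` would train the opposite behaviour, which is the point of the class Q&A above.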

Applications of ML in Business Operations

  • Sales forecasting

  • Supply chain optimization — logistics efficiency, inventory forecasting, mitigating supply risk

  • Customer segmentation — grouping customers by purchase size, region, product type, demographics, age

  • Churn prediction — identify customers likely to leave, enabling proactive outreach

  • Fraud detection — e.g., banks detecting unusual card activity and flagging or blocking transactions

  • HR analytics — workforce planning, employee performance

Netflix Case Study

  • Challenge (2000s–early 2010s): high customer churn as streaming competitors emerged

  • Data leveraged: viewing history, ratings, search queries, regional preferences

  • Solution: ML-powered recommendation system

    • Collaborative filtering — recommends based on what similar user segments watch

    • Content-based filtering — recommends based on what the individual user has previously watched

    • Regional content strategy — different content libraries tailored by geography

  • Outcome: reduced churn by personalizing the experience to match user preferences
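The difference between collaborative and content-based filtering can be sketched on a tiny data set. This is a hedged illustration only: the titles, users, ratings, and genre features are invented, and it does not describe Netflix's actual system.

```python
import math

# Hypothetical data for illustration; a rating of 0 means "not watched yet".
titles = ["Space Saga", "Robot Wars", "Love Story", "Wedding Bells"]
genres = {  # hypothetical content features: (sci-fi, romance)
    "Space Saga": (1, 0), "Robot Wars": (1, 0),
    "Love Story": (0, 1), "Wedding Bells": (0, 1),
}
ratings = {
    "ana":  [5, 0, 1, 0],
    "ben":  [5, 4, 1, 1],
    "cara": [1, 1, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def collaborative(user):
    """Recommend the unseen title that the most similar other user rated highest."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = [i for i, r in enumerate(ratings[user]) if r == 0]
    return titles[max(unseen, key=lambda i: ratings[nearest][i])]

def content_based(user):
    """Recommend the unseen title whose genres best match the user's own history."""
    mine = ratings[user]
    # Taste profile: genre vectors weighted by this user's own ratings.
    profile = [sum(mine[i] * genres[t][g] for i, t in enumerate(titles))
               for g in range(2)]
    unseen = [t for i, t in enumerate(titles) if mine[i] == 0]
    return max(unseen, key=lambda t: cosine(profile, genres[t]))

print(collaborative("ana"), "|", content_based("ana"))
```

Collaborative filtering only looks at other users' ratings, while content-based filtering only looks at the features of what this user already watched; production systems typically blend both.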


Deep Learning

  • Subset of Machine Learning using neural networks with vast amounts of data

  • Loosely inspired by how neurons in the human brain connect and pass signals

  • Works with both structured and unstructured data (images, text, audio, video)

  • Often outperforms traditional ML on large, complex datasets by stacking many neural network layers

  • Extracts complex features and achieves higher accuracy

  • Feature engineering is automated (unlike ML, which requires manual feature engineering)

  • Requires high compute — GPUs and large RAM for training

Amazon Alexa Case Study

  • Challenge: build robust speech recognition handling diverse languages and accents (launched 2014)

  • Required: transcribe voice data → text → take action (control smart home, answer queries)

  • Solution: leveraged Deep Learning

    • Recurrent Neural Networks (RNNs)

    • Deep Neural Networks (DNNs)

  • Outcome: real-time voice processing understanding context and intent across global accents

Deep Learning vs. Machine Learning

| Aspect | Deep Learning | Machine Learning |
|---|---|---|
| Scope | Subset of ML, focuses on training deep neural networks | Broad field of training algorithms |
| Data type | Excels with unstructured data (images, audio, video) | Works with structured and unstructured data |
| Feature engineering | Automated | Manual; performance depends on quality of engineered features |
| Compute | Requires GPUs and large RAM | Runs on standard CPU |

ML and Deep Learning Applications

  • Healthcare: medical imaging, drug discovery, personalised treatment, detecting eye diseases (e.g., Google DeepMind)

  • Finance and banking: fraud detection, financial forecasting, stock trading bots

  • Automotive and transportation: self-driving vehicles

  • Agriculture: crop monitoring and optimisation


Neural Networks

  • Inspired by the human brain — billions of neurons connecting and communicating

  • Deep Learning mimics this with artificial neurons (mathematical nodes) processing data through layers

Key Neural Network Components

  • Input layer: receives raw data (images, text, numbers)

  • Hidden layer: performs computations and extracts patterns using weights and activation functions

    • Depth of neural network = number of hidden layers

    • More hidden layers → more complex patterns uncovered → more compute required

    • Large Language Models stack dozens of layers holding millions to billions of weights (parameters), which is why they are expensive to train

  • Output layer: produces predictions or classifications based on learned patterns

  • Weights: determine strength of connections between neurons

  • Activation functions: determine whether a neuron activates by transforming input signals

How Neural Networks Work

  • Forward propagation: data moves through layers to produce an output

  • Backpropagation: adjusts weights based on errors — weight update process

    • Example: loan/credit card application

      • Inputs: age, income, zip code, education level

      • Weights initialized (e.g., income = 0.6, age = 0.3, education level = 0.2)

      • Model updates weights during training to reflect actual influence on output (approved/not approved)

      • Loss function: measures the gap between the predicted and the actual outcome (approved/not approved); training adjusts the weights to minimize it

  • Activation functions: determine if a neuron should activate

    • Types: softmax, tanh, ReLU, sigmoid — used depending on the use case
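Forward propagation, the loss, and the weight update can be sketched with a single sigmoid neuron trained by gradient descent on the loan example. This is a minimal sketch under stated assumptions: the four applicants, the normalized feature values, and the learning rate are hypothetical, and a real network would have hidden layers and use a library.

```python
import math

def sigmoid(z):
    """Activation function: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, normalized applicants: (income, age, education level) -> approved (1) or not (0)
data = [
    ((0.9, 0.5, 0.8), 1),
    ((0.8, 0.3, 0.6), 1),
    ((0.2, 0.6, 0.4), 0),
    ((0.1, 0.2, 0.3), 0),
]

weights = [0.6, 0.3, 0.2]   # illustrative starting weights
bias = 0.0
learning_rate = 0.5         # assumed value

def forward(x):
    """Forward propagation: weighted sum of inputs, then activation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def loss():
    """Binary cross-entropy averaged over the data set."""
    total = 0.0
    for x, y in data:
        p = forward(x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

loss_before = loss()
for _ in range(500):
    for x, y in data:
        error = forward(x) - y   # gradient of the loss with respect to the weighted sum
        # Weight update: nudge each weight against its contribution to the error.
        for i in range(len(weights)):
            weights[i] -= learning_rate * error * x[i]
        bias -= learning_rate * error
loss_after = loss()

print(round(loss_before, 3), "->", round(loss_after, 3))
```

After training, the income weight grows relative to the others because income is the feature that separates approved from rejected applicants in this made-up data.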

Types of Neural Networks

  • Artificial Neural Networks (ANNs) — complex data patterns

  • Deep Neural Networks — multiple layers of nodes, large-scale datasets

  • Recurrent Neural Networks (RNNs) — sequential data; limitation: short-term memory (forgets early context in long sequences)

    • LSTM (Long Short-Term Memory) — improved version of RNN for time-series data

  • Convolutional Neural Networks (CNNs) — image processing (e.g., facial recognition, detecting pedestrians)

Choosing ML vs. Deep Learning

Key decision factors:

  1. Problem type — classification, clustering, regression?

  2. Volume of data — small → ML; large → Deep Learning

  3. Data type — image → CNN; sequential/time-series → RNN/LSTM; text/audio → transformer-based models

  4. Compute resources available

  5. Balance of dataset — imbalanced data (e.g., spam vs. not spam) → XGBoost/gradient boosting recommended

  • Practical approach: train multiple models and compare performance

ML Algorithms and Libraries

  • Frameworks and libraries already exist in Python — no need to build from scratch

  • Example: import xgboost → instantiate → build and train your own model

  • Common supervised learning algorithms: linear regression, logistic regression, decision tree, random forest, support vector machine, K-nearest neighbour, gradient boosting (XGBoost)

  • XGBoost: go-to for both regression and classification, handles imbalanced datasets well


2 Intro to Transformers

Engagement Prompt

  • Scenario: product manager at a growing tech company exploring AI to improve customer engagement

  • Competitors already leveraging: AI-powered chatbots, personalised recommendations, automated content

  • Key models introduced: BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), T5

Lesson Objectives

  1. Explain fundamental concepts of transformers and NLP — architecture, components, state-of-the-art applications

  2. Analyse how transformers process text using tokenization, embeddings, positional encoding

  3. Apply NLP techniques: classification, sentiment analysis, chatbots

  4. Evaluate the impact of advanced AI models on industries


Attention Mechanism

  • Foundational paper: “Attention Is All You Need” — Vaswani et al., Google, 2017

  • Transformers do not process data sequentially — analyse all words at once using self-attention

  • Attention mechanism helps transformers pay attention to the most relevant words in a sentence, even if far apart

  • Solves the core limitation of RNNs: forgetting early context in long sequences

Attention Mechanism Types

  • Soft attention

  • Hard attention

  • Self-attention ← primary focus

  • Multi-head attention ← primary focus

  • Encoder-decoder attention

  • Hierarchical attention


Intro to Transformer Models

  • Type of deep learning model that uses self-attention to process all elements of a sequence at the same time

  • RNNs process data sequentially — each step handled one after another

  • Transformers process all elements simultaneously — enabled by self-attention and positional encoding

Self-Attention

  • Key component in NLP — enables the network to focus on specific words or phrases to improve context understanding

  • Word order still matters (e.g., reversing words in a sentence breaks meaning)

  • Positional encoding preserves word order without sequential processing

  • Analogy: reading a book

    • You don’t memorise every sentence — you pay attention to key elements

    • You can summarise and even predict the ending based on themes and flow

    • Transformers do the same — attend to the most relevant parts of the input

Mechanics Behind Self-Attention

Self-attention layer calculates 3 vectors from each encoder input vector:

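  1. Query Vector (Q): what the current word is looking for in the other words

  2. Key Vector (K): what each word offers for matching; comparing a query with a key scores how relevant that word is

  3. Value Vector (V): the actual content of each word, which is weighted by the attention scores to generate the final output

  • Similarity between a query and each key is measured with a dot product scaled by $\sqrt{d_k}$ (cosine similarity and Euclidean distance are other general similarity measures, but the original transformer uses the scaled dot product)

  • During training, the weight matrices that produce Q, K, and V are learned and updated iteratively

  • Equation for the attention output: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$ (softmax turns the scaled scores into weights, and those weights are then applied to V)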
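Scaled dot-product attention as defined in "Attention Is All You Need" can be sketched in plain Python. The vectors below are made-up 2-dimensional examples; in a real model Q, K, and V come from learned projections of the token embeddings.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score this query against every key with a scaled dot product.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output = weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Hypothetical vectors for a three-word sentence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence; the rows sum to 1 because of the softmax.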

Self-Attention Analogy — Party

  • Listening: each person (data point / word in sentence) listens to stories (inputs) of others in the room (sequence)

  • Scoring: assign a score to each storyteller based on relevance of their story (query-key matching)

  • Focusing: more attention to stories with higher scores

  • Combining: create a summary weighted by how much attention was paid to each person (weighted sum of values)

  • Your total experience from the event = sum of all interactions — same as transformer’s final output


Transformer Model Architecture

Reference: Attention Is All You Need — arxiv.org/pdf/1706.03762

  • Original transformer architecture = encoder + decoder

  • Some Large Language Models use only encoder (e.g., BERT), only decoder (e.g., GPT), or both (e.g., T5)


🧠 FULL FLOW (NOW WITH TRANSLATION)

👉 Input: “I love pizza” 👉 Output: French: “J’aime la pizza”


🔹 PART 1: INPUT (Encoder side)

1. Tokenization

“I love pizza” → [“I”, “love”, “pizza”]

  • Tokenization = breaking input into smaller chunks (tokens can be words, sub-words, or characters)

2. Embedding + Position

  • “I” → position 1

  • “love” → position 2

  • “pizza” → position 3

  • Each token converted to a vector (a multi-dimensional array of numbers, the language that models understand)

  • Vector encodes both: the meaning of the word + the position it occupies in the sentence

  • Example: “bank” means different things in “river bank” vs. “deposit money in the bank”; the attention layers that follow refine each token’s vector using the surrounding words, so the final representation captures the contextual meaning
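Steps 1 and 2 can be sketched end to end: tokenize, look up an embedding, and add a positional encoding. The sinusoidal formula is the one from "Attention Is All You Need"; everything else is an assumption for illustration, including the whitespace tokenizer, the 8-dimensional vectors, and the random (untrained) embedding table. Positions are 0-based in the code, whereas the notes count from 1.

```python
import math
import random

def tokenize(text):
    """Whitespace tokenizer; real models usually use sub-word tokenizers."""
    return text.split()

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

D_MODEL = 8   # assumed vector size; the original paper uses 512
tokens = tokenize("I love pizza")
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Untrained stand-in for a learned embedding table: one random vector per vocabulary entry.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in vocab]

# Encoder input = token embedding + positional encoding, element by element.
inputs = []
for pos, tok in enumerate(tokens):
    emb = embedding_table[vocab[tok]]
    pe = positional_encoding(pos, D_MODEL)
    inputs.append([e + p for e, p in zip(emb, pe)])

print(tokens, vocab)
print(len(inputs), "vectors of size", len(inputs[0]))
```

Because the position is added into the vector itself, the model can process all tokens at once and still tell “I love pizza” apart from “pizza love I”.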


🔹 PART 2: ENCODER (Understand English)

3. Multi-Head Attention

✅ Self-Attention

  • Model figures out meaning relationships:

    • “love” → connects I ↔ pizza

    • “I” → subject

    • “pizza” → object

  • Each word pays attention to all other words in the sentence simultaneously

  • Multi-head = multiple attention heads running in parallel, each capturing different relationships
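Multi-head attention can be sketched as several independent attention computations whose results are concatenated and mixed by an output projection. This is a rough sketch: the projection matrices here are random and untrained, and the sizes (3 tokens, 4 dimensions, 2 heads) are assumptions for illustration.

```python
import math
import random

def matmul(A, B):
    """Plain matrix multiplication for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Untrained stand-in for a learned projection matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head has its own projections, so it can capture a different relationship.
        Wq, Wk, Wv = (random_matrix(d_model, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)))
    # Concatenate the heads token by token, then mix them with an output projection.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    Wo = random_matrix(d_model, d_model, rng)
    return matmul(concat, Wo)

# Hypothetical embeddings for "I", "love", "pizza".
X = [[0.1, 0.3, -0.2, 0.5], [0.7, -0.1, 0.4, 0.0], [-0.3, 0.2, 0.6, 0.1]]
out = multi_head_attention(X, num_heads=2, rng=random.Random(0))
print(len(out), "tokens x", len(out[0]), "dimensions")
```

The output has the same shape as the input, which is what lets the block be stacked N times.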

4. Add & Norm

  • Add = residual connection (the sub-layer’s input is added back to its output); Norm = layer normalisation, applied after each sub-layer

  • Controls divergence as data moves from one layer to the next

  • Prevents values from drifting too far between layers
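The Add & Norm step can be sketched as a residual addition followed by layer normalisation. This is a simplified version: real implementations also apply a learned scale and shift after normalising, which is omitted here, and the example vectors are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one token's vector to mean 0 and (approximately) variance 1."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Add: residual connection (input + sub-layer output). Norm: layer normalisation."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]               # hypothetical input to the sub-layer
sublayer_out = [0.5, -0.5, 1.0, 0.0]   # hypothetical attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

Keeping every vector on the same scale is what stops values from drifting as they pass through many stacked blocks.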

5. Feed Forward

  • Position-wise feed-forward neural network

  • Forward propagation: processes input moving from one layer to the next

  • Each layer refines the representation — intermediate outputs progressively refined

6. Add & Norm

  • Layer normalisation applied again after feed-forward

  • Refines understanding further

🔁 Repeat (Nx)

  • Entire encoder block repeated N times (N = 6 in the original paper; large models stack dozens of transformer blocks)

  • Final meaning stored as context vectors — rich representations containing word meaning + position


🔹 PART 3: DECODER (Generate French)

7. Start Output

  • Decoder begins with a start token

  • Begins generating: “J’” (means “I” in French)

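  • Inputs to the decoder = the output tokens generated so far (beginning with the start token) plus the entire sequence of outputs from the encoder (enriched vectors), which it reads through encoder–decoder attention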

8. Masked Multi-Head Attention

✅ Self-Attention (masked)

  • At “J’” → nothing before it

  • At each next word → looks at previous output tokens only

  • Masking ensures each position only attends to earlier positions in the output sequence

  • Preserves the autoregressive property necessary for coherent generation
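Masking can be sketched by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. The score matrix below is a made-up example with equal raw scores, chosen so the effect of the mask is easy to see.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of output position i attending to position j."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i lie in the future: mask them out before the softmax.
        masked_row = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(masked_row))
    return weights

# Hypothetical raw scores for three output tokens.
scores = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second to the first two, and so on, which is the autoregressive property described above.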

9. Add & Norm

  • Layer normalisation applied after masked self-attention

10. Encoder–Decoder Attention

❌ Not self-attention — this is cross-attention

  • Decoder looks back at the English encoder output:

    • “I” → “J’”

    • “love” → “aime”

    • “pizza” → “pizza”

  • Ensures only the most crucial information from the input sequence informs the output

  • Equivalent to: after attending to everyone at the party, your output (experience) is shaped by the sum of all meaningful interactions

11. Add & Norm

  • Layer normalisation applied after encoder-decoder attention

12. Feed Forward

  • Position-wise feed-forward neural network in the decoder

  • Further refines the output representation

13. Add & Norm

  • Final layer normalisation in the decoder block

🔁 Repeat (Nx)

  • Decoder block also repeated N times

  • Each repetition further refines the French output being generated


🔹 STEP-BY-STEP OUTPUT BUILDING

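Step 1: encoder reads the full input “I love pizza”; decoder starts from the start token and outputs “J’”

Step 2: decoder sees “J’” and adds “aime” → “J’aime”

Step 3: decoder sees “J’aime” and adds “la”, then “pizza” → “J’aime la pizza”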

  • Note: model adds “la” — not present in the English input

  • This is French grammar — the model is not copying, it is reconstructing meaning in the target language


🧩 WHAT THIS SHOWS (IMPORTANT)

  • The model is NOT just copying words

  • It is:

    • Understanding meaning

    • Rebuilding it in another language

    • Adding grammar rules not present in the source


🧩 WHERE ATTENTION HAPPENS

| Place | What happens |
|---|---|
| Encoder | English words relate to each other |
| Decoder (masked) | French words relate to previous French words |
| Encoder–Decoder | Connects English → French |


⚡ SIMPLE SUMMARY

  • Encoder = understand English

  • Decoder = write French

  • Attention = connect meaning between them


🧠 ONE-LINE MEMORY

Transformer = “Understand the sentence → rebuild it step-by-step in another form”


Probability and Output Generation

  • Large Language Models are probabilistic models

  • Output generation mechanics:

    • Softmax activation function applied at output layer

    • Produces a probability distribution over all possible next tokens

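    • Picks the next token from that distribution: either the single most probable token (greedy decoding) or a random draw from the K most probable tokens (top-K sampling)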

    • Process repeats token by token until full output generated

  • You don’t see the probability — you see the output

  • Example: when Copilot/Gemini generates code from a comment, it’s applying probability to determine the most likely next syntax element
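The output step can be sketched as a softmax over raw scores (logits) followed by a choice of token. The candidate tokens and their logits are invented for illustration; the sketch contrasts greedy decoding, which always takes the most probable token, with top-K sampling, which draws at random from the K most probable.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens and sample among them by probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the token after "I love".
logits = {"pizza": 3.0, "pasta": 2.0, "music": 0.5, "rain": -1.0}
probs = dict(zip(logits, softmax(list(logits.values()))))

print({t: round(p, 3) for t, p in probs.items()})
print("greedy:", greedy(probs))
print("top-2 sample:", top_k_sample(probs, 2, random.Random(0)))
```

Sampling is why the same prompt can produce different outputs on different runs, while greedy decoding is deterministic.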

Coming Up Next (Lesson 3)

  • Text processing techniques

  • Masked attention — what exactly is masking

  • Bidirectional encoders (BERT)

  • GPT architecture deep dive

  • Activation functions — types and when to use each

  • Deeper dive into neural network training mechanics (bias, weights)


Resources Shared


Next Steps

  • Michael Chang (students)

    • Complete the post-class survey shared in chat

    • Read Attention Is All You Need paper for deeper context on transformer architecture

    • Review GeeksforGeeks and W3Schools links shared during class for reinforcement learning and ML types

    • Revisit the ML vs. Deep Learning comparison slide before next session

  • Instructor

    • Next session: begin with text processing, then cover masked attention, BERT, GPT architecture, activation functions, and bias/weight mechanics in depth

    • Revisit transformer architecture diagram at start of next session to consolidate understanding

