9 AI 101: 2.2 Intro to Transformers and NLP

2.2 Gen AI: AI Foundations

Sun, 05 Apr 26

Overview

Course: AI LITERACY — covering fundamentals of AI, differences between ML and Deep Learning, and applications of Neural Networks
Objectives:

Understand fundamentals of AI
Differentiate ML vs. Deep Learning
Apply knowledge of Neural Networks and Deep Learning

1 Overview

What is AI?

Branch of Computer Science — performs tasks that usually require human intelligence
Tasks include: reasoning, learning, problem solving, perception
Hierarchy: AI > ML > Deep Learning

Evolution of AI

1950s–1980s: Rule-based AI (if-then logic)
1990s–2010s: Machine Learning era — learn from data, statistical models
2010s–now: Deep Learning — neural networks, self-learning
2020s–now: Generative AI revolution — AI creates text, images, and code

Types of AI

Narrow AI (Weak AI): chatbots and recommendation systems for specific tasks
General AI (Strong AI): hypothetical AI with human-like intelligence and reasoning
Super Intelligent AI: theoretical AI surpassing human intelligence

Key Components of AI

Computer Vision
Robotics and automation
Edge AI and IoT integration
Ethics and responsible AI
AI frameworks and tools

Real-World Applications

Virtual assistants and chatbots
AI in education
Finance and fraud detection
Healthcare and medical imaging
Recommendation systems
Smart home and IoT
Autonomous vehicles

Benefits of AI

Automation and efficiency
Enhanced decision-making
Personalization
Improved accuracy
24/7 availability
Cost savings
Enhanced security

Challenges of AI

Data privacy and security
High implementation costs
Job displacement
Bias and fairness
Lack of transparency
Dependence on quality data

Why AI is Powerful

Enables innovation and new business models
Scalability and global reach
Agility, faster decision-making, and competitive advantage

Rise of Machine Learning

Shift from rule-based to data-driven approaches
From expert systems to statistical learning
Advancement in algorithms
Neural network revival

Types of Machine Learning

Supervised Learning

Uses labeled data for training — each item has a predefined label or tag
Model learns from examples: given inputs paired with expected outputs, it learns to predict the output for new, unlabeled inputs
Example: classifying emails as spam or not spam using labeled email data
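The spam example above can be sketched with a 1-nearest-neighbour classifier. This is only an illustration: the two features (link count, count of "spammy" words) and every data point are invented, and nearest-neighbour is one of many supervised algorithms, not necessarily the one a real spam filter uses.

```python
# Toy labeled dataset: (number of links, number of "spammy" words) -> label.
# All values are invented for illustration.
training_data = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((5, 4), "spam"),
    ((7, 3), "spam"),
    ((6, 6), "spam"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(features):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    _, label = min(training_data, key=lambda item: squared_distance(item[0], features))
    return label

prediction_a = predict((6, 5))   # an unseen email with many links and spammy words
prediction_b = predict((0, 2))   # an unseen email with no links
```

The model never sees a label for the new emails; it infers one from the labeled examples it was given.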

Semi-Supervised Learning

Combines a small amount of labeled data with a large amount of unlabeled data
Improves model efficiency without requiring fully labeled datasets
Example: sorting large image libraries into landscape and portrait using a few labeled images

Unsupervised Learning

Analyzes and clusters unlabeled data to uncover hidden patterns and groupings
No predefined labels — model figures out its own clusters
Example: grouping customers into segments based on purchasing behavior (customer segmentation)
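A minimal sketch of customer segmentation with k-means clustering, assuming a single made-up feature (monthly spend). The function name `kmeans_1d` and the way the starting centres are chosen are assumptions made for this example.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Very small k-means for one-dimensional data (requires k >= 2)."""
    ordered = sorted(values)
    # Assumption: spread the initial centres evenly across the sorted values.
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

# Invented monthly spend per customer; no labels are provided.
monthly_spend = [12, 15, 14, 210, 190, 205, 11, 198]
centres, clusters = kmeans_1d(monthly_spend, k=2)
```

No one told the algorithm which customers are "low spend" or "high spend"; the two groups emerge from the data alone.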

Reinforcement Learning (ref: GeeksforGeeks)

Learns through trial and error — receives rewards (positive value) or penalties (negative value) for specific actions
Goal of the algorithm: maximize cumulative reward
Example: training a self-driving car — penalize running a red light (negative value), reward stopping (positive value)
Example: robot vacuum cleaner navigating a room by avoiding obstacles
Key components:

Agent — decision-maker that performs actions
Environment — world or system in which the agent operates
State — current situation or condition of the agent
Action — moves the agent can make
Reward — feedback or result from the environment based on the agent’s action
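The five components can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the notes do not name a specific one). The corridor environment, reward values, and hyperparameters below are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 5          # State: positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (0, 1)      # Action: 0 = move left, 1 = move right

def step(state, action):
    """Environment: return (next_state, reward, done) for the agent's action."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # Reward: positive value for reaching the goal
    return next_state, -0.01, False       # small penalty for every other move

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn q[state][action], an estimate of cumulative reward."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                     # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)     # explore (trial and error)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # exploit
            next_state, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
# Greedy policy after training: the best-known action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the learned policy should move right in every state, because that is the route that maximises cumulative reward.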

Class Q&A — can you reverse the reward system (reward bad actions)?

Technically yes: the reward is just a mathematical function, so it can be designed either way
Possible use case: cybersecurity, e.g. modelling how a bad actor behaves compared with normal system behavior
There is no practical reason to do this in standard applications

Applications of ML in Business Operations

Sales forecasting
Supply chain optimization — logistics efficiency, inventory forecasting, mitigating supply risk
Customer segmentation — grouping customers by purchase size, region, product type, demographics, age
Churn prediction — identify customers likely to leave, enabling proactive outreach
Fraud detection — e.g., banks detecting unusual card activity and flagging or blocking transactions
HR analytics — workforce planning, employee performance

Netflix Case Study

Challenge (2000s–early 2010s): high customer churn as streaming competitors emerged
Data leveraged: viewing history, ratings, search queries, regional preferences
Solution: ML-powered recommendation system

Collaborative filtering — recommends based on what similar user segments watch
Content-based filtering — recommends based on what the individual user has previously watched
Regional content strategy — different content libraries tailored by geography

Outcome: reduced churn by personalizing the experience to match user preferences
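Collaborative filtering can be sketched as "find the most similar user, then suggest what they liked". The sketch below is not Netflix's actual system; the user names, titles, ratings, and the choice of cosine similarity are all assumptions for illustration.

```python
import math

# Hypothetical ratings (1-5) per user; titles and numbers are invented.
ratings = {
    "ana":   {"Drama A": 5, "Drama B": 4, "Sci-Fi C": 1},
    "ben":   {"Drama A": 4, "Drama B": 5, "Drama D": 5},
    "chloe": {"Sci-Fi C": 5, "Sci-Fi E": 4, "Drama A": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Recommend the best-rated unseen title from the most similar other user."""
    others = [u for u in ratings if u != user]
    most_similar = max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    unseen = {t: r for t, r in ratings[most_similar].items() if t not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

suggestion = recommend("ana")
```

Content-based filtering would instead compare the attributes of titles the user has already watched, without looking at other users.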

Deep Learning

Subset of Machine Learning using neural networks with vast amounts of data
Loosely inspired by how the human brain works
Works with both structured and unstructured data (images, text, audio, video)
Often outperforms traditional ML on large, complex datasets by using neural networks
Extracts complex features and can achieve higher accuracy
Feature engineering is automated (unlike traditional ML, which requires manual feature engineering)
Requires high compute — GPUs and large RAM for training

Amazon Alexa Case Study

Challenge: build robust speech recognition handling diverse languages and accents (launched 2014)
Required: transcribe voice data → text → take action (control smart home, answer queries)
Solution: leveraged Deep Learning

Recurrent Neural Networks (RNNs)
Deep Neural Networks (DNNs)

Outcome: real-time voice processing understanding context and intent across global accents

Deep Learning vs. Machine Learning

Aspect | Deep Learning | Machine Learning
Scope | Subset of ML, focuses on training deep neural networks | Broad field of training algorithms
Data type | Excels with unstructured data (images, audio, video) | Works with structured and unstructured data
Feature engineering | Automated | Manual: performance depends on quality of engineered features
Compute | Requires GPUs and large RAM | Runs on standard CPU

ML and Deep Learning Applications

Healthcare: medical imaging, drug discovery, personalised treatment, detecting eye diseases (e.g., Google DeepMind)
Finance and banking: fraud detection, financial forecasting, stock trading bots
Automotive and transportation: self-driving vehicles
Agriculture: crop monitoring and optimisation

Neural Networks

Inspired by the human brain — billions of neurons connecting and communicating
Deep Learning mimics this with artificial neurons (mathematical nodes) processing data through layers

Key Neural Network Components

Input layer: receives raw data (images, text, numbers)
Hidden layer: performs computations and extracts patterns using weights and activation functions

Depth of neural network = number of hidden layers
More hidden layers → more complex patterns uncovered → more compute required
Large Language Models stack tens to around a hundred layers holding millions to billions of weights (parameters), hence expensive to train

Output layer: produces predictions or classifications based on learned patterns
Weights: determine strength of connections between neurons
Activation functions: determine whether a neuron activates by transforming input signals
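The components above fit together in a forward pass. The sketch below wires two inputs to two hidden neurons and one output neuron; the weights and biases are hand-picked, hypothetical numbers rather than trained values, and `dense` is a helper name invented for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.2], [0.3, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

inputs = [1.0, 2.0]                                            # input layer
hidden = dense(inputs, hidden_weights, hidden_biases, relu)    # hidden layer
output = dense(hidden, output_weights, output_biases, sigmoid) # output layer
```

Adding more hidden layers means calling `dense` more times in sequence, which is where the extra compute comes from.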

How Neural Networks Work

Forward propagation: data moves through layers to produce an output
Backpropagation: adjusts weights based on errors — weight update process

Example: loan/credit card application

Inputs: age, income, zip code, education level
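Weights initialized, typically to small random values (e.g., income = 0.6, age = 0.3, education level = 0.2)
Model updates weights during training to reflect actual influence on output (approved/not approved)
Loss function = measure of the error between the predicted and the actual outcome (approved/not approved); training adjusts the weights to minimise it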
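A minimal sketch of the weight-update loop for the loan example, using a single sigmoid neuron (logistic regression) trained by gradient descent. The applicant data is invented, only two of the inputs (income and age, scaled to 0..1) are used, and the learning rate and epoch count are assumptions.

```python
import math

# Hypothetical applicants: (income, age) scaled to 0..1, and 1 = approved.
applicants = [
    ((0.2, 0.3), 0), ((0.3, 0.8), 0), ((0.4, 0.5), 0),
    ((0.7, 0.4), 1), ((0.8, 0.9), 1), ((0.9, 0.2), 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Forward propagation for one neuron: weighted sum, then activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def loss(weights, bias):
    """Mean binary cross-entropy between predictions and true labels."""
    total = 0.0
    for features, label in applicants:
        p = predict(weights, bias, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(applicants)

def train(weights, bias, learning_rate=0.5, epochs=2000):
    """Gradient descent: nudge each weight against its share of the error."""
    weights = list(weights)
    n = len(applicants)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in applicants:
            error = predict(weights, bias, features) - label  # gradient of loss w.r.t. weighted sum
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

initial_weights = [0.6, 0.3]   # income, age: the starting values used in the notes
initial_loss = loss(initial_weights, 0.0)
trained_weights, trained_bias = train(initial_weights, 0.0)
final_loss = loss(trained_weights, trained_bias)
```

In this toy data approval depends on income, so training should push the income weight up and the loss down. A multi-layer network applies the same idea, with backpropagation passing the error back through each layer.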

Activation functions: determine if a neuron should activate

Types: softmax, tanh, ReLU, sigmoid — used depending on the use case
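The four activation functions named above are short enough to write out directly. This is a plain-Python sketch for scalar inputs (and a list for softmax); deep learning frameworks ship optimised versions.

```python
import math

def sigmoid(x):
    """Squashes any number into (0, 1); common for yes/no outputs."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any number into (-1, 1), centred on zero."""
    return math.tanh(x)

def relu(x):
    """Passes positive values through and zeroes out negatives."""
    return max(0.0, x)

def softmax(values):
    """Turns a list of scores into probabilities that sum to 1."""
    peak = max(values)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

As a rough guide, ReLU is a common default in hidden layers, sigmoid suits a single yes/no output, and softmax suits an output that must choose among several classes.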

Types of Neural Networks

Artificial Neural Networks (ANNs) — complex data patterns
Deep Neural Networks — multiple layers of nodes, large-scale datasets
Recurrent Neural Networks (RNNs) — sequential data; limitation: short-term memory (forgets early context in long sequences)

LSTM (Long Short-Term Memory) — improved version of RNN for time-series data

Convolutional Neural Networks (CNNs) — image processing (e.g., facial recognition, detecting pedestrians)

Choosing ML vs. Deep Learning

Key decision factors:

Problem type — classification, clustering, regression?
Volume of data — small → ML; large → Deep Learning
Data type — image → CNN; sequential/time-series → RNN/LSTM; text/audio → transformer-based models
Compute resources available
Balance of dataset — imbalanced data (e.g., spam vs. not spam) → XGBoost/gradient boosting recommended

Practical approach: train multiple models and compare performance

ML Algorithms and Libraries

Frameworks and libraries already exist in Python — no need to build from scratch
Example: import xgboost → instantiate → build and train your own model
Common supervised learning algorithms: linear regression, logistic regression, decision tree, random forest, support vector machine, K-nearest neighbour, gradient boosting (XGBoost)
XGBoost: go-to for both regression and classification, handles imbalanced datasets well

2 Intro to Transformers

Engagement Prompt

Scenario: product manager at a growing tech company exploring AI to improve customer engagement
Competitors already leveraging: AI-powered chatbots, personalised recommendations, automated content
Key models introduced: BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), T5 (Text-to-Text Transfer Transformer)

Lesson Objectives

Explain fundamental concepts of transformers and NLP — architecture, components, state-of-the-art applications
Analyse how transformers process text using tokenization, embeddings, positional encoding
Apply NLP techniques: classification, sentiment analysis, chatbots
Evaluate the impact of advanced AI models on industries

Attention Mechanism

Foundational paper: “Attention Is All You Need” — Vaswani et al., Google, 2017

arxiv.org/pdf/1706.03762

Transformers do not process data sequentially — analyse all words at once using self-attention
Attention mechanism helps transformers pay attention to the most relevant words in a sentence, even if far apart
Solves the core limitation of RNNs: forgetting early context in long sequences

Attention Mechanism Types

Soft attention
Hard attention
Self-attention ← primary focus
Multi-head attention ← primary focus
Encoder-decoder attention
Hierarchical attention

Intro to Transformer Models

Type of deep learning model that uses self-attention to process all elements of a sequence simultaneously
RNNs process data sequentially — each step handled one after another
Transformers process all elements simultaneously — enabled by self-attention and positional encoding

Self-Attention

Key component in NLP — enables the network to focus on specific words or phrases to improve context understanding
Word order still matters (e.g., reversing words in a sentence breaks meaning)
Positional encoding preserves word order without sequential processing
Analogy: reading a book

You don’t memorise every sentence — you pay attention to key elements
You can summarise and even predict the ending based on themes and flow
Transformers do the same — attend to the most relevant parts of the input

Mechanics Behind Self-Attention

Self-attention layer calculates 3 vectors from each encoder input vector:

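Query vector (Q): what the current word is looking for in the other words
Key vector (K): what each word offers; compared against queries to decide how much attention the word receives
Value vector (V): the actual content of the word, which is weighted by the attention scores to build the output

Similarity between words can in principle be measured with cosine similarity, Euclidean distance, or a dot product; the Transformer uses the scaled dot product between query and key vectors
During training, the weight matrices that produce Q, K, and V are iteratively updated
Attention for the whole input: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the key vectors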
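The scaled dot-product attention from "Attention Is All You Need", softmax(QKᵀ / √d_k) V, can be sketched in plain Python. The Q, K, and V values are invented and passed in directly; in a real transformer they come from multiplying the token embeddings by learned weight matrices, which this sketch leaves out.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(scores)                 # attention weights, sum to 1
        weights.append(row)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))])
    return outputs, weights

# Three tokens with 2-dimensional vectors (made-up numbers).
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
outputs, weights = self_attention(Q, K, V)
```

Each row of `weights` shows how much one token attends to every token in the sequence, including itself.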

Self-Attention Analogy — Party

Listening: each person (data point / word in sentence) listens to stories (inputs) of others in the room (sequence)
Scoring: assign a score to each storyteller based on relevance of their story (query-key matching)
Focusing: more attention to stories with higher scores
Combining: create a summary weighted by how much attention was paid to each person (weighted sum of values)
Your total experience from the event = sum of all interactions — same as transformer’s final output

Transformer Model Architecture

Reference: Attention Is All You Need — arxiv.org/pdf/1706.03762

Original transformer architecture = encoder + decoder
Some Large Language Models use only encoder (e.g., BERT), only decoder (e.g., GPT), or both (e.g., T5)

🧠 FULL FLOW (NOW WITH TRANSLATION)

👉 Input: “I love pizza” 👉 Output: French: “J’aime la pizza”

🔹 PART 1: INPUT (Encoder side)

1. Tokenization

“I love pizza” → [“I”, “love”, “pizza”]

Tokenization = breaking input into smaller chunks (tokens can be words, sub-words, or characters)
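A minimal sketch of word-level and character-level tokenization, plus the mapping from tokens to integer IDs that a model actually consumes. Production models typically use sub-word tokenizers (for example byte-pair encoding or WordPiece), which are not reproduced here; the helper names are invented for this example.

```python
def word_tokenize(text):
    """Split on whitespace: one token per word."""
    return text.split()

def char_tokenize(text):
    """One token per character, spaces included."""
    return list(text)

def build_vocab(tokens):
    """Assign each distinct token the next free integer ID."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

tokens = word_tokenize("I love pizza")
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]
characters = char_tokenize("pizza")
```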

2. Embedding + Position

“I” → position 1
“love” → position 2
“pizza” → position 3
Each token converted to a vector (a list of numbers, the form language models operate on)
Input vector combines: the token's learned embedding (its meaning) + a positional encoding (the position it occupies in the sentence)
Example: “bank” means different things in “river bank” vs. “deposit money in the bank”; the attention layers that follow use the surrounding words to turn the embedding into a context-specific representation
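A sketch of how position information can be added to token embeddings, using the sinusoidal positional encoding described in "Attention Is All You Need". The 4-dimensional embedding values are invented (real embeddings are learned), and positions are counted from 0 here rather than from 1.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Hypothetical 4-dimensional embeddings; real ones are learned during training.
embeddings = {
    "I":     [0.1, 0.2, 0.3, 0.4],
    "love":  [0.5, 0.1, 0.0, 0.2],
    "pizza": [0.9, 0.7, 0.3, 0.1],
}

tokens = ["I", "love", "pizza"]
# Model input = embedding + positional encoding, element by element.
model_inputs = [
    [e + p for e, p in zip(embeddings[token], positional_encoding(position, 4))]
    for position, token in enumerate(tokens)
]
```

Because the encoding differs at each position, the same word placed elsewhere in the sentence produces a different input vector, which is how word order survives parallel processing.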

🔹 PART 2: ENCODER (Understand English)

3. Multi-Head Attention

✅ Self-Attention

Model figures out meaning relationships:

“love” → connects I ↔ pizza
“I” → subject
“pizza” → object

Each word pays attention to all other words in the sentence simultaneously
Multi-head = multiple attention heads running in parallel, each capturing different relationships

4. Add & Norm

Add = residual connection: the sub-layer's input is added back to its output
Norm = layer normalisation applied to that sum
Keeps values from drifting too far as data moves from one layer to the next, which stabilises training
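In the original transformer paper, "Add" is a residual connection (the sub-layer's input is added to its output) and "Norm" is layer normalisation of the result. A minimal sketch for a single token vector, leaving out the learnable scale and shift parameters that real implementations include:

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalisation."""
    residual = [x + y for x, y in zip(sublayer_input, sublayer_output)]
    return layer_norm(residual)

# Made-up numbers: a token vector and what an attention sub-layer returned for it.
token_vector = [1.0, 2.0, 3.0, 4.0]
attention_output = [0.5, -0.5, 1.0, 0.0]
normalised = add_and_norm(token_vector, attention_output)
```

However large or small the incoming values are, the output stays on a comparable scale, which is what keeps deep stacks of blocks trainable.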

5. Feed Forward

Position-wise feed-forward neural network
Applied to each token's vector independently, with the same weights at every position
Each layer refines the representation — intermediate outputs progressively refined

6. Add & Norm

Layer normalisation applied again after feed-forward
Refines understanding further

🔁 Repeat (Nx)

Entire encoder block repeated N times (N = 6 in the original paper; large models stack tens of blocks, not millions)
Final meaning stored as context vectors — rich representations containing word meaning + position

🔹 PART 3: DECODER (Generate French)

7. Start Output

Decoder begins with a start token
Begins generating: “J’” (means “I” in French)
Input to decoder = the tokens generated so far; the encoder's output sequence (enriched vectors) is supplied separately to the encoder–decoder attention step

8. Masked Multi-Head Attention

✅ Self-Attention (masked)

At “J’” → nothing before it
At each next word → looks at previous output tokens only
Masking ensures each position only attends to earlier positions in the output sequence
Preserves the autoregressive property necessary for coherent generation
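Masking can be sketched as setting the scores for future positions to negative infinity before the softmax, so their attention weights become exactly zero. The scores below are made-up equal values so that the effect of the mask is easy to see.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax that gives zero weight to masked-out (future) positions."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    peak = max(masked)                        # finite: a position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Three output tokens with identical raw scores.
raw_scores = [[1.0, 1.0, 1.0] for _ in range(3)]
mask = causal_mask(3)
attention_weights = [masked_softmax(row, mask_row) for row, mask_row in zip(raw_scores, mask)]
```

The first token attends only to itself, the second splits attention over the first two, and only the last token sees all three.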

9. Add & Norm

Layer normalisation applied after masked self-attention

10. Encoder–Decoder Attention

❌ Not self-attention — this is cross-attention

Decoder looks back at the English encoder output:

“I” → “J’”
“love” → “aime”
“pizza” → “pizza”

Ensures only the most crucial information from the input sequence informs the output
Equivalent to: after attending to everyone at the party, your output (experience) is shaped by the sum of all meaningful interactions

11. Add & Norm

Layer normalisation applied after encoder-decoder attention

12. Feed Forward

Position-wise feed-forward neural network in the decoder
Further refines the output representation

13. Add & Norm

Final layer normalisation in the decoder block

🔁 Repeat (Nx)

Decoder block also repeated N times
Each repetition further refines the French output being generated

🔹 STEP-BY-STEP OUTPUT BUILDING

The encoder reads the full English sentence once; the decoder then builds the French output one token at a time:

Step 1: “J’”

Step 2: “J’aime”

Step 3: “J’aime la”

Step 4: “J’aime la pizza”

Note: model adds “la” — not present in the English input
This is French grammar — the model is not copying, it is reconstructing meaning in the target language

🧩 WHAT THIS SHOWS (IMPORTANT)

The model is NOT just copying words
It is:

Understanding meaning
Rebuilding it in another language
Adding grammar rules not present in the source

🧩 WHERE SELF-ATTENTION HAPPENS

Place | What happens
Encoder | English words relate to each other
Decoder (masked) | French words relate to previous French words
Encoder–Decoder | Connects English → French

⚡ SIMPLE SUMMARY

Encoder = understand English
Decoder = write French
Attention = connect meaning between them

🧠 ONE-LINE MEMORY

Transformer = “Understand the sentence → rebuild it step-by-step in another form”

Probability and Output Generation

Large Language Models are probabilistic models
Output generation mechanics:

Softmax activation function applied at output layer
Produces a probability distribution over all possible next tokens
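Picks the next token from that distribution: greedy decoding takes the single most probable token, while top-K sampling draws at random from the K most probable tokens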
Process repeats token by token until full output generated

You don’t see the probability — you see the output
Example: when Copilot/Gemini generates code from a comment, it’s applying probability to determine the most likely next syntax element
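A sketch of the output step: softmax turns raw scores (logits) into a probability distribution, and a decoding strategy picks the next token from it. The candidate tokens and logit values are invented; greedy decoding and top-K sampling are shown as two common strategies, not as what any particular product uses.

```python
import math
import random

def softmax(logits):
    """Convert a {token: logit} mapping into {token: probability}."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

def greedy(probabilities):
    """Always pick the single most probable token."""
    return max(probabilities, key=probabilities.get)

def top_k_sample(probabilities, k, rng):
    """Sample from the k most probable tokens, in proportion to their probability."""
    top = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the token that follows "J'aime".
logits = {"la": 3.0, "le": 1.5, "pizza": 1.0, "chat": -1.0}
probabilities = softmax(logits)
greedy_token = greedy(probabilities)
sampled_token = top_k_sample(probabilities, k=2, rng=random.Random(0))
```

Greedy decoding gives the same answer every time, while sampling can return different tokens on different runs, which is one reason the same prompt can produce different outputs.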

Coming Up Next (Lesson 3)

Text processing techniques
Masked attention — what exactly is masking
Bidirectional encoders (BERT)
GPT architecture deep dive
Activation functions — types and when to use each
Deeper dive into neural network training mechanics (bias, weights)

Resources Shared

GeeksforGeeks — Reinforcement Learning
W3Schools — ML fundamentals
Attention Is All You Need paper: arxiv.org/pdf/1706.03762
Course survey — to be completed after class

Next Steps

Michael Chang (students)

Complete the post-class survey shared in chat
Read Attention Is All You Need paper for deeper context on transformer architecture
Review GeeksforGeeks and W3Schools links shared during class for reinforcement learning and ML types
Revisit the ML vs. Deep Learning comparison slide before next session

Instructor

Next session: begin with text processing, then cover masked attention, BERT, GPT architecture, activation functions, and bias/weight mechanics in depth
Revisit transformer architecture diagram at start of next session to consolidate understanding
