2.2 Gen AI: AI Foundations
Sun, 05 Apr 26
Overview
Course: AI LITERACY — covering fundamentals of AI, differences between ML and Deep Learning, and applications of Neural Networks
Objectives:
Understand fundamentals of AI
Differentiate ML vs. Deep Learning
Apply knowledge of Neural Networks and Deep Learning
1 Overview
What is AI?
Branch of Computer Science — performs tasks that usually require human intelligence
Tasks include: reasoning, learning, problem solving, perception
Hierarchy: AI > ML > Deep Learning
Evolution of AI
1950s–1980s: Rule-based AI (if-then logic)
1990s–2010s: Machine Learning era — learn from data, statistical models
2010s–now: Deep Learning — neural networks, self-learning
2020s–now: Generative AI revolution — AI creates text, images, and code
Types of AI
Narrow AI (Weak AI): chatbots and recommendation systems for specific tasks
General AI (Strong AI): hypothetical AI with human-like intelligence and reasoning
Super Intelligent AI: theoretical AI surpassing human intelligence
Key Components of AI
Computer Vision
Robotics and automation
Edge AI and IoT integration
Ethics and responsible AI
AI frameworks and tools
Real-World Applications
Virtual assistants and chatbots
AI in education
Finance and fraud detection
Healthcare and medical imaging
Recommendation systems
Smart home and IoT
Autonomous vehicles
Benefits of AI
Automation and efficiency
Enhanced decision-making
Personalization
Improved accuracy
24/7 availability
Cost savings
Enhanced security
Challenges of AI
Data privacy and security
High implementation costs
Job displacement
Bias and fairness
Lack of transparency
Dependence on quality data
Why AI is Powerful
Enables innovation and new business models
Scalability and global reach
Agility, faster decision-making, and competitive advantage
Rise of Machine Learning
Shift from rule-based to data-driven approaches
From expert systems to statistical learning
Advancement in algorithms
Neural network revival
Types of Machine Learning
Supervised Learning
Uses labeled data for training — each item has a predefined label or tag
Model learns from examples: given inputs paired with expected outputs, it learns to predict the output for new, unlabeled inputs
Example: classifying emails as spam or not spam using labeled email data
Semi-Supervised Learning
Combines a small amount of labeled data with a large amount of unlabeled data
Improves model efficiency without requiring fully labeled datasets
Example: sorting large image libraries into landscape and portrait using a few labeled images
Unsupervised Learning
Analyzes and clusters unlabeled data to uncover hidden patterns and groupings
No predefined labels — model figures out its own clusters
Example: grouping customers into segments based on purchasing behavior (customer segmentation)
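The customer segmentation example can be sketched with k-means, one common clustering algorithm (the notes do not name a specific one). This is a minimal illustration, assuming a single made-up feature (monthly spend) and hand-picked starting centers; `kmeans_1d` and the numbers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D values, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer (made-up numbers, no labels).
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
centers, segments = kmeans_1d(monthly_spend, centers=[0, 100])
print(centers)
print(segments)
```

No labels are supplied anywhere: the low-spend and high-spend groups emerge from the data alone.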
Reinforcement Learning (ref: GeeksforGeeks)
Learns through trial and error — receives rewards (positive value) or penalties (negative value) for specific actions
Goal of the algorithm: maximize cumulative reward
Example: training a self-driving car — penalize running a red light (negative value), reward stopping (positive value)
Example: robot vacuum cleaner navigating a room by avoiding obstacles
Key components:
Agent — decision-maker that performs actions
Environment — world or system in which the agent operates
State — current situation or condition of the agent
Action — moves the agent can make
Reward — feedback or result from the environment based on the agent’s action
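The five components above can be seen together in a small sketch. It assumes tabular Q-learning (one common reinforcement learning algorithm, not one the notes name) and a made-up environment: a corridor of five cells where the agent starts on the left and is rewarded for reaching the right-most cell. The function name, reward values, and hyperparameters are all illustrative.

```python
import random

def train_corridor_agent(n_states=5, episodes=500, alpha=0.5,
                         gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor; the goal is the right-most cell."""
    rng = random.Random(seed)
    moves = (-1, +1)                 # action 0 = step left, action 1 = step right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[state][action]
    for _episode in range(episodes):
        state = 0
        for _step in range(100):     # cap the episode length
            # Agent chooses an action: explore sometimes, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Environment responds with the next state and a reward.
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else -0.01
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            # Move the estimate toward reward + discounted future value.
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
            if state == goal:
                break
    return q

q_table = train_corridor_agent()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

The small negative reward per step plays the role of the penalty, and the +1 at the goal is the positive reward; the learned policy is whichever action has the higher estimated cumulative reward in each state.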
Class Q&A — can you reverse the reward system (reward bad actions)?
Technically yes — it’s a mathematical function, can be designed either way
Practical use case: cybersecurity bad actors, reverse-engineering normal system behavior
No practical reason to do this in standard applications
Applications of ML in Business Operations
Sales forecasting
Supply chain optimization — logistics efficiency, inventory forecasting, mitigating supply risk
Customer segmentation — grouping customers by purchase size, region, product type, demographics, age
Churn prediction — identify customers likely to leave, enabling proactive outreach
Fraud detection — e.g., banks detecting unusual card activity and flagging or blocking transactions
HR analytics — workforce planning, employee performance
Netflix Case Study
Challenge (2000s–early 2010s): high customer churn as streaming competitors emerged
Data leveraged: viewing history, ratings, search queries, regional preferences
Solution: ML-powered recommendation system
Collaborative filtering — recommends based on what similar user segments watch
Content-based filtering — recommends based on what the individual user has previously watched
Regional content strategy — different content libraries tailored by geography
Outcome: reduced churn by personalizing the experience to match user preferences
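A rough sketch of the collaborative filtering idea: find the most similar user and suggest something they rated that the target user has not seen. The users, titles, and ratings are made up, and this is a toy illustration of the concept, not a description of Netflix's actual system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical ratings (0 = not watched); one column per title.
titles = ["Drama A", "Drama B", "Sci-fi A", "Sci-fi B"]
ratings = {
    "ana":  [5, 4, 0, 0],
    "ben":  [5, 5, 1, 0],
    "cara": [0, 1, 5, 4],
}

def recommend(user):
    """Suggest the best-rated unseen title from the most similar other user."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, neighbour = max(others)
    unseen = [(ratings[neighbour][i], titles[i])
              for i, own in enumerate(ratings[user])
              if own == 0 and ratings[neighbour][i] > 0]
    return max(unseen)[1] if unseen else None

print(recommend("ana"))
```

Content-based filtering would instead compare title attributes (genre, cast) against what the individual user already watched, without looking at other users.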
Deep Learning
Subset of Machine Learning using neural networks with vast amounts of data
Loosely inspired by how the human brain works
Works with both structured and unstructured data (images, text, audio, video)
Often outperforms traditional ML on large, complex datasets
Extracts complex features and achieves higher accuracy
Feature engineering is automated (unlike traditional ML, which requires manual feature engineering)
Requires high compute — GPUs and large RAM for training
Amazon Alexa Case Study
Challenge: build robust speech recognition handling diverse languages and accents (launched 2014)
Required: transcribe voice data → text → take action (control smart home, answer queries)
Solution: leveraged Deep Learning
Recurrent Neural Networks (RNNs)
Deep Neural Networks (DNNs)
Outcome: real-time voice processing understanding context and intent across global accents
Deep Learning vs. Machine Learning
| Aspect | Deep Learning | Machine Learning |
| --- | --- | --- |
| Scope | Subset of ML, focuses on training deep neural networks | Broad field of training algorithms |
| Data type | Excels with unstructured data (images, audio, video) | Works with structured and unstructured data |
| Feature engineering | Automated | Manual; performance depends on quality of engineered features |
| Compute | Requires GPUs and large RAM | Runs on standard CPU |
ML and Deep Learning Applications
Healthcare: medical imaging, drug discovery, personalised treatment, detecting eye diseases (e.g., Google DeepMind)
Finance and banking: fraud detection, financial forecasting, stock trading bots
Automotive and transportation: self-driving vehicles
Agriculture: crop monitoring and optimisation
Neural Networks
Inspired by the human brain — billions of neurons connecting and communicating
Deep Learning mimics this with artificial neurons (mathematical nodes) processing data through layers
Key Neural Network Components
Input layer: receives raw data (images, text, numbers)
Hidden layer: performs computations and extracts patterns using weights and activation functions
Depth of neural network = number of hidden layers
More hidden layers → more complex patterns uncovered → more compute required
Large Language Models have billions of parameters (weights) spread across dozens of layers, hence expensive to train
Output layer: produces predictions or classifications based on learned patterns
Weights: determine strength of connections between neurons
Activation functions: determine whether a neuron activates by transforming input signals
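How the components fit together can be sketched as a forward pass through a tiny network with two inputs, one hidden layer of two neurons, and one output neuron. The weights and biases are hand-picked for illustration only, and `dense` is a hypothetical helper name.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One layer: a weighted sum per neuron, then the activation function."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Made-up weights: one row per neuron, one column per incoming connection.
hidden_w = [[0.5, -0.2], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 2.0]                                        # input layer: raw numbers
hidden = dense(x, hidden_w, hidden_b, relu)           # hidden layer
output = dense(hidden, output_w, output_b, sigmoid)   # output layer: a probability
print(hidden, output)
```

Adding more hidden layers means chaining more `dense` calls, which is where the extra compute comes from.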
How Neural Networks Work
Forward propagation: data moves through layers to produce an output
Backpropagation: adjusts weights based on errors — weight update process
Example: loan/credit card application
Inputs: age, income, zip code, education level
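Weights start at initial values, typically random (e.g., income = 0.6, age = 0.3, education level = 0.2)
Model updates weights during training to reflect actual influence on output (approved/not approved)
Loss function: measures the error between the predicted and actual outcome (approved/not approved); training adjusts the weights to minimise it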
Activation functions: determine if a neuron should activate
Types: softmax, tanh, ReLU, sigmoid — used depending on the use case
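The loan example can be sketched as a single sigmoid neuron trained by gradient descent. This is a simplification under stated assumptions: only two inputs (income and age, scaled to 0..1), four made-up applicants, binary cross-entropy as the loss, and illustrative starting weights; a real network would have hidden layers and far more data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical applicants: (income, age) scaled to 0..1; label 1 = approved.
data = [((0.9, 0.5), 1), ((0.8, 0.3), 1), ((0.2, 0.6), 0), ((0.1, 0.4), 0)]

def predict(weights, bias, x):
    """Forward propagation: weighted sum of inputs, then the activation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def loss(weights, bias):
    """Binary cross-entropy averaged over the data."""
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(weights, bias, lr=1.0, epochs=500):
    """Weight updates: nudge each weight against its share of the error."""
    weights = list(weights)
    for _ in range(epochs):
        for x, y in data:
            error = predict(weights, bias, x) - y   # loss gradient at the neuron
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

start_w, start_b = [0.6, 0.3], 0.0      # illustrative initial weights
w, b = train(start_w, start_b)
print(loss(start_w, start_b), loss(w, b))
```

With a single neuron the backpropagated gradient is just `error * input`; in a multi-layer network the same error signal is passed backwards through each layer in turn.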
Types of Neural Networks
Artificial Neural Networks (ANNs) — complex data patterns
Deep Neural Networks — multiple layers of nodes, large-scale datasets
Recurrent Neural Networks (RNNs) — sequential data; limitation: short-term memory (forgets early context in long sequences)
LSTM (Long Short-Term Memory) — improved version of RNN for time-series data
Convolutional Neural Networks (CNNs) — image processing (e.g., facial recognition, detecting pedestrians)
Choosing ML vs. Deep Learning
Key decision factors:
Problem type — classification, clustering, regression?
Volume of data — small → ML; large → Deep Learning
Data type — image → CNN; sequential/time-series → RNN/LSTM; text/audio → transformer-based models
Compute resources available
Balance of dataset — imbalanced data (e.g., spam vs. not spam) → XGBoost/gradient boosting recommended
Practical approach: train multiple models and compare performance
ML Algorithms and Libraries
Frameworks and libraries already exist in Python — no need to build from scratch
Example: import xgboost → instantiate → build and train your own model
Common supervised learning algorithms: linear regression, logistic regression, decision tree, random forest, support vector machine, K-nearest neighbour, gradient boosting (XGBoost)
XGBoost: go-to for both regression and classification, handles imbalanced datasets well
2 Intro to Transformers
Engagement Prompt
Scenario: product manager at a growing tech company exploring AI to improve customer engagement
Competitors already leveraging: AI-powered chatbots, personalised recommendations, automated content
Key models introduced: BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), T5 (Text-to-Text Transfer Transformer)
Lesson Objectives
Explain fundamental concepts of transformers and NLP — architecture, components, state-of-the-art applications
Analyse how transformers process text using tokenization, embeddings, positional encoding
Apply NLP techniques: classification, sentiment analysis, chatbots
Evaluate the impact of advanced AI models on industries
Attention Mechanism
Foundational paper: “Attention Is All You Need” — Vaswani et al., Google, 2017
Transformers do not process data sequentially — analyse all words at once using self-attention
Attention mechanism helps transformers pay attention to the most relevant words in a sentence, even if far apart
Solves the core limitation of RNNs: forgetting early context in long sequences
Attention Mechanism Types
Soft attention
Hard attention
Self-attention ← primary focus
Multi-head attention ← primary focus
Encoder-decoder attention
Hierarchical attention
Intro to Transformer Models
Type of deep learning model that uses self-attention to process all elements of a sequence simultaneously
RNNs process data sequentially — each step handled one after another
Transformers process all elements simultaneously — enabled by self-attention and positional encoding
Self-Attention
Key component in NLP — enables the network to focus on specific words or phrases to improve context understanding
Word order still matters (e.g., reversing words in a sentence breaks meaning)
Positional encoding preserves word order without sequential processing
Analogy: reading a book
You don’t memorise every sentence — you pay attention to key elements
You can summarise and even predict the ending based on themes and flow
Transformers do the same — attend to the most relevant parts of the input
Mechanics Behind Self-Attention
Self-attention layer calculates 3 vectors from each encoder input vector:
Query Vector (Q): represents what the current word is looking for in the other words
Key Vector (K): represents what each word offers for matching; compared against queries to score relevance
Value Vector (V): represents the actual word content, combined into the final output
Similarity between words can be measured with cosine similarity, Euclidean distance, or dot product; transformer self-attention uses the scaled dot product of Q and K
During training, the weight matrices that produce Q, K, and V are iteratively updated
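Equation defines the attention output for each input word: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the key vectors and the softmax output gives the attention weights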
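A minimal sketch of scaled dot-product attention in plain Python. One simplifying assumption: the toy input vectors are used directly as Q, K, and V, whereas a real transformer first multiplies the inputs by three learned weight matrices. The vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Return one output vector and one row of attention weights per query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score the query against every key (dot product, scaled by sqrt(d_k)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy "word" vectors
outputs, weights = attention(x, x, x)
print(weights[0])
```

Each row of `weights` shows how much one word attends to every word in the sequence, including itself.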
Self-Attention Analogy — Party
Listening: each person (data point / word in sentence) listens to stories (inputs) of others in the room (sequence)
Scoring: assign a score to each storyteller based on relevance of their story (query-key matching)
Focusing: more attention to stories with higher scores
Combining: create a summary weighted by how much attention was paid to each person (weighted sum of values)
Your total experience from the event = sum of all interactions — same as transformer’s final output
Transformer Model Architecture
Reference: Attention Is All You Need — arxiv.org/pdf/1706.03762
Original transformer architecture = encoder + decoder
Some Large Language Models use only encoder (e.g., BERT), only decoder (e.g., GPT), or both (e.g., T5)
🧠 FULL FLOW (NOW WITH TRANSLATION)
👉 Input: “I love pizza” 👉 Output: French: “J’aime la pizza”
🔹 PART 1: INPUT (Encoder side)
1. Tokenization
“I love pizza” → [“I”, “love”, “pizza”]
Tokenization = breaking input into smaller chunks (tokens can be words, sub-words, or characters)
2. Embedding + Position
“I” → position 1
“love” → position 2
“pizza” → position 3
Each token converted to a vector (an array of numbers, the form language models operate on)
Input vector combines: the token's embedding (semantic meaning) + positional encoding (position it occupies in the sentence)
Example: “bank” means different things in “river bank” vs. “deposit money in the bank”; the attention layers that follow use the surrounding words to turn the initial embedding into a context-specific representation
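Steps 1 and 2 can be sketched as follows. Assumptions: tokens are split on whitespace (real models usually use sub-word tokenizers), the 4-dimensional embedding values are made up, positions are counted from 0, and the position signal is the sinusoidal encoding from the original paper.

```python
import math

def tokenize(text):
    """Word-level tokenization by whitespace (a simplification)."""
    return text.split()

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    return [
        math.sin(position / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

# Hypothetical embedding table; in a real model these values are learned.
embedding_table = {
    "I":     [0.1, 0.3, -0.2, 0.5],
    "love":  [0.7, -0.1, 0.4, 0.0],
    "pizza": [-0.3, 0.8, 0.2, 0.6],
}

tokens = tokenize("I love pizza")
# Encoder input = token embedding + positional encoding, element by element.
inputs = [
    [e + p for e, p in zip(embedding_table[tok], positional_encoding(pos, 4))]
    for pos, tok in enumerate(tokens)
]
print(tokens)
print(inputs[0])
```

Because the position signal is added to the embedding, the same word at a different position yields a different input vector, which is how word order survives parallel processing.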
🔹 PART 2: ENCODER (Understand English)
3. Multi-Head Attention
✅ Self-Attention
Model figures out meaning relationships:
“love” → connects I ↔ pizza
“I” → subject
“pizza” → object
Each word pays attention to all other words in the sentence simultaneously
Multi-head = multiple attention heads running in parallel, each capturing different relationships
4. Add & Norm
Add: residual connection adds the sub-layer's input back to its output
Norm: layer normalisation applied to the result, after each sub-layer
Controls divergence as data moves from one layer to the next
Prevents values from drifting too far between layers
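A small sketch of the Add & Norm step: add the sub-layer's input back to its output (residual connection), then normalise the result to zero mean and unit variance. It omits the learnable scale and shift parameters that real layer normalisation includes, and the numbers are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalisation."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])

# Sub-layer output with wildly different magnitudes (made-up values).
out = add_and_norm([1.0, 2.0, 3.0, 4.0], [10.0, -5.0, 0.5, 100.0])
print(out)
```

However large the incoming values are, the normalised vector stays in a similar range, which keeps training stable across many stacked blocks.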
5. Feed Forward
Position-wise feed-forward neural network
Forward propagation: processes input moving from one layer to the next
Each layer refines the representation — intermediate outputs progressively refined
6. Add & Norm
Layer normalisation applied again after feed-forward
Refines understanding further
🔁 Repeat (Nx)
Entire encoder block repeated N times (N = 6 in the original paper; large models stack dozens of transformer blocks)
Final meaning stored as context vectors — rich representations containing word meaning + position
🔹 PART 3: DECODER (Generate French)
7. Start Output
Decoder begins with a start token
Begins generating: “J’” (means “I” in French)
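Decoder receives two inputs: the tokens generated so far (beginning with the start token) and the entire sequence of outputs from the encoder (enriched vectors), used in the encoder–decoder attention at step 10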
8. Masked Multi-Head Attention
✅ Self-Attention (masked)
At “J’” → nothing before it
At each next word → looks at previous output tokens only
Masking ensures each position only attends to earlier positions in the output sequence
Preserves the autoregressive property necessary for coherent generation
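Masking can be sketched by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. The function name and the all-zero scores are illustrative; real scores would come from query-key dot products.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] = raw score of output position i attending to position j."""
    weights = []
    for i, row in enumerate(scores):
        # Hide every position after i by giving it a score of -infinity.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three output positions with equal raw scores everywhere.
w = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
for row in w:
    print(row)
```

The first position can only attend to itself, the second to the first two, and so on, which matches generating the French sentence one token at a time.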
9. Add & Norm
Layer normalisation applied after masked self-attention
10. Encoder–Decoder Attention
❌ Not self-attention — this is cross-attention
Decoder looks back at the English encoder output:
“I” → “J’”
“love” → “aime”
“pizza” → “pizza”
Ensures only the most crucial information from the input sequence informs the output
Equivalent to: after attending to everyone at the party, your output (experience) is shaped by the sum of all meaningful interactions
11. Add & Norm
Layer normalisation applied after encoder-decoder attention
12. Feed Forward
Position-wise feed-forward neural network in the decoder
Further refines the output representation
13. Add & Norm
Final layer normalisation in the decoder block
🔁 Repeat (Nx)
Decoder block also repeated N times
Each repetition further refines the French output being generated
🔹 STEP-BY-STEP OUTPUT BUILDING
Step 1: “I” → “J’”
Step 2: “I love” → “J’aime”
Step 3: “I love pizza” → “J’aime la pizza”
Note: model adds “la” — not present in the English input
This is French grammar — the model is not copying, it is reconstructing meaning in the target language
🧩 WHAT THIS SHOWS (IMPORTANT)
The model is NOT just copying words
It is:
Understanding meaning
Rebuilding it in another language
Adding grammar rules not present in the source
🧩 WHERE ATTENTION HAPPENS
| Place | What happens |
| --- | --- |
| Encoder (self-attention) | English words relate to each other |
| Decoder (masked self-attention) | French words relate to previous French words |
| Encoder–Decoder (cross-attention) | Connects English → French |
⚡ SIMPLE SUMMARY
Encoder = understand English
Decoder = write French
Attention = connect meaning between them
🧠 ONE-LINE MEMORY
Transformer = “Understand the sentence → rebuild it step-by-step in another form”
Probability and Output Generation
Large Language Models are probabilistic models
Output generation mechanics:
Softmax activation function applied at output layer
Produces a probability distribution over all possible next tokens
Picks the next token: greedy decoding takes the single most probable token; top-K sampling draws at random from the K most probable tokens
Process repeats token by token until full output generated
You don’t see the probability — you see the output
Example: when Copilot/Gemini generates code from a comment, it’s applying probability to determine the most likely next syntax element
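The output step can be sketched as softmax over raw scores followed by a selection rule. The candidate tokens and their scores are made up (imagine the model has just produced “J’aime”), and two common selection rules are shown: greedy decoding and top-K sampling.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(probs):
    """Always take the single most probable token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Sample from the k most probable tokens, weighted by probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical raw scores for the next token.
logits = {"la": 2.0, "le": 1.0, "pizza": 0.5, "chat": -1.0}
probs = softmax(logits)
print(greedy(probs))
print(top_k_sample(probs, k=2, rng=random.Random(0)))
```

Greedy decoding gives the same output every time; sampling is why the same prompt can produce different answers on different runs.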
Coming Up Next (Lesson 3)
Text processing techniques
Masked attention — what exactly is masking
Bidirectional encoders (BERT)
GPT architecture deep dive
Activation functions — types and when to use each
Deeper dive into neural network training mechanics (bias, weights)
Resources Shared
W3Schools — ML fundamentals
Attention Is All You Need paper: arxiv.org/pdf/1706.03762
Course survey — to be completed after class
Next Steps
Michael Chang (students)
Complete the post-class survey shared in chat
Read Attention Is All You Need paper for deeper context on transformer architecture
Review GeeksforGeeks and W3Schools links shared during class for reinforcement learning and ML types
Revisit the ML vs. Deep Learning comparison slide before next session
Instructor
Next session: begin with text processing, then cover masked attention, BERT, GPT architecture, activation functions, and bias/weight mechanics in depth
Revisit transformer architecture diagram at start of next session to consolidate understanding