3.4 Attention Mechanism & Transformers

Sun, 03 May 26

Addresses limitations of traditional sequential neural networks (e.g., RNNs)
- Information bottleneck with long sequences
- Sequential processing loses context over time
- Struggles with long-term dependencies
Attention allows models to focus on relevant input parts
- Assigns weights to important information
- Creates contextual computation through weighted sums
- Enables parallel processing vs sequential

Each token pays attention to all other tokens including itself
Creates direct connections between distant words in sequence
- Eliminates need to traverse entire sequence
- Maintains relationships regardless of position
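Uses three vectors per token for calculation:
1. Query vector - represents the token currently being focused on (what it is looking for)
2. Key vector - one per token in the sentence; compared against the query to produce attention scores
3. Value vector - one per token; the content that is combined into the output, weighted by the attention scores
Softmax normalization turns the scores into weights between 0 and 1 that sum to 1 (probability distribution)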
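
The query/key/value steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices `W_q`, `W_k`, `W_v` are random here (a real model learns them), the sizes are arbitrary, and the division by the square root of the key dimension follows the standard scaled dot-product form, which the notes do not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    W_q, W_k, W_v: (d_model, d_k) projections (random stand-ins here).
    Returns (output, weights).
    """
    Q = X @ W_q                          # one query per token
    K = X @ W_k                          # one key per token
    V = X @ W_v                          # one value per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): every token vs every token
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    output = weights @ V                 # weighted sum of value vectors
    return output, weights

# Toy input: 4 tokens with 8-dimensional embeddings (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.sum(axis=-1))  # each row sums to 1, up to floating-point error
```

Row i of `weights` shows how much token i attends to every token in the sequence, including itself.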

Multiple independent attention heads process input simultaneously
Each head calculates attention from different perspective
- Captures various semantic relationships
- Provides nuanced understanding of context
Outputs combined for comprehensive representation
More heads can improve performance (with diminishing returns) but typically at higher computational cost
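
A rough sketch of how several heads can run side by side and then be combined, assuming the common design where the model dimension is split evenly across heads. All weights are random stand-ins, `W_o` is the output projection that mixes the concatenated heads, and the sizes (5 tokens, 16 dimensions, 4 heads) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate head outputs back to (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=4)
print(output.shape, weights.shape)
```

Each head works on its own slice of the projected vectors, so the heads are free to learn different attention patterns.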

Apple example demonstrates semantic understanding:
- “Apple was juicy” vs “Apple stock crashed”
- Same word, different meanings based on context
- Self-attention calculates relationships to determine correct interpretation
Animal pronoun resolution:
- “The animal didn’t cross the street because it was tired”
- Model determines “it” refers to animal, not street
- Through attention weight calculations between tokens
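
One way to probe the "apple" example is to compare the contextual vector a model produces for the same word in two sentences. The sketch below assumes the Hugging Face `transformers` library with PyTorch and the `bert-base-uncased` checkpoint (the notes do not name a model for this example), and it assumes the target word is a single token in that model's vocabulary. The helper names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contextual_embedding(sentence, word, model_name="bert-base-uncased"):
    """Return the final-layer vector for `word` as it appears in `sentence`.

    Assumption: `word` maps to exactly one token; otherwise .index() raises.
    """
    # Imported lazily so the similarity helper works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    index = tokens.index(word.lower())
    return hidden[index].tolist()

# Usage (downloads the model on first run):
# fruit = contextual_embedding("The apple was juicy", "apple")
# stock = contextual_embedding("Apple stock crashed", "apple")
# print(cosine_similarity(fruit, stock))
```

If self-attention is doing its job, the two vectors for "apple" should differ, because each one has absorbed information from its surrounding words.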

GPT models (decoder-only) for text generation
BERT (encoder-only) for bidirectional understanding
DALL-E for image generation from text descriptions
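
The practical difference between decoder-only and encoder-only attention can be sketched with a mask. In a GPT-style decoder each token may attend only to itself and earlier tokens (a causal mask); in a BERT-style encoder every token sees the whole sequence. The function name and the all-zero toy scores are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_weights(scores, causal=False):
    """Turn raw (seq_len, seq_len) scores into attention weights.

    causal=False: bidirectional, every token attends to every token.
    causal=True:  token i attends only to positions 0..i.
    """
    scores = np.array(scores, dtype=float)
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # exp(-inf) = 0 after softmax
    return softmax(scores, axis=-1)

# Equal raw scores make the effect of the mask easy to see.
scores = np.zeros((3, 3))
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal=True)
print(bidirectional)
print(causal)
```

With equal scores, the bidirectional weights are uniform across all three tokens, while in the causal case the first token can attend only to itself.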
Industry applications:
- Healthcare: medical record analysis
- Legal: document processing
- Customer service: automated responses
Model selection considerations:
- Task complexity vs computational cost
- Context length requirements affect pricing
- Smaller models often sufficient for simple tasks
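
One reason context length drives cost: standard self-attention compares every token with every other token, so the number of attention scores grows with the square of the sequence length. The small calculation below illustrates that scaling only. It says nothing about any provider's actual pricing, and the function name and sample lengths are made up for illustration.

```python
def attention_score_count(seq_len, num_heads=1, num_layers=1):
    """Number of query-key scores computed by full self-attention."""
    return seq_len * seq_len * num_heads * num_layers

# Quadrupling the context length multiplies the score count by 16.
for n in (512, 2048, 8192):
    print(n, attention_score_count(n))
```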

Demo covered sentiment analysis using DistilBERT
- Positive/negative iPhone review classification
- High confidence scores (>0.9) for both examples
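
A demo along these lines could look like the sketch below, assuming the Hugging Face `transformers` `pipeline` API. The checkpoint name is an assumption (a commonly used DistilBERT sentiment model; the notes do not say which one the demo used), and the review texts in the usage comment are made up.

```python
def classify_reviews(reviews, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Classify each review as POSITIVE or NEGATIVE.

    Returns a list of dicts such as {"label": "POSITIVE", "score": 0.99}.
    """
    # Imported lazily so the filtering helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    return classifier(reviews)

def confident_predictions(results, threshold=0.9):
    """Keep only predictions whose confidence score exceeds the threshold."""
    return [r for r in results if r["score"] > threshold]

# Usage (downloads the model on first run; review texts are hypothetical):
# results = classify_reviews(["I love my new iPhone.", "This iPhone keeps crashing."])
# print(confident_predictions(results))
```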
Text generation using GPT-2 for customer service responses
Model selection flexibility through transformer libraries
Recommendation to experiment with different models for performance comparison
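
The text generation part of the demo can be sketched the same way, again assuming the Hugging Face `transformers` `pipeline` API. The prompt format and function names are hypothetical, and `model_name` is a parameter so that other checkpoints (for example `distilgpt2`) can be swapped in for comparison.

```python
def build_prompt(customer_message):
    """Wrap a customer message in a simple dialogue-style prompt (format is an assumption)."""
    return "Customer: " + customer_message.strip() + "\nSupport agent:"

def generate_reply(customer_message, model_name="gpt2", max_new_tokens=40):
    """Generate a draft customer service reply with a causal language model."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    prompt = build_prompt(customer_message)
    outputs = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    # By default the generated text includes the prompt, so strip it off.
    return outputs[0]["generated_text"][len(prompt):].strip()

# Usage (downloads the model on first run; the message is hypothetical):
# print(generate_reply("My order arrived damaged. What can I do?"))
# print(generate_reply("My order arrived damaged. What can I do?", model_name="distilgpt2"))
```

GPT-2 is a small base model with no instruction tuning, so its replies are best treated as rough drafts rather than ready-to-send answers.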
