Terminology
- Artificial Intelligence is concerned with understanding and engineering intelligence.
- Machine Learning is a subfield of AI concerned with learning functions from data.
- Deep Learning is a sub-field of Machine Learning where we learn increasingly complex functions using (deep) neural networks.
- Neural Networks are a (large) series of linear and non-linear transformations that can, in principle, approximate any function.
- Gradient Descent is a method used to optimize some objective function (also called a loss function) by tweaking the parameters of some model (e.g. a neural network) using gradient information (partial derivatives of the loss w.r.t. the parameters). See the sketch after this list.
- Backpropagation is the method used to calculate the gradients of the objective function w.r.t. the parameters of the model.
- Supervised learning is a learning setup with instructive feedback: for a given input and the model’s prediction on it, we give the model the expected output to learn from.
- Unsupervised learning is where there is no explicit feedback as such (neither from data nor from the model designer).
- Reinforcement learning - the feedback is evaluative: instead of telling the model what the expected output is, we tell it how good its prediction was (a reward). This feedback need not be immediate!
- Self-supervised learning - there is some not-so-hard transformation from an unlabelled dataset (no feedback) to a labelled dataset (feedback), so we can leverage supervised learning, e.g. predicting the next word of raw text.
- Transformers
    - a class of neural networks that does really well on sequence modelling; unlike RNNs, they can handle large input sequences, scale really well, and can be trained efficiently (in parallel across the sequence)
    - Attention - helps the model learn what to pay attention to; attention layers are the only layers that can move information between the token positions of a sequence (see the attention sketch after this list)
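A minimal sketch of gradient descent on a toy problem; the data, learning rate, and step count are assumptions for illustration. The hand-derived partial derivatives are the backprop part (the chain rule, which backpropagation applies systematically in deeper networks):

```python
import random

# Assumed toy example (not from the lecture): fit y = w*x + b to data
# generated with w=3, b=1, by stepping against the gradient of the
# mean squared error loss.

data = [(float(x), 3.0 * x + 1.0) for x in range(10)]  # ground truth: w=3, b=1
w, b = random.random(), random.random()                # parameters to learn
lr = 0.01                                              # learning rate (step size)

for step in range(2000):
    dw, db = 0.0, 0.0
    for x, y in data:
        err = (w * x + b) - y          # forward pass + error
        dw += 2 * err * x / len(data)  # dLoss/dw via the chain rule
        db += 2 * err / len(data)      # dLoss/db
    w -= lr * dw                       # gradient descent update
    b -= lr * db

print(round(w, 2), round(b, 2))  # ~3.0 1.0
```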
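And a sketch of scaled dot-product attention; using the embeddings x as queries, keys, and values at once is a simplification (a real Transformer first applies learned linear projections to get Q, K, V):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Returns one mixed vector per position."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # similarity of every query to every key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V  # moves information between token positions

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # 4 token embeddings of dimension 8
print(attention(x, x, x).shape)   # (4, 8)
```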
Intro to Large Language Models
https://www.youtube.com/watch?v=zjkBMFhNj_g
- An LLM is just a set of two files, e.g. Llama-2-70b: the parameters file and the code that runs them.

- Model inference - running the run.c program, which takes the parameters and runs the model interactively (sketched below).
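Roughly what that inference loop looks like; the toy stand-in model and vocab size below are assumptions, run.c runs the same loop with the real 70B-parameter network:

```python
import numpy as np

vocab_size = 32000
rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for the neural network: returns next-token probabilities."""
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = [1]  # start token
for _ in range(20):
    probs = toy_model(tokens)
    tokens.append(int(rng.choice(vocab_size, p=probs)))  # sample next token

print(tokens)
```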
- Model training - how do we obtain the parameters? (numbers below are for Llama 2 70B)
    - a chunk of the internet: ~10TB of text
    - ~6,000 GPUs for ~12 days, ~$2M, ~1e24 FLOPs (see the sanity check below)
    - the result is a ~140GB parameters file
    - computationally very expensive
    - this yields the base model
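Quick back-of-the-envelope check that these numbers are consistent:

```python
# 1e24 FLOPs spread over 6,000 GPUs for 12 days:
total_flops = 1e24
gpu_seconds = 6000 * 12 * 24 * 3600
print(f"{total_flops / gpu_seconds:.1e} FLOP/s per GPU")
# ~1.6e+14, i.e. ~160 TFLOP/s sustained per GPU, plausible
# for a modern datacenter GPU
```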
- Neural networks
    - can be used to predict the next word in a sequence (see the toy sketch below)
    - prediction is compression (a good next-word predictor can be turned into a good compressor of the text, and vice versa)
    - this forces the network to learn about the world (as described by the text)
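A toy version of next-word prediction, using bigram counts instead of a neural network (a simplifying assumption); a neural LM learns the same conditional distribution P(next word | context), just with a much richer notion of context than one previous word:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # how often nxt follows prev

def next_word_probs(word):
    """Empirical distribution over the word following `word`."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'rat': 0.25}
```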
- Network “dreams” internet documents…
