Terminology
- Artificial Intelligence is concerned with understanding and engineering intelligence.
- Machine Learning is a subfield of AI concerned with learning functions from data.
- Deep Learning is a sub-field of Machine Learning where we learn increasingly complex functions using (deep) neural networks.
- Neural Networks are a (large) series of linear and non-linear transformations that can, in principle, approximate any function.
- Gradient Descent is a method used to optimize some objective function (also called a loss function) by tweaking the parameters of some model (e.g. a neural network) using gradient information (partial derivatives of the loss w.r.t. the parameters). See the sketch after this list.
- Backpropagation is the method used to calculate the gradients of the objective function w.r.t. the parameters of the model.
- Supervised learning is a learning setup with instructive feedback: for a given input and the model’s prediction on it, we give the model the expected output to learn from.
- Unsupervised learning is where there is no explicit feedback as such (neither from data nor from the model designer).
- Reinforcement learning - the feedback is evaluative: instead of telling the model what the expected output is, we tell it how good its prediction was (a reward). This feedback need not be immediate!
- Self-supervised learning - there is some not-so-hard transformation from an unlabelled dataset (no feedback) to a labelled dataset (feedback), so we can leverage supervised learning, e.g. predicting the next word of raw text.
- Transformers
    - a class of neural networks that does really well on sequence modelling; unlike RNNs, they can handle large input sequences, scale really well, and can be trained efficiently (in parallel across the sequence)
    - Attention - helps the model learn what to pay attention to; attention layers are the only layers that can move information between the token positions of a sequence (see the attention sketch after this list)
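A minimal sketch of gradient descent on a toy problem; the data, learning rate, and step count are assumptions for illustration. The hand-derived partial derivatives are the backprop part (the chain rule, which backpropagation applies systematically in deeper networks):

```python
import random

# Assumed toy example (not from the lecture): fit y = w*x + b to data
# generated with w=3, b=1, by stepping against the gradient of the
# mean squared error loss.

data = [(float(x), 3.0 * x + 1.0) for x in range(10)]  # ground truth: w=3, b=1
w, b = random.random(), random.random()                # parameters to learn
lr = 0.01                                              # learning rate (step size)

for step in range(2000):
    dw, db = 0.0, 0.0
    for x, y in data:
        err = (w * x + b) - y          # forward pass + error
        dw += 2 * err * x / len(data)  # dLoss/dw via the chain rule
        db += 2 * err / len(data)      # dLoss/db
    w -= lr * dw                       # gradient descent update
    b -= lr * db

print(round(w, 2), round(b, 2))  # ~3.0 1.0
```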
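And a sketch of scaled dot-product attention; using the embeddings x as queries, keys, and values at once is a simplification (a real Transformer first applies learned linear projections to get Q, K, V):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Returns one mixed vector per position."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # similarity of every query to every key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V  # moves information between token positions

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # 4 token embeddings of dimension 8
print(attention(x, x, x).shape)   # (4, 8)
```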
Intro to Large Language Models
https://www.youtube.com/watch?v=zjkBMFhNj_g
- An LLM is just a set of two files, e.g. Llama-2-70b: the parameters file and the code that runs them.

- Model inference - running the run.c program, which takes the parameters and runs the model interactively (sketched below).
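Roughly what that inference loop looks like; the toy stand-in model and vocab size below are assumptions, run.c runs the same loop with the real 70B-parameter network:

```python
import numpy as np

vocab_size = 32000
rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for the neural network: returns next-token probabilities."""
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = [1]  # start token
for _ in range(20):
    probs = toy_model(tokens)
    tokens.append(int(rng.choice(vocab_size, p=probs)))  # sample next token

print(tokens)
```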
- Model training - how do we obtain the parameters? (numbers below are for Llama 2 70B)
    - a chunk of the internet: ~10TB of text
    - ~6,000 GPUs for ~12 days, ~$2M, ~1e24 FLOPs (see the sanity check below)
    - the result is a ~140GB parameters file
    - computationally very expensive
    - this yields the base model
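Quick back-of-the-envelope check that these numbers are consistent:

```python
# 1e24 FLOPs spread over 6,000 GPUs for 12 days:
total_flops = 1e24
gpu_seconds = 6000 * 12 * 24 * 3600
print(f"{total_flops / gpu_seconds:.1e} FLOP/s per GPU")
# ~1.6e+14, i.e. ~160 TFLOP/s sustained per GPU, plausible
# for a modern datacenter GPU
```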
- Neural networks
    - can be used to predict the next word in a sequence (see the toy sketch below)
    - prediction is compression (a good next-word predictor can be turned into a good compressor of the text, and vice versa)
    - this forces the network to learn about the world (as described by the text)
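A toy version of next-word prediction, using bigram counts instead of a neural network (a simplifying assumption); a neural LM learns the same conditional distribution P(next word | context), just with a much richer notion of context than one previous word:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # how often nxt follows prev

def next_word_probs(word):
    """Empirical distribution over the word following `word`."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'rat': 0.25}
```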
- Network “dreams” internet documents…
