Julia Gradient Descent

Partial Derivatives and Gradient

  • Let \(F(x,y)=(x-3y-2)^{2}+(x-4y-5)^{2}\); to differentiate it, we use the Chain Rule
    • The chain rule is one of the key concepts in calculus: it is the means to differentiate composite functions. It is also a basic component of several other calculus techniques, such as implicit differentiation, differentiation of inverse functions, and integration by substitution

There is even ChainRules.jl!

  • an entire Julia package dedicated to automatic differentiation: a set of differentiation rules independent of any particular AD package, plus a system for defining and testing rules

    \(\frac{\partial F}{\partial x} = 2(x-3y-2) + 2(x-4y-5)\)

    \(\frac{\partial F}{\partial y} = 2(x-3y-2)(-3)+2(x-4y-5)(-4)\)

    \(\nabla F(x,y) = \left[ \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y} \right]\)
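As a sanity check, the analytic partials above can be compared against central finite differences (the evaluation point and step size below are arbitrary choices for illustration):

```julia
F(x, y) = (x - 3y - 2)^2 + (x - 4y - 5)^2

# analytic partials from the chain rule above
dFdx(x, y) = 2(x - 3y - 2) + 2(x - 4y - 5)
dFdy(x, y) = 2(x - 3y - 2) * (-3) + 2(x - 4y - 5) * (-4)

# central finite differences as a numerical check
h = 1e-6
numdx(x, y) = (F(x + h, y) - F(x - h, y)) / 2h
numdy(x, y) = (F(x, y + h) - F(x, y - h)) / 2h

dFdx(1.0, 2.0), numdx(1.0, 2.0)  # both ≈ -38.0
```

The analytic and numerical values agree to within the finite-difference error, which is a quick way to catch sign mistakes in hand-derived gradients.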

Chain Rule

if \(f(u)\) is differentiable at \(u=g(x)\) and \(g(x)\) is differentiable at \(x\)

  • then the composite function \((f \circ g)(x)=f(g(x))\) is differentiable at \(x\)
    • and \((f \circ g)^{'}(x)=f^{'}(g(x))\cdot g^{'}(x)\)

the Leibniz notation is easier to understand

\(\frac{dy}{dx}=\frac{dy}{du}\cdot\frac{du}{dx}\)
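A minimal sketch of the rule in action, using a hypothetical composite \(f(u)=u^{2}\) with \(u=g(x)=3x+2\):

```julia
# hypothetical example: f(u) = u^2 composed with g(x) = 3x + 2
f(u) = u^2
g(x) = 3x + 2
fp(u) = 2u       # f'(u)
gp(x) = 3.0      # g'(x)

# chain rule: (f ∘ g)'(x) = f'(g(x)) * g'(x)
chain(x) = fp(g(x)) * gp(x)

# numerical check with a central difference
h = 1e-6
numeric(x) = (f(g(x + h)) - f(g(x - h))) / 2h

chain(2.0), numeric(2.0)  # both ≈ 48.0
```

At \(x=2\), \(g(2)=8\), so \((f \circ g)'(2)=2 \cdot 8 \cdot 3 = 48\), matching the numerical derivative.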

Chain Rule Practice

  • Suppose that the carbon monoxide pollution in the air is changing at the rate of 0.02 ppm (parts per million) for each person in a city whose population is growing at the rate of 1000 people per year.

    Find the rate at which the level of pollution is increasing with respect to time

    • Remember that the symbol \(\frac{dy}{dx}\) stands for…
      • the derivative of \(y\)
        • with respect to \(x\)
          • and represents the rate of change of \(y\) with respect to \(x\)

let \(\frac{dP}{dx} = 0.02\) ppm/person let \(\frac{dx}{dt}=1000\) people/year

Chain rule: the rate pollution increases w.r.t. time is \(\frac{dP}{dt}=\frac{dP}{dx}\frac{dx}{dt} = 0.02 \times 1000\)

the level of pollution is increasing at a rate of 20 ppm/year
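The arithmetic above can be written out directly, with the units tracked in comments:

```julia
dPdx = 0.02    # ppm per person: pollution change per additional resident
dxdt = 1000.0  # people per year: population growth rate

# chain rule: dP/dt = dP/dx * dx/dt, in ppm per year
dPdt = dPdx * dxdt  # = 20.0 ppm/year
```

Note how the "per person" and "people per" units cancel, leaving ppm per year.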

Linear Regression

  • a linear model makes a prediction
    • by simply computing a weighted sum of the input features
      • plus a constant called the bias term or intercept term

linear regression model prediction \(\hat{y} = \theta_{0}+\theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}\)

  • \(\hat{y}\) is the predicted value
  • \(n\) is the number of features
  • \(x_{i}\) is the i-th feature value
  • \(\theta_{j}\) is the j-th model parameter
    • including the intercept \(\theta_{0}\) and the feature weights \(\theta_{1},...,\theta_{n}\)

linear regression model prediction (vectorized) \(\hat{y}=h_{\theta}(x)= \theta \cdot x\)

  • \(h_{\theta}\) is the hypothesis function using model parameters \(\theta\)
  • \(\theta\) is the model's parameter vector
    • containing the intercept and feature weights
  • \(x\) is the instance's feature vector
    • containing \(x_{0},...,x_{n}\) with \(x_{0}\) always equal to 1
  • \(\theta \cdot x\) is the dot product of the vectors \(\theta\) and \(x\)
    • which is equal to \(\theta_{0}x_{0}+...+\theta_{n}x_{n}\)
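The vectorized prediction can be sketched with hypothetical parameter values (the numbers below are made up for illustration):

```julia
using LinearAlgebra  # for dot

# hypothetical parameter vector and one instance, with x0 = 1 for the intercept
θ = [2.0, 3.0, -1.0]   # θ0 (intercept), θ1, θ2
x = [1.0, 4.0, 5.0]    # x0 = 1, then the two feature values

# prediction ŷ = θ ⋅ x = θ0*1 + θ1*4 + θ2*5
ŷ = dot(θ, x)  # = 2 + 12 - 5 = 9.0
```

`dot(θ, x)` (or equivalently `θ' * x`) computes exactly the weighted sum \(\theta_{0}x_{0}+...+\theta_{n}x_{n}\) shown above.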

How do we train the model?

  • we first need a measure of how well or poorly the model fits the training data
    • a common performance measure is the root mean square error (RMSE)

      To train a linear regression model, we need to find the value of \(\theta\) that minimizes the RMSE

      • in practice it is simpler to minimize the mean squared error (MSE), and it leads to the same result

MSE cost function for a linear regression model

\(MSE(X,h_{\theta})= \frac{1}{m} \sum_{i=1}^{m}(\theta^{T}x^{(i)}-y^{(i)})^{2}\)

  • to find the value of \(\theta\) that minimizes the cost of MSE
    • there is a closed-form solution: a math equation

      the normal equation

      \(\hat{\theta} = (X^{T}X)^{-1}X^{T} y\)

      • \(\hat\theta\) is the value of theta that minimizes the cost function
      • \(y\) is the vector of target values containing \(y^{(1)}\) to \(y^{(m)}\)
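A minimal sketch of the normal equation on synthetic data (the dataset, true parameters, and noise level below are assumptions for illustration):

```julia
using Random
Random.seed!(1)

m = 100
x = rand(m)                               # hypothetical 1-feature dataset
y = 4.0 .+ 3.0 .* x .+ 0.01 .* randn(m)   # y ≈ 4 + 3x plus small noise

X = hcat(ones(m), x)      # prepend x0 = 1 for the intercept column
θ̂ = inv(X' * X) * (X' * y)  # the normal equation (X'X)⁻¹ X'y

θ̂  # ≈ [4.0, 3.0]
```

In practice `X \ y` (a least-squares solve) is preferred over forming the explicit inverse, since it is more numerically stable; both recover the intercept and slope here.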

FluxML Linear Regression

  • Generate a Dataset

    generate a matrix with numbers

    • ranging from -3.0 to 3.0 w/ a gap of 0.1 between them
    using Flux
    x = hcat(collect(Float32,-3:0.1:3)...)
    summary(x)
    

    generate the corresponding labels

    • or the \(y\) s
    • the function \(f\) maps each \(x\) to a \(y\)
      • and as \(x\) is a `Matrix`
        • the expression broadcasts the scalar values using the `@.` macro

    NOTE: these data points are too perfect and real-world data is messy

    using Flux
    x = hcat(collect(Float32,-3:0.1:3)...)
    f(x) = @. 3x + 2;
    y = f(x)
    first(y, 5)
    
    using Flux, Plots
    x = hcat(collect(Float32,-3:0.1:3)...)
    x = x .* reshape(rand(Float32, 61), (1, 61));
    f(x) = @. 3x + 2;
    y = f(x)
    plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = "", title = "Generated data", xlabel = "x", ylabel= "y");
    
    • make sure to see a graphical view of the dataset before building a model
      • this will help make a decision on what approach will be the best fit
  • Building a Model

    linear regression in mathematical form \(model(W,b,x)= Wx+b\)

    • where the W is the weight matrix
      • which, for a single feature, contains only a single element
    • and b is the bias
    linregmodel(W, b, x) = @. W*x + b
    

    the `@.` macro allows you to perform the calculations by broadcasting the scalar quantities

    • next we initialize the model parameters
      • the weight W and the bias b

    let's draw the weight from a uniform distribution and initialize the bias to 0

    using Random
    
    x = hcat(collect(Float32,-3:0.1:3)...)
    x = x .* reshape(rand(Float32, 61), (1, 61));
    f(x) = @. 3x + 2;
    y = f(x)
    linregmodel(W, b, x) = @. W*x + b
    
    W = rand(Float32, 1, 1)
    
    b = [0.0f0]
    
    size(linregmodel(W,b,x)), first(linregmodel(W,b,x)), first(y)
    

    the model works but the predictions are way off!

    • we need to train the model to improve the predictions
      • before that we need to define a loss function
        • the loss function should output a quantity
          • that we will try to minimize during training

    mean squared error loss function

    using Random
    
    x = hcat(collect(Float32,-3:0.1:3)...)
    x = x .* reshape(rand(Float32, 61), (1, 61));
    f(x) = @. 3x + 2;
    y = f(x)
    linregmodel(W, b, x) = @. W*x + b
    
    W = rand(Float32, 1, 1)
    
    b = [0.0f0]
    
    function lossfunc(weights, biases, features, labels)
        ŷ = linregmodel(weights, biases, features)
        sum((labels .- ŷ).^2) / length(features)
    end
    
    lossfunc(W, b, x, y)
    
    • the loss function is called on our \(x\) s and \(y\) s
      • it shows how far our predictions (\(\hat{y}\)) are from the real labels

    it calculates the sum of squares of residuals and divides it by the total number of data points

    now lets start actually using Flux

    • under the hood Flux calculates the output using the same expression
      • Flux will initialize the parameters for us
    using Flux
    using Random
    
    Random.seed!(47)
    
    x = hcat(collect(Float32,-3:0.1:3)...)
    x = x .* reshape(rand(Float32, 61), (1, 61));
    f(x) = @. 3x + 2;
    y = f(x)
    linregmodel(W, b, x) = @. W*x + b
    
    fluxmodel = Dense(1 => 1)
    W = fluxmodel.weight
    b = [0.0f0]
    
    function lossfunc(weights, biases, features, labels)
        ŷ = linregmodel(weights, biases, features)
        sum((labels .- ŷ).^2) / length(features)
    end
    
    function fluxloss(fluxmodel, features, labels)
        ŷ = fluxmodel(features)
        Flux.mse(ŷ, labels)
    end
    
    
    size(fluxmodel(x)), first(fluxmodel(x)), first(y), fluxloss(fluxmodel, x, y), lossfunc(W, b, x, y)
    
    
    • the losses are identical!
      • meaning the two models are identical on some level
  • Training the Model

    Lets train the model using Gradient Descent

    • according to gradient descent algorithm
      • the weights and biases should be iteratively updated using the following mathematical equations

        \(W=W-\eta * \frac{dL}{dW}\)

        \(b = b - \eta * \frac{dL}{db}\)

        \(W\) is the weight matrix

        \(b\) is the bias vector

        \(\eta\) is the learning rate

        \(\frac{dL}{dW}\) is the derivative of the loss function w.r.t. weight

        \(\frac{dL}{db}\) is the derivative of the loss function w.r.t. bias

    derivatives (in Flux) are calculated using an Automatic Differentiation tool

    • Zygote.jl

    First step

    • obtain the gradient of the loss function
      • with respect to the weights and the biases
    using Flux
    using Random
    
    Random.seed!(47)
    
    x = hcat(collect(Float32,-3:0.1:3)...)
    x = x .* reshape(rand(Float32, 61), (1, 61));
    f(x) = @. 3x + 2;
    y = f(x)
    linregmodel(W, b, x) = @. W*x + b
    
    fluxmodel = Dense(1 => 1)
    W = fluxmodel.weight
    b = [0.0f0]
    
    function lossfunc(weights, biases, features, labels)
        ŷ = linregmodel(weights, biases, features)
        sum((labels .- ŷ).^2) / length(features)
    end
    
    
    dLdW, dLdb, _, _ = gradient(lossfunc, W, b, x, y)
    
    W .= W .- 0.1 .* dLdW
    b .= b .- 0.1 .* dLdb
    
    W,b,lossfunc(W,b,x,y)
    

    we can see the loss went down, meaning we have succeeded in training the model (for one epoch)

    • we should plug the training code above into a loop
      • and train for more epochs
        • either for a fixed number of epochs or until the loss drops below a threshold

          using Flux
          using Random
          
          Random.seed!(47)
          
          x = hcat(collect(Float32,-3:0.1:3)...)
          x = x .* reshape(rand(Float32, 61), (1, 61));
          f(x) = @. 3x + 2;
          y = f(x)
          linregmodel(W, b, x) = @. W*x + b
          
          fluxmodel = Dense(1 => 1)
          W = fluxmodel.weight
          b = [0.0f0]
          
          function lossfunc(weights, biases, features, labels)
              ŷ = linregmodel(weights, biases, features)
              sum((labels .- ŷ).^2) / length(features)
          end
          
          
          function train!(f_loss, weights, biases, features, labels)
              dLdW, dLdb, _, _ = gradient(f_loss, weights, biases, features, labels)
              @. weights = weights - 0.1 * dLdW
              @. biases = biases - 0.1 * dLdb
          end
          for i ∈ 1:40 # epochs
              train!(lossfunc, W, b, x, y)
          end
          
          W, b, lossfunc(W, b, x, y)
          
          #+RESULTS:
          (Float32[2.9997993;;], Float32[1.9996992], 1.3315619f-7)
          

    we minimized the loss from about 16 initially, to about 10 after one epoch, and down to about 1.33e-7 after 40 epochs

    Go ahead and plot the fitted line on top of the \(x\) s and \(y\) s with this code

    Remember Wx + b is nothing more than a line's equation

    • slope = W[1]
    • y-intercept = b[1]
    using Plots  # assumes x, y, W, b are defined as in the training code above
    plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = "", title = "Simple Linear Regression", xlabel = "x", ylabel= "y");
    plot!((x) -> b[1] + W[1] * x, -3, 3, label="Custom model", lw=2);
    

    Next, to improve the model you can play with optimisers, the number of epochs, the learning rate, etc.

    Watch that loss go down!