Intro to Flux ML


Why Flux and What is Flux.jl?

  • a machine learning library written in 100% Julia code.

    • providing lightweight abstractions
      • on top of native GPU
      • and automatic differentiation support
    • enables researchers, engineers, and developers to write performant and customizable machine learning code

    Julia provides idiomatic and robust packages

    Flux is used extensively in the SciML ecosystem for differential equations

    Just like the core Julia language

    • Flux can be scaled up to run on anything from a single CPU to a GPU cluster, and probably even larger…

Building Deep Learning Models

  • Start by creating a dense layer
    • w/ a single input and output

What is a dense layer? It is also known as the fully-connected layer

  • "neurons in a fully connected layer have full connections to all activations" (in the previous layer)
using Flux

model = Dense(1, 1)
# Flux automatically sets a random weight value for the layers created

model.weight, model.bias
  • in deep learning, weights and biases are among the many tunable parameters
    • that help the model learn something of value
    • the power of deep learning is to stack layers on top of each other
  • Flux provides a `Chain` function, which allows us to connect multiple layers together
  • so they are called in sequence.

    l1 = Dense(1,1)
    l2 = Dense(1,1)
    model = Chain(l1,l2)
    
    • Models in Flux are like predictive functions
      • we take some input and make a prediction
        • and provide an output
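As a sketch of what `Dense` and `Chain` are doing under the hood, here is a plain-Julia version (the `DenseSketch` type and `chain` function are illustrative names, not Flux's API, and use the identity activation):

```julia
# A plain-Julia sketch (illustrative DenseSketch name, identity activation)
# of what Dense and Chain do under the hood.
struct DenseSketch
    W::Matrix{Float64}
    b::Vector{Float64}
end
(d::DenseSketch)(x) = d.W * x .+ d.b   # a layer is just a callable function

l1 = DenseSketch(randn(1, 1), zeros(1))
l2 = DenseSketch(randn(1, 1), zeros(1))

chain(x) = l2(l1(x))                   # Chain calls the layers in sequence

x = [0.5]
@assert length(chain(x)) == 1
# two identity-activation layers collapse into one affine map, which is
# why non-linear activations between layers matter
@assert chain(x) ≈ (l2.W * l1.W) * x .+ (l2.W * l1.b .+ l2.b)
```

The final assertion shows why stacking purely linear layers is not enough: without a non-linearity, two layers are equivalent to one.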

a dynamic state response given existing interconnected processes

  • Now let's create a convolutional layer, which will help our model do more complicated deep learning tasks
    • Using flat layers like a dense layer introduces challenges
      • the information around each pixel is lost in the learning process
        • that's where the convolutional layer comes in
l2 = Conv((5,5), 3 => 7, relu)
# the first part is the filter size, set to 5x5
#     '5x5' refers to the number of pixels
#     the layer should be looking at as it filters
#     through an image
###
# After the filter we set the input and output channel sizes
#     the I/O channels depend on the data we are working with...
#     Here we use 3 input channels and 7 output channels
###
# Last is the activation function
#     we use the ReLU activation function
#     Activation functions act as transformations
#     that help tailor the model to our needs.

# FAKE CONVOLUTIONAL DATA
xs = rand(Float32, 100, 100, 3, 50);
# 50 three-channel 100x100 images w/ random float values

# we can pass these values into the convolutional layer
# and look at what the output is
l2(xs)
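To see where the output size comes from, here is a naive plain-Julia sketch of a "valid" single-channel filter pass (`conv2d_valid` is an illustrative name, not Flux's implementation): a 5x5 filter sliding over a 100x100 image produces a 96x96 output, since 100 - 5 + 1 = 96.

```julia
# A naive "valid" 2-D filter pass in plain Julia (illustrative sketch,
# not Flux's Conv): each output pixel summarizes one kh x kw neighbourhood.
function conv2d_valid(img::Matrix{Float64}, k::Matrix{Float64})
    kh, kw = size(k)
    oh = size(img, 1) - kh + 1   # output height: 100 - 5 + 1 = 96
    ow = size(img, 2) - kw + 1   # output width
    out = zeros(oh, ow)
    for i in 1:oh, j in 1:ow
        out[i, j] = sum(img[i:i+kh-1, j:j+kw-1] .* k)
    end
    return out
end

img = rand(100, 100)
k = rand(5, 5)
@assert size(conv2d_valid(img, k)) == (96, 96)
```

Flux's `Conv` layer does this for every input/output channel pair, so `l2(xs)` above yields a 96x96x7x50 array.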
  • Convolutional layers, combined with other layer types, let us emulate what some may consider a close representation of how the brain functions…

Intro to Flux.jl

  • to get started

    sigmoid function (domain) (differentiability) (monotonicity) (asymptotes)

    • bounded, differentiable, real function
      • a function that maps any real-valued number into a value between 0 and 1, useful for converting outputs into probabilities
        • activation function in Machine Learning
          • neural networks for modeling binary classification
          • smoothing outputs
          • introducing non-linearity into models
          \(\sigma(x) = \frac{1}{1+e^{-x}}\) where,
        • \(x\) is the input value
          • \(e\) is Euler's number (~2.718)

            Euler's number

            • the base of the natural logarithm and exponential function
              • helps describe growth and change: patterns in nature, how populations expand, and how money grows with interest…

                using Flux, Plots
                
                # built-in functions...
                σ # sigmoid function
                
                plot(σ, -5, 5, label="\\sigma", xlabel="x", ylabel="\\sigma\\(x\\)")
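The properties listed above can be verified with a hand-written sigmoid (plain Julia, no Flux; the `sigmoid` name is illustrative, Flux exports it as `σ`):

```julia
# A hand-written sigmoid (plain Julia, no Flux) to verify the properties
# listed above: bounded in (0, 1), monotonic, and symmetric.
sigmoid(x) = 1 / (1 + exp(-x))

@assert sigmoid(0) == 0.5                    # crosses 1/2 at the origin
@assert 0 < sigmoid(-10) < sigmoid(10) < 1   # bounded and increasing
@assert sigmoid(10) + sigmoid(-10) ≈ 1.0     # symmetric about x = 0
```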
                
                
            • Flux allows us to automatically create neurons with the `Dense` function
              • and all the supporting architecture to work with said neuron

                model = Dense(2, 1, σ)
                Dense(2 => 1, σ)    # 3 parameters
                model.weight
                1×2 Matrix{Float32}:
                 0.992837  -0.799815
                model.bias
                1-element Vector{Float32}:
                 0.0
                
                typeof(model.weight)
                Matrix{Float32} (alias for Array{Float32, 2})
                
                x = rand(2)
                2-element Vector{Float64}:
                 0.44811035782022945
                 0.6151330675136076
                model(x)
                ┌ Warning: Layer with Float32 parameters got Float64 input.
                │   The input will be converted, but any earlier layers may be very slow.
                │   layer = Dense(2 => 1, σ)    # 3 parameters
                │   summary(x) = "2-element Vector{Float64}"
                └ @ Flux C:\Users\jph6366\.julia\packages\Flux\uRn8o\src\layers\stateless.jl:60
                1-element Vector{Float32}:
                 0.48822924
                
                ## REASON THROUGH HOW THE ABOVE IS WORKING
                
                # UNLIKE PREVIOUSLY, model.weight is no longer a `Vector`
                # and model.bias is no longer a single number
                σ.(model.weight*x + model.bias)
                1-element Vector{Float64}:
                 0.4882292155480665
                # both are plain arrays: model.weight is a matrix w/ a single row
                # and model.bias is a one-element vector
                

Other built-in functions

  • the mean square error measures the amount of error in statistical models
    • assessing the averaged squared difference between the observed and predicted values
      • when a model has no error the MSE equals zero
  • also known as the mean squared deviation
    • in regression, the mean squared error represents the average squared residual
      • as data points fall closer to the regression line, the model has less error
        • less error = better predictions

\(MSE=\frac{\sum(y_{i}-\hat{y}_{i})^{2}}{n}\)

  • \(y_{i}\) is the i-th observed value
    • \(\hat{y}_{i}\) is the corresponding predicted value
      • \(n\) = the number of observations

In Flux it is named `Flux.mse`
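As a sanity check, the formula above can be written directly in plain Julia (an illustrative `mse_sketch`; `Flux.mse` computes the same quantity):

```julia
# A plain-Julia sketch of the MSE formula above (illustrative mse_sketch
# name; Flux provides this as Flux.mse).
mse_sketch(ŷ, y) = sum((y .- ŷ) .^ 2) / length(y)

@assert mse_sketch([1.0, 2.0], [1.0, 2.0]) == 0.0  # no error => MSE is zero
@assert mse_sketch([1.0, 2.0], [2.0, 4.0]) == 2.5  # ((2-1)^2 + (4-2)^2) / 2
```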

using CSV, DataFrames

apples = CSV.read("data", DataFrame; delim='\t', normalizenames=true)
bananas = CSV.read("data", DataFrame; delim='\t', normalizenames=true)

x_apples = [ [row.red, row.green] for row in eachrow(apples)]
x_bananas = [ [row.red, row.green] for row in eachrow(bananas)]

# we use this to train our model
xs = [x_apples; x_bananas]  # concatenate into one dataset, matching ys below
ys = [fill(0, size(x_apples)); fill(1, size(x_bananas))];


model = Dense(2, 1, σ)

model(xs[end])
# examine the current loss value
loss = Flux.mse(model(xs[1]), ys[1])
# the model's prediction on the first data point, compared to the actual label
# what is the gradient at this data point
# so we can improve the predictions...

Backpropagation

  • figure out our gradients and derivatives
model.weight.data # is our data we work with
model.weight.grad # is the gradients

using Flux.Tracker
# back-bang
back!(loss)
# fills in gradients for everything the loss computation touched (all tracked)
model.weight.grad


loss = Flux.mse(model(xs[1]), ys[1])
back!(loss) # back-propagation based on loss computation
η = 0.01 # eta, the learning rate
model.weight.data .-= model.weight.grad * η
model.bias.data .-= model.bias.grad * η

model(xs[end])
# should compute an output a little bit closer to zero
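The update rule above can be derived by hand for a single sigmoid neuron. This plain-Julia sketch (illustrative `_sketch` names) applies the chain rule explicitly and checks that the loss drops:

```julia
# A hand-derived sketch of gradient descent for one neuron ŷ = σ(w*x + b)
# with loss L = (ŷ - y)^2. Chain rule: since σ' = σ(1 - σ),
# ∂L/∂(w*x + b) = 2(ŷ - y) * ŷ * (1 - ŷ).
σ_sketch(z) = 1 / (1 + exp(-z))
loss_sketch(w, b, x, y) = (σ_sketch(w * x + b) - y)^2

function descend(w, b, x, y; η = 0.1, steps = 100)
    for _ in 1:steps
        ŷ = σ_sketch(w * x + b)
        g = 2 * (ŷ - y) * ŷ * (1 - ŷ)  # gradient w.r.t. the pre-activation
        w -= η * g * x                 # ∂L/∂w = g * x
        b -= η * g                     # ∂L/∂b = g
    end
    return w, b
end

x, y = 1.0, 0.0                        # one data point with target 0
w, b = descend(0.5, 0.0, x, y)
@assert loss_sketch(w, b, x, y) < loss_sketch(0.5, 0.0, x, y)
```

`back!` and `.grad` do exactly this bookkeeping for us, for every parameter at once.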

###
# we want to loop over our entire dataset a bunch of times,
# picking a random data point each time to do stochastic gradient descent
#
for step in 1:1000
    i = rand(1:length(xs))
    η = 0.01
    loss = Flux.mse(model(xs[i]), ys[i])
    back!(loss)
    model.weight.data .-= model.weight.grad * η
    model.bias.data .-= model.bias.grad * η
end
          # TRAINS OUR MODEL
          # by implementing stochastic gradient descent by hand
          # Flux makes it easier, though

model(xs[1])
model(xs[end-1])
model = Dense(2, 1, σ)
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs,ys), opt)
# ONE TRAINING STEP ON THE MODEL

# RUN IT MORE TO PERFORM BETTER FOR SUCCESSFUL TRAINING
for step in 1:100
    Flux.train!(L, zip(xs,ys), opt)
end


Intro to Neural Networks

  • A neural network is made out of a network of neurons that are connected together
    • we can use vectors of 0 or 1 values to symbolize each output

      the idea of using vectors is that different directions in space of outputs encode information about different types of inputs

Essentially a linear system of equations…

the linear part hidden inside a non-linearity

\(\sigma(x; w^{(i)}, b^{(i)}) := \frac{1}{1+\exp\left(-(w^{(i)} x + b^{(i)})\right)}, \qquad i = 1, 2, \ldots\)

linear algebra

  • represents biases as a vector
    • and the weights as a matrix
      • each row of the weight matrix corresponds to one of the multiple outputs

a whole layer of neurons can be imagined as one matrix multiplication

  • non-binary classification

Flux provides an efficient representation for one-hot vectors

  • instead of storing these vectors in memory
    • Flux records in which position the non-zero element is

      using Flux: onehot
      
      onehot(1, 1:300)
      
  • works just like Julia arrays
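A rough plain-Julia sketch of the trick (the `OneHotSketch` type is illustrative; Flux's real `OneHotVector` is the optimized version): store only the position of the single 1 instead of the whole vector.

```julia
# A sketch of the one-hot idea: record where the non-zero element is,
# rather than storing the full vector in memory.
struct OneHotSketch <: AbstractVector{Int}
    ix::Int    # position of the non-zero element
    len::Int   # total length of the vector
end
Base.size(v::OneHotSketch) = (v.len,)
Base.getindex(v::OneHotSketch, i::Int) = Int(i == v.ix)

v = OneHotSketch(1, 300)     # behaves like onehot(1, 1:300)
@assert length(v) == 300
@assert v[1] == 1 && sum(v) == 1
```

Because it subtypes `AbstractVector`, indexing, iteration, and reductions all work as if the full vector were stored.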

all we need to do is train

  • don't forget to do training loops!
model = Dense(2, 3, σ) # 2 inputs 3 outputs
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs, ys), opt)

for _ in 1:100
    Flux.train!(L, zip(xs, ys), opt)
end

Intro to Deep Learning


model = Chain(Dense(2, 4, σ), Dense(4, 3, σ))
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs, ys), opt)
for _ in 1:1000
    Flux.train!(L, zip(xs, ys), opt)
end

Improve efficiency by batching!

Flux.batch(xs)
# changes the matrix-vector multiplication
model(Flux.batch(xs)) # TO
# a matrix-matrix product

databatch = (Flux.batch(xs), Flux.batch(ys))

Flux.train!(L, Iterators.repeated(databatch, 10000), opt)
# Iterators.repeated(...) replaces the explicit for loop
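Why batching helps can be sketched in plain Julia: stacking the input vectors as columns of one matrix turns many matrix-vector products into a single matrix-matrix product (the variable names here are illustrative).

```julia
# A plain-Julia sketch of the batching idea behind Flux.batch.
W = randn(3, 2)
xs_toy = [rand(2) for _ in 1:5]     # five 2-element data points

batch = reduce(hcat, xs_toy)        # 2x5 matrix, one column per data point
@assert size(batch) == (2, 5)

# column i of the batched product equals the i-th individual product
@assert W * batch ≈ reduce(hcat, [W * x for x in xs_toy])
```

One big matrix-matrix product is much friendlier to BLAS and GPUs than many small matrix-vector products.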

# CHECK LOSS FOR IMPROVEMENT
L(databatch[1], databatch[2])
Flux.crossentropy(...)
# designed for probability distribution

softmax([...])
# normalizes a vector so its entries are positive and sum to 1

Use softmax as a final normalization and change the loss function to `crossentropy`

model = Chain(Dense(2, 4, σ), Dense(4, 3, identity), softmax)
# the σ sigmoid between layers is the important non-linearity
L(x,y) = Flux.crossentropy(model(x), y)
# slowly and iteratively refine your model
opt = SGD(params(model))
Flux.train!(L, Iterators.repeated(databatch, 5000), opt)
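The two functions swapped in above can be sketched by hand in plain Julia (the `_sketch` names are illustrative; Flux provides `softmax` and `Flux.crossentropy`):

```julia
# Plain-Julia sketches of softmax and cross-entropy.
softmax_sketch(x) = exp.(x) ./ sum(exp.(x))
crossentropy_sketch(ŷ, y) = -sum(y .* log.(ŷ))

p = softmax_sketch([1.0, 2.0, 3.0])
@assert sum(p) ≈ 1.0                 # a valid probability distribution
@assert all(0 .< p .< 1)

y = [0.0, 0.0, 1.0]                  # one-hot target
@assert crossentropy_sketch(p, y) ≈ -log(p[3])
```

With a one-hot target, cross-entropy reduces to the negative log-probability the model assigns to the correct class, which is why it pairs naturally with softmax.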

Recognizing handwriting with a neural network

Data


using Flux, Flux.Data.MNIST, Images

labels = MNIST.labels() # groundtruth
images = MNIST.images() # images

length(images)

images[1:5] # look at first handful of images

labels[1:5] # transposed to match the above

images[1] # individual image
size(images[1]) # (28, 28), i.e. 784 pixels
typeof(images[1]) # array of grayscale pixels
# Array{Gray{Normed{UInt8,8}},2}
Float64.(images[1]) # the numbers backing those grayscale pixels
# 28x28 Array{Float64,2}

NN

  • previously we arranged vectors of vectors; now we use matrices
    • the column \(i\) of the matrix is a vector consisting of the \(i\)-th data point \(X^{(i)}\)
      • the desired outputs are given as a matrix
      • with the \(i\)-th column being the desired output \(y^{(i)}\)
n_inputs = unique(length.(images))[]
n_outputs = length(unique(labels))
  • create a vector of features, each with the floating-point values of the n_inputs pixels
    • an image is a matrix of colours, but now we need a vector of floating-point values instead
      • to do so, we arrange all elements of the matrix in a certain way into a single list

use a subset of \(N= 5000\) of the total 60,000 images available

  • hold out the rest of the data for testing as 'novel' data
preprocess(img) = vec(Float64.(img))
xs = preprocess.(images[1:5000]);
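The flattening step can be checked on a stand-in image (plain Julia; `img` here is random data, not an actual MNIST digit):

```julia
# A sketch of the preprocessing step: vec flattens the matrix column by
# column, so a 28x28 grayscale image becomes 784 input features.
img = rand(28, 28)            # plays the role of Float64.(images[1])
x = vec(img)

@assert length(x) == 28 * 28 == 784
@assert x[1:28] == img[:, 1]  # Julia is column-major: first column first
```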

creating labels with Flux.onehot

  • creates independent batches from arbitrary segments of the original dataset
function create_batch(r)
    xs = [preprocess(img) for img in images[r]]
    ys = [Flux.onehot(label, 0:9) for label in labels[r]]
    return (Flux.batch(xs), Flux.batch(ys))
end

### TRAIN MODEL ON FIRST 5000 IMAGES
trainbatch = create_batch(1:5000);

Setting up the neural network

  • since the data is complicated, we should expect to need several layers

  • the network will take as inputs the vectors \(x^{(i)}\), so the input layer has \(n\) nodes

  • the output will be a one-hot vector encoding the desired digit, 0 through 9

    • there are 10 possible categories
      • so output layer of size 10
  • the task of designing neural networks is to insert layers between the input and output layers

    • whose weights will be tuned during the learning process
model = Chain(Dense(n_inputs, n_outputs, identity), softmax)
L(x,y) = Flux.crossentropy(model(x),y)
opt = SGD(params(model))
# TRAINING!
Iterators.repeated(trainbatch, 100); # the training batch repeated 100 times, as data for train!
# current loss
L(trainbatch...)
# keep track of the loss! it gives insight into how well the model is training

### Using callbacks
# provides the ability to call a function
# at each step or every so often during training
callback() = @show(L(trainbatch...))

Flux.train!(L, Iterators.repeated(trainbatch, 3), opt; cb = callback)

### it's expensive to calculate the complete loss function
# Flux provides the throttle function
# to call a given function at most once every certain # of seconds
Flux.train!(L, Iterators.repeated(trainbatch, 40), opt; cb = Flux.throttle(callback, 1))

# Our novel data to test on
testbatch = create_batch(5001:10000);

using Printf

train_loss = Float64[]
test_loss = Float64[]

function update_loss!() # stores loss values in the above vectors
    push!(train_loss, L(trainbatch...).data)
    push!(test_loss, L(testbatch...).data)
    @printf("train loss = %.2f, test loss = %.2f\n", train_loss[end], test_loss[end])
end

# train 1000 times with update_loss!
Flux.train!(L, Iterators.repeated(trainbatch, 1000), opt; cb = Flux.throttle(update_loss!, 1))
# loss values should decrease rapidly
# loss over testbatch is greater than loss over trainbatch
# the model has no advantage on the novel data it was not trained on!

Testing

  • the model is trained
    • we check how well the resulting trained network performs
      • when we test it with images that network has not yet seen!

        i = 5001
        display(images[i])
        labels[i], findmax(model(preprocess(images[i]))) .- (0, 1)
        
        model(preprocess(images[i]))
        

Evaluation

  • What percent of images are we correctly classifying if we take the highest element to be the chosen answer?

    prediction(i) = findmax(model(preprocess(images[i])))[2] - 1
    sum(prediction(i) == labels[i] for i in 1:5000)/5000
    sum(prediction(i) == labels[i] for i in 5001:10000)/5000
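How `findmax` maps a 10-element output to a digit can be sketched on a made-up probability vector (all values here are illustrative):

```julia
# findmax returns (maximum, index); positions 1..10 encode digits 0..9,
# so subtracting 1 from the index gives the predicted digit.
probs = [0.05, 0.1, 0.6, 0.05, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02]
val, ix = findmax(probs)
@assert (val, ix) == (0.6, 3)
@assert ix - 1 == 2          # predicted digit is 2

# the same idea gives accuracy over a toy set of predictions vs. labels
preds, truth = [2, 7, 1, 0], [2, 7, 3, 0]
@assert sum(preds .== truth) / length(truth) == 0.75
```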
    

Improving the Prediction

  • try adding more layers!

  • try different activation funcs

  • or try different optimizers

    n_hidden = 20
    model = Chain(Dense(n_inputs, n_hidden, relu),
                  Dense(n_hidden, n_outputs, identity), softmax)
    L(x,y) = Flux.crossentropy(model(x), y)
    opt = ADAM(params(model))
    
    train_loss = Float64[]
    test_loss = Float64[]
    Flux.train!(L, Iterators.repeated(trainbatch, 1000), opt;
                cb = Flux.throttle(update_loss!, 1))