Julia Intro: Flux ML
Why Flux and What is Flux.jl?
- a machine learning library written in 100% Julia code
- provides lightweight abstractions
  - on top of native GPU
  - and automatic differentiation support
- enables researchers, engineers, and developers to write performant and customizable machine learning code
Julia provides idiomatic and robust packages
Flux is used extensively in the SciML ecosystem for differential equations
Just like the core Julia language
- Flux can be scaled up to run on anything from a CPU to a GPU cluster, and probably even larger…
Building Deep Learning Models
- start by creating a dense layer
  - with a single input and output
What is a dense layer? Also known as the fully-connected layer:
- "neurons in a fully connected layer have full connections to all activations" (in the previous layer)
using Flux
model = Dense(1, 1)
# Flux automatically initializes the layer with random weight values
model.weight, model.bias
- in deep learning, weights and biases are among the many tunable parameters
  - that help the model learn something of value
- the power of deep learning comes from stacking layers on top of each other
- Flux provides a Chain function, which allows us to connect multiple layers together
  - so they are called in sequence
l1 = Dense(1, 1)
l2 = Dense(1, 1)
model = Chain(l1, l2)
Models in Flux are like predictive functions
- we take some input and make a prediction
- and provide an output
- a dynamic state response given existing interconnected processes
Now let's create a convolutional layer, which will help our model do more complicated deep learning tasks
- using flat layers like dense layers introduces challenges
  - the information around each pixel is lost in the learning process
  - that's where the convolutional layer comes in
l2 = Conv((5,5), 3 => 7, relu)
# the first part is the filter, set to 5x5
# '5x5' refers to number of pixels
# the layer should be looking at as it filters
# through an image
###
# After the filter we set the input and output channel sizes
# the I/O channels depend on the data we are working with...
# Here we use 3 inputs channels and 7 output channels
###
# Last is the activation function
# we use the ReLU activation function
# Activation functions act as transformations
# that help tailor the model to our needs.
# FAKE CONVOLUTIONAL DATA
xs = rand(Float32, 100, 100, 3, 50);
# 50 random 100×100, 3-channel images
# we can pass these values into the convolutional layer
# and look at what the output is
l2(xs) # 96×96×7×50 output (5×5 filter, no padding: 100 - 5 + 1 = 96)
- Other layers in conjunction with convolutional layers allow us to emulate what some may consider to be a close representation of the function of the brain…
Intro to Flux.jl
- to get started
the sigmoid function (domain, differentiability, monotonicity, asymptotes)
- a bounded, differentiable, monotonic real function
- maps any real-valued number into a value between 0 and 1
  - useful for converting outputs into probabilities
- an activation function in machine learning
  - neural networks for modeling binary classification
  - smoothing outputs
  - introducing non-linearity into models
\(\sigma(x) = \frac{1}{1 + e^{-x}}\)
- \(x\) is the input value
- \(e\) is Euler's number (~2.718)
Euler's number
- the base of the natural logarithm and exponential function
- helps describe growth and change: patterns in nature, how populations expand, and how money grows with interest…
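Before using Flux's built-in σ, the function itself can be sketched in a few lines of plain Julia (the name `sigmoid` here is illustrative):

```julia
# sigmoid maps any real number into the interval (0, 1)
sigmoid(x) = 1 / (1 + exp(-x))

sigmoid(0.0)    # 0.5 — the midpoint
sigmoid(10.0)   # ≈ 1, approaching the upper asymptote
sigmoid(-10.0)  # ≈ 0, approaching the lower asymptote
```

The bounds at 0 and 1 are what make it useful for converting raw outputs into probabilities.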
using Flux, Plots
# built-in functions...
σ # the sigmoid function
plot(σ, -5, 5, label="\\sigma", xlabel="x", ylabel="\\sigma(x)")
Flux allows us to automatically create neurons with the `Dense` function
- and all the supporting architecture to work with said neurons
model = Dense(2, 1, σ)
# Dense(2 => 1, σ)   # 3 parameters

model.weight
# 1×2 Matrix{Float32}:
#  0.992837  -0.799815

model.bias
# 1-element Vector{Float32}:
#  0.0

typeof(model.weight)
# Matrix{Float32} (alias for Array{Float32, 2})

x = rand(2)
# 2-element Vector{Float64}:
#  0.44811035782022945
#  0.6151330675136076

model(x)
# ┌ Warning: Layer with Float32 parameters got Float64 input.
# │   The input will be converted, but any earlier layers may be very slow.
# │   layer = Dense(2 => 1, σ)  # 3 parameters
# │   summary(x) = "2-element Vector{Float64}"
# └ @ Flux C:\Users\jph6366\.julia\packages\Flux\uRn8o\src\layers\stateless.jl:60
# 1-element Vector{Float32}:
#  0.48822924

## REASON THROUGH HOW THE ABOVE IS WORKING
# Unlike previously, model.weight is no longer a `Vector`
# and model.bias is no longer a number:
# model.weight is treated as a matrix with a single row
σ.(model.weight * x .+ model.bias)
# 1-element Vector{Float64}:
#  0.4882292155480665
Other built-in functions
- the mean squared error (MSE) measures the amount of error in statistical models
- it assesses the average squared difference between the observed and predicted values
  - when a model has no error, the MSE equals zero
- also known as the mean squared deviation
- in regression, the mean squared error represents the average squared residual
  - as data points fall closer to the regression line, the model has less error
  - less error = better predictions
\(MSE=\frac{\sum(y_{i}-\hat{y}_{i})^{2}}{n}\)
- \(y_{i}\) is the i-th observed value
- \(\hat{y}_{i}\) is the corresponding predicted value
- \(n\) is the number of observations
In Flux it is named Flux.mse
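The formula above can be checked by hand in plain Julia (the `mse` helper here is illustrative; Flux.mse computes the same quantity):

```julia
# MSE: average of the squared residuals
mse(y, ŷ) = sum((y .- ŷ) .^ 2) / length(y)

y = [1.0, 2.0, 3.0]   # observed values
ŷ = [1.5, 2.5, 2.5]   # predicted values
mse(y, ŷ)             # (0.25 + 0.25 + 0.25) / 3 = 0.25
```

When prediction equals observation, every residual is zero and the MSE is zero.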
using CSV, DataFrames
# "data" below stands in for the paths to the tab-separated data files
apples  = CSV.read("data", DataFrame; delim='\t', normalizenames=true)
bananas = CSV.read("data", DataFrame; delim='\t', normalizenames=true)
x_apples  = [ [row.red, row.green] for row in eachrow(apples) ]
x_bananas = [ [row.red, row.green] for row in eachrow(bananas) ]
# we use this to train our model
xs = [x_apples; x_bananas]
ys = [fill(0, size(x_apples)); fill(1, size(x_bananas))];
model = Dense(2, 1, σ)
model(xs[end])
# examine the current loss value
loss = Flux.mse(model(xs[1]), ys[1])
# the model's prediction for the first data point, compared to its actual label
# what is the gradient at this data point?
# so we can improve the predictions...
Backpropagation
- figure out our gradients and derivatives
using Flux.Tracker
model.weight.data # the underlying data we work with
model.weight.grad # the accumulated gradients
# back-bang
back!(loss)
# mutates the gradients of everything (tracked) that the loss computation touched
model.weight.grad
loss = Flux.mse(model(xs[1]), ys[1])
back!(loss) # back-propagation based on the loss computation
η = 0.01 # eta, the learning rate
model.weight.data .-= model.weight.grad * η
model.bias.data .-= model.bias.grad * η
model(xs[end])
# should compute an output a little bit closer to zero
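The update rule above (parameter .-= gradient * η) is plain gradient descent. A minimal scalar sketch, minimizing a made-up loss \((w-3)^2\) by hand (no Flux needed):

```julia
# minimize loss(w) = (w - 3)^2, whose gradient is 2(w - 3)
function descend(w; η = 0.1, steps = 100)
    for _ in 1:steps
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= η * grad        # step against the gradient
    end
    return w
end

descend(0.0)   # ≈ 3.0, the minimizer of (w - 3)^2
```

Each step moves the parameter a small amount (scaled by η) in the direction that decreases the loss.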
###
# we want to loop over our entire dataset a bunch of times,
# picking a random data point each time: stochastic gradient descent
#
for step in 1:1000
    i = rand(1:length(xs))
    η = 0.01
    loss = Flux.mse(model(xs[i]), ys[i])
    back!(loss)
    model.weight.data .-= model.weight.grad * η
    model.bias.data .-= model.bias.grad * η
end
# TRAINS OUR MODEL
# by implementing stochastic gradient descent;
# Flux makes it easier, though:
model(xs[1])
model(xs[end-1])
model = Dense(2, 1, σ)
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs,ys), opt)
# ONE TRAINING STEP ON THE MODEL
# RUN IT MORE TO PERFORM BETTER FOR SUCCESSFUL TRAINING
for step in 1:100
Flux.train!(L, zip(xs,ys), opt)
end
Intro to Neural Networks
- A neural network is made out of a network of neurons that are connected together
- we can use vectors of 0 or 1 values to symbolize each output
  - the idea of using vectors is that different directions in the space of outputs encode information about different types of inputs
Essentially a linear system of equations…
- the linear part is hidden in a non-linearity
$σ(x; w^{(1)}, b^{(1)}) := \frac{1}{1+\exp(-(w^{(1)} x + b^{(1)}))}$
$σ(x; w^{(2)}, b^{(2)}) := \frac{1}{1+\exp(-(w^{(2)} x + b^{(2)}))}$
$σ(x; w^{(i)}, b^{(i)}) := \frac{1}{1+\exp(-(w^{(i)} x + b^{(i)}))}$
linear algebra
- represents the biases as a vector
- and the weights as a matrix
  - weights are in rows, one row per output
- every layer of neurons can then be imagined as a matrix multiplication
- non-binary classification
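This matrix view can be sketched in plain Julia (W, b, and the 2-input, 3-output sizes here are arbitrary illustrations, not values from the text):

```julia
sigmoid(x) = 1 / (1 + exp(-x))

# 2 inputs → 3 outputs: one row of W and one entry of b per output neuron
W = [1.0  0.0;
     0.0  1.0;
     1.0  1.0]
b = [0.0, 0.0, -1.0]

# the whole layer is a single matrix-vector product plus a bias vector
layer(x) = sigmoid.(W * x .+ b)

layer([1.0, 0.0])   # 3-element vector, each entry in (0, 1)
```

This is exactly the computation `Dense(2, 3, σ)` performs with its own weight matrix and bias vector.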
Flux provides an efficient representation for one-hot vectors
- instead of storing these vectors in memory
- Flux records in which position the non-zero element is
using Flux: onehot
onehot(1, 1:300)
- works just like Julia arrays
all we need to do is train
- don't forget to do training loops!
model = Dense(2, 3, σ) # 2 inputs 3 outputs
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs, ys), opt)
for _ in 1:100
Flux.train!(L, zip(xs, ys), opt)
end
Intro to Deep Learning
model = Chain(Dense(2, 4, σ), Dense(4, 3, σ))
L(x,y) = Flux.mse(model(x), y)
opt = SGD(params(model))
Flux.train!(L, zip(xs, ys), opt)
for _ in 1:1000
Flux.train!(L, zip(xs, ys), opt)
end
Improve efficiency by batching!
Flux.batch(xs)
# changes the matrix-vector multiplication
model(Flux.batch(xs))
# into a matrix-matrix product
databatch = (Flux.batch(xs), Flux.batch(ys))
Flux.train!(L, Iterators.repeated(databatch, 10000), opt)
# Iterators.repeated(...) replaces the explicit for loop
# CHECK LOSS FOR IMPROVEMENT
L(databatch[1], databatch[2])
Flux.crossentropy(...)
# designed for probability distributions
softmax([...])
# allows normalization across an entire domain
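softmax itself is a short plain-Julia definition (named `softmax_` here to avoid shadowing Flux's version):

```julia
# exponentiate, then normalize so the entries sum to 1
softmax_(v) = exp.(v) ./ sum(exp.(v))

p = softmax_([1.0, 2.0, 3.0])
sum(p)   # 1.0 up to floating-point rounding — a probability distribution
p        # the largest input receives the largest probability
```

Because the outputs form a probability distribution, they pair naturally with the crossentropy loss.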
Use softmax as a final normalization and change the loss function to `crossentropy`
model = Chain(Dense(2, 4, σ), Dense(4, 3, identity), softmax)
# the σ sigmoid in the hidden layer is an important non-linearity
L(x,y) = Flux.crossentropy(model(x), y)
# slowly and iteratively refine your model
opt = SGD(params(model))
Flux.train!(L, Iterators.repeated(databatch, 5000), opt)
Recognizing handwriting with a neural network
Data
using Flux, Flux.Data.MNIST, Images
labels = MNIST.labels() # groundtruth
images = MNIST.images() # images
length(images)
images[1:5] # look at first handful of images
labels[1:5] # transposed to match the above
images[1] # individual image
size(images[1]) # (28, 28) — 28×28 = 784 pixels
typeof(images[1])# array of grayscale pixel
# Array{Gray{Normed{UInt8, 8}},2}
Float64.(images[1]) # the numbers backing the grayscale pixels
# 28x28 Array{Float64,2}
NN
- previously we arranged vectors of vectors; now we use matrices
- column \(i\) of the matrix is a vector consisting of the \(i\)-th data point \(x^{(i)}\)
- the desired outputs are given as a matrix
  - with the \(i\)-th column being the desired output \(y^{(i)}\)
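Stacking data points as matrix columns can be sketched with `hcat`, which is essentially what Flux.batch does for equal-length vectors:

```julia
# three 2-element data points become the columns of a 2×3 matrix
x1, x2, x3 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
X = hcat(x1, x2, x3)
size(X)    # (2, 3)
X[:, 2]    # recovers the 2nd data point: [3.0, 4.0]
```

A layer applied to X then processes all columns in one matrix-matrix product.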
n_inputs = unique(length.(images))[]
n_outputs = length(unique(labels))
- create a vector of features, each holding the floating-point values of n_inputs pixels
- an image is a matrix of colours, but now we need a vector of floating-point numbers instead
  - to do so, we arrange all elements of the matrix into a single list
- use a subset of \(N = 5000\) of the total 60,000 images available
  - hold out the rest of the data for testing as 'novel' data
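The flattening follows Julia's column-major order via `vec`; a tiny sketch with a made-up 2×2 matrix:

```julia
M = [1 3;
     2 4]
vec(M)   # [1, 2, 3, 4] — columns are read top-to-bottom, left-to-right
```

As long as every image is flattened the same way, the network can learn from the resulting vectors.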
preprocess(img) = vec(Float64.(img))
xs = preprocess.(images[1:5000]);
creating labels with Flux.onehot
- create_batch creates independent batches from arbitrary segments of the original dataset
function create_batch(r)
xs = [preprocess(img) for img in images[r]]
ys = [Flux.onehot(label, 0:9) for label in labels[r]]
return (Flux.batch(xs), Flux.batch(ys))
end
### TRAIN MODEL ON FIRST 5000 IMAGES
trainbatch = create_batch(1:5000);
Setting up the neural network
- since the data is complicated, we should expect to need several layers
- the network will take the vectors \(x^{(i)}\) as inputs, so the input layer has \(n\) nodes
- the output will be a one-hot vector encoding the desired digit (0 through 9)
- there are 10 possible categories
  - so the output layer has size 10
- the task in designing neural networks is to insert layers between the input and output layers
  - whose weights will be tuned during the learning process
model = Chain(Dense(n_inputs, n_outputs, identity), softmax)
L(x,y) = Flux.crossentropy(model(x),y)
opt = SGD(params(model))
# TRAINING!
Flux.train!(L, Iterators.repeated(trainbatch, 100), opt)
# current loss
L(trainbatch...)
# keep track of the loss! it gives insight into how well the model is training
### Using callbacks
# provides the ability to call a function
# at each step, or every so often, during training
callback() = @show(L(trainbatch...))
Flux.train!(L, Iterators.repeated(trainbatch, 3), opt; cb = callback)
### it's expensive to calculate the complete loss function;
# Flux provides the throttle function
# to call a given function at most once every given number of seconds
Flux.train!(L, Iterators.repeated(trainbatch, 40), opt; cb = Flux.throttle(callback, 1))
# Our novel data to test on
testbatch = create_batch(5001:10000);
using Printf
train_loss = Float64[]
test_loss = Float64[]
function update_loss!() # stores loss values in the vectors above
    push!(train_loss, L(trainbatch...).data)
    push!(test_loss, L(testbatch...).data)
    @printf("train loss = %.2f, test loss = %.2f\n", train_loss[end], test_loss[end])
end
# train 1000 times with update_loss!
Flux.train!(L, Iterators.repeated(trainbatch, 1000), opt; cb = Flux.throttle(update_loss!, 1))
# loss values should decrease rapidly
# loss over testbatch is greater than loss over trainbatch
# model has less advantage testing the novel data!
Testing
- the model is trained
- we check how well the resulting trained network performs
- when we test it with images that the network has not yet seen!
i = 5001
display(images[i])
labels[i], findmax(model(preprocess(images[i]))) .- (0, 1)
model(preprocess(images[i]))
Evaluation
-
What percent of images are we correctly classifying if we take the highest element to be the chosen answer?
prediction(i) = findmax(model(preprocess(images[i])))[2] - 1
sum(prediction(i) == labels[i] for i in 1:5000) / 5000
sum(prediction(i) == labels[i] for i in 5001:10000) / 5000
Improving the Prediction
- try adding more layers!
- try different activation functions
- or try different optimizers
n_hidden = 20
model = Chain(Dense(n_inputs, n_hidden, relu),
              Dense(n_hidden, n_outputs, identity),
              softmax)
L(x,y) = Flux.crossentropy(model(x), y)
opt = ADAM(params(model))
train_loss = Float64[]
test_loss = Float64[]
Flux.train!(L, Iterators.repeated(trainbatch, 1000), opt;
            cb = Flux.throttle(update_loss!, 1))