Lecture 6 | Training Neural Networks I

Articles, Blog

Lecture 6 | Training Neural Networks I

Lecture 6 | Training Neural Networks I


– Okay, let’s get started. Okay, so today we’re going to
get into some of the details about how we train neural networks. So, some administrative details first. Assignment 1 is due today, Thursday, so 11:59 p.m. tonight on Canvas. We’re also going to be
releasing Assignment 2 today, and then your project proposals are due Tuesday, April 25th. So you should be really starting to think about your projects now
if you haven’t already. How many people have decided what they want to do for
their project so far? Okay, so some, some people, so yeah, everyone else, you
can go to TA office hours if you want suggestions and bounce ideas off of TAs. We also have a list of projects that other people have proposed. Some people usually
affiliated with Stanford, so on Piazza, so you
can take a look at those for additional ideas. And we also have some notes on backprop for a linear layer and a
vector and tensor derivatives that Justin’s written up, so that should help with understanding how exactly backprop works and for vectors and matrices. So these are linked to
lecture four on the syllabus and you can go and take a look at those. Okay, so where we are now. We’ve talked about how
to express a function in terms of a computational graph, that we can represent any function in terms of a computational graph. And we’ve talked more explicitly
about neural networks, which is a type of graph where we have these linear layers that we stack on top of each other with nonlinearities in between. And we’ve also talked last lecture about convolutional neural networks, which are a particular type of network that uses convolutional layers to preserve the spatial structure throughout all the the
hierarchy of the network. And so we saw exactly how
a convolution layer looked, where each activation map in the convolutional layer output is produced by sliding a filter of weights over all of the spatial
locations in the input. And we also saw that usually we can have many filters per layer, each of which produces a
separate activation map. And so what we can get
is from an input right, with a certain depth, we’ll
get an activation map output, which has some spatial
dimension that’s preserved, as well as the depth is
the total number of filters that we have in that layer. And so what we want to do is we want to learn the values of all of these weights or parameters, and we saw that we can learn our network parameters
through optimization, which we talked about little bit earlier in the course, right? And so we want to get to a
point in the loss landscape that produces a low loss, and we can do this by taking steps in the direction of the negative gradient. And so the whole process we actually call a Mini-batch Stochastic Gradient Descent where the steps are that we continuously, we sample a batch of data. We forward prop it through our computational graph
or our neural network. We get the loss at the end. We backprop through our network to calculate the gradients. And then we update the parameters or the weights in our
network using this gradient. Okay, so now for the
next couple of lectures we’re going to talk
about some of the details involved in training neural networks. And so this involves things like how do we set up our neural
network at the beginning, which activation functions that we choose, how do we preprocess the data, weight initialization,
regularization, gradient checking. We’ll also talk about training dynamics. So, how do we babysit
the learning process? How do we choose how we
do parameter updates, specific perimeter update rules, and how do we do
hyperparameter optimization to choose the best hyperparameters? And then we’ll also talk about evaluation and model ensembles. So today in the first part, I will talk about activation
functions, data preprocessing, weight initialization,
batch normalization, babysitting the learning process, and hyperparameter optimization. Okay, so first activation functions. So, we saw earlier how out
of any particular layer, we have the data coming in. We multiply by our weight in you know, fully connected or a convolutional layer. And then we’ll pass this through an activation function or nonlinearity. And we saw some examples of this. We used sigmoid previously
in some of our examples. We also saw the ReLU nonlinearity. And so today we’ll talk
more about different choices for these different nonlinearities and trade-offs between them. So first, the sigmoid,
which we’ve seen before, and probably the one we’re
most comfortable with, right? So the sigmoid function
is as we have up here, one over one plus e to the negative x. And what this does is it takes each number that’s input into the sigmoid
nonlinearity, so each element, and the elementwise squashes these into this range [0,1] right,
using this function here. And so, if you get very
high values as input, then output is going to
be something near one. If you get very low values, or, I’m sorry, very negative values, it’s going to be near zero. And then we have this regime near zero that it’s in a linear regime. It looks a bit like a linear function. And so this is been historically popular, because sigmoids, in a sense, you can interpret them as a kind of a saturating firing
rate of a neuron, right? So if it’s something between zero and one, you could think of it as a firing rate. And we’ll talk later about
other nonlinearities, like ReLUs that, in practice,
actually turned out to be more biologically plausible, but this does have a
kind of interpretation that you could make. So if we look at this
nonlinearity more carefully, there’s several problems that
there actually are with this. So the first is that saturated neurons can kill off the gradient. And so what exactly does this mean? So if we look at a sigmoid gate right, a node in our computational graph, and we have our data X as input into it, and then we have the output of the sigmoid gate coming out of it, what does the gradient flow look like as we’re coming back? We have dL over d sigma right? The upstream gradient coming down, and then we’re going to
multiply this by dSigma over dX. This will be the gradient
of a local sigmoid function. And we’re going to chain these together for our downstream
gradient that we pass back. So who can tell me what happens when X is equal to -10? It’s very negative. What does is gradient look like? Zero, yeah, so that’s right. So the gradient become zero and that’s because in this negative, very negative region of the sigmoid, it’s essentially flat,
so the gradient is zero, and we chain any upstream
gradient coming down. We multiply by basically
something near zero, and we’re going to get
a very small gradient that’s flowing back downwards, right? So, in a sense, after the chain rule, this kills the gradient flow
and you’re going to have a zero gradient passed
down to downstream nodes. And so what happens
when X is equal to zero? So there it’s, yeah,
it’s fine in this regime. So, in this regime near zero, you’re going to get a
reasonable gradient here, and then it’ll be fine for backprop. And then what about X equals 10? Zero, right. So again, so when X is
equal to a very negative or X is equal to large positive numbers, then these are all regions where the sigmoid function is flat, and it’s going to kill off the gradient and you’re not going to get
a gradient flow coming back. Okay, so a second problem is that the sigmoid outputs are not zero centered. And so let’s take a look
at why this is a problem. So, consider what happens when the input to a neuron is always positive. So in this case, all of our Xs
we’re going to say is positive. It’s going to be multiplied
by some weight, W, and then we’re going to run it through our activation function. So what can we say about
the gradients on W? So think about what the local
gradient is going to be, right, for this linear layer. We have DL over whatever
the activation function, the loss coming down, and then we have our local gradient, which is going to be basically X, right? And so what does this mean,
if all of X is positive? Okay, so I heard it’s
always going to be positive. So that’s almost right. It’s always going to be either positive, or all positive or all negative, right? So, our upstream gradient coming down is DL over our loss. L is going to be DL over DF. and this is going to be
either positive or negative. It’s some arbitrary gradient coming down. And then our local gradient
that we multiply this by is, if we’re going to find the gradients on W, is going to be DF over DW,
which is going to be X. And if X is always positive
then the gradients on W, which is multiplying these two together, are going to always be the sign of the upstream
gradient coming down. And so what this means is
that all the gradients of W, since they’re always either
positive or negative, they’re always going to
move in the same direction. You’re either going to
increase all of the, when you do a parameter update, you’re going to either
increase all of the values of W by a positive amount, or
differing positive amounts, or you will decrease them all. And so the problem with this is that, this gives very inefficient
gradient updates. So, if you look at on the right here, we have an example of a case where, let’s say W is two-dimensional, so we have our two axes for W, and if we say that we can only have all positive or all negative updates, then we have these two quadrants, and, are the two places where the axis are either all positive or negative, and these are the only directions in which we’re allowed to make a gradient update. And so in the case where, let’s say our hypothetical optimal W is actually this blue vector here, right, and we’re starting off
at you know some point, or at the top of the the the
beginning of the red arrows, we can’t just directly take a gradient update in this direction, because this is not in one of those two allowed gradient directions. And so what we’re going to have to do, is we’ll have to take a
sequence of gradient updates. For example, in these red arrow directions that are each in allowed directions, in order to finally get to this optimal W. And so this is why also, in general, we want a zero mean data. So, we want our input X to be zero meaned, so that we actually have
positive and negative values and we don’t get into this problem of the gradient updates. They’ll be all moving
in the same direction. So is this clear? Any questions on this point? Okay. Okay, so we’ve talked about these two main problems of the sigmoid. The saturated neurons
can kill the gradients if we’re too positive or
too negative of an input. They’re also not zero-centered and so we get these, this inefficient kind of gradient update. And then a third problem, we have an exponential function in here, so this is a little bit
computationally expensive. In the grand scheme of your network, this is usually not the main problem, because we have all these convolutions and dot products that
are a lot more expensive, but this is just a minor
point also to observe. So now we can look at a second activation function here at tanh. And so this looks very
similar to the sigmoid, but the difference is that now it’s squashing to the range [-1, 1]. So here, the main difference is that it’s now zero-centered, so we’ve gotten rid of the
second problem that we had. It still kills the gradients,
however, when it’s saturated. So, you still have these regimes where the gradient is essentially flat and you’re going to
kill the gradient flow. So this is a bit better than the sigmoid, but it still has some problems. Okay, so now let’s look at
the ReLU activation function. And this is one that we saw
in our examples last lecture when we were talking about the convolutional neural network. And we saw that we interspersed
ReLU nonlinearities between many of the convolutional layers. And so, this function is f of
x equals max of zero and x. So it takes an elementwise
operation on your input and basically if your input is negative, it’s going to put it to zero. And then if it’s positive, it’s going to be just passed through. It’s the identity. And so this is one that’s
pretty commonly used, and if we look at this one and look at and think about the problems that we saw earlier with
the sigmoid and the tanh, we can see that it doesn’t saturate in the positive region. So there’s whole half of our input space where it’s not going to saturate, so this is a big advantage. So this is also
computationally very efficient. We saw earlier that the sigmoid has this E exponential in it. And so the ReLU is just this simple max and there’s, it’s extremely fast. And in practice, using this ReLU, it converges much faster than
the sigmoid and the tanh, so about six times faster. And it’s also turned out to be more biologically plausible than the sigmoid. So if you look at a neuron and you look at what the inputs look like, and you look at what
the outputs look like, and you try to measure this
in neuroscience experiments, you’ll see that this one is actually a closer approximation to what’s happening than sigmoids. And so ReLUs were starting to be used a lot around 2012 when we had AlexNet, the first major
convolutional neural network that was able to do well on ImageNet and large-scale data. They used the ReLU in their experiments. So a problem however, with the ReLU, is that it’s still, it’s not
not zero-centered anymore. So we saw that the sigmoid
was not zero-centered. Tanh fixed this and now
ReLU has this problem again. And so that’s one of
the issues of the ReLU. And then we also have
this further annoyance of, again we saw that in the
positive half of the inputs, we don’t have saturation, but this is not the case
of the negative half. Right, so just thinking about this a little bit more precisely. So what’s happening here
when X equals negative 10? So zero gradient, that’s right. What happens when X is
equal to positive 10? It’s good, right. So, we’re in the linear regime. And then what happens
when X is equal to zero? Yes, it undefined here, but in practice, we’ll
say, you know, zero, right. And so basically, it’s
killing the gradient in half of the regime. And so we can get this phenomenon of basically dead ReLUs, when we’re in this bad part of the regime. And so there’s, you can look at this in, as coming from several potential reasons. And so if we look at our data cloud here, this is all of our training data, then if we look at where
the ReLUs can fall, so the ReLUs can be,
each of these is basically the half of the plane where
it’s going to activate. And so each of these is the plane that defines each of these ReLUs, and we can see that you
can have these dead ReLUs that are basically off of the data cloud. And in this case, it will never
activate and never update, as compared to an active ReLU where some of the data is going to be positive and passed through and some won’t be. And so there’s several reasons for this. The first is that it can happen when you have bad initialization. So if you have weights
that happen to be unlucky and they happen to be off the data cloud, so they happen to specify
this bad ReLU over here. Then they’re never going to get a data input that causes it to activate, and so they’re never going to get good gradient flow coming back. And so it’ll just never
update and never activate. What’s the more common case is when your learning rate is too high. And so this case you started
off with an okay ReLU, but because you’re making
these huge updates, the weights jump around and then your ReLU unit in a sense, gets knocked off of the data manifold. And so this happens through training. So it was fine at the beginning and then at some point,
it became bad and it died. And so if in practice, if you freeze a network
that you’ve trained and you pass the data through, you can see it actually
is much as 10 to 20% of the network is these dead ReLUs. And so you know that’s a problem, but also most networks do have this type of problem when you use ReLUs. Some of them will be dead, and in practice, people look into this, and it’s a research problem, but it’s still doing okay
for training networks. Yeah, is there a question? [student speaking off mic] Right. So the question is, yeah, so the data cloud is
just your training data. [student speaking off mic] Okay, so the question is when, how do you tell when the ReLU
is going to be dead or not, with respect to the data cloud? And so if you look at, this is an example of like a
simple two-dimensional case. And so our ReLU, we’re going
to get our input to the ReLU, which is going to be a basically you know, W1 X1 plus W2 X2, and it we apply this, so that that defines this this
separating hyperplane here, and then we’re going to take half of it that’s going to be positive, and half of it’s going to be killed off, and so yes, so you, you know you just, it’s whatever the weights happened to be, and where the data happens
to be is where these, where these hyperplanes fall, and so, so yeah so just throughout
the course of training, some of your ReLUs will
be in different places, with respect to the data cloud. Oh, question. [student speaking off mic] Yeah. So okay, so the question is for the sigmoid we talked
about two drawbacks, and one of them was that the
neurons can get saturated, so let’s go back to the sigmoid here, and the question was this is not the case, when all of your inputs are positive. So when all of your inputs are positive, they’re all going to be coming in in this zero plus region here, and so you can still
get a saturating neuron, because you see up in
this positive region, it also plateaus at one, and so when it’s when you
have large positive values as input you’re also going
to get the zero gradient, because you have you
have a flat slope here. Okay. Okay, so in practice people
also like to initialize ReLUs with slightly positive biases, in order to increase the
likelihood of it being active at initialization
and to get some updates. Right and so this basically
just biases towards more ReLUs firing at the beginning, and in practice some say that it helps. Some say that it doesn’t. Generally people don’t always use this. It’s yeah, a lot of times
people just initialize it with zero biases still. Okay, so now we can look
at some modifications on the ReLU that have come out since then, and so one example is this leaky ReLU. And so this looks very
similar to the original ReLU, and the only difference is
that now instead of being flat in the negative regime, we’re going to give a
slight negative slope here And so this solves a lot of the problems that we mentioned earlier. Right here we don’t have
any saturating regime, even in the negative space. It’s still very computationally efficient. It still converges faster
than sigmoid and tanh, very similar to a ReLU. And it doesn’t have this dying problem. And there’s also another example is the parametric rectifier, so PReLU. And so in this case it’s
just like a leaky ReLU where we again have this sloped region in the negative space, but now this slope in the negative regime is determined through
this alpha parameter, so we don’t specify,
we don’t hard-code it. but we treat it as now a parameter that we can backprop into and learn. And so this gives it a
little bit more flexibility. And we also have something called an Exponential Linear Unit, an ELU, so we have all these
different LUs, basically. and this one again, you know, it has all the benefits of the ReLu, but now you’re, it is also
closer to zero mean outputs. So, that’s actually an
advantage that the leaky ReLU, parametric ReLU, a lot
of these they allow you to have your mean closer to zero, but compared with the leaky ReLU, instead of it being sloped
in the negative regime, here you actually are building back in a negative saturation regime, and there’s arguments that
basically this allows you to have some more robustness to noise, and you basically get
these deactivation states that can be more robust. And you can look at this paper for, there’s a lot of kind
of more justification for why this is the case. And in a sense this is kind of something in between the ReLUs and the leaky ReLUs, where has some of this shape, which the Leaky ReLU does, which gives it closer to zero mean output, but then it also still has some of this more saturating behavior that ReLUs have. A question? [student speaking off mic] So, whether this parameter alpha is going to be specific for each neuron. So, I believe it is often specified, but I actually can’t remember exactly, so you can look in the paper for exactly, yeah, how this is defined, but yeah, so I believe
this function is basically very carefully designed in order to have nice desirable properties. Okay, so there’s basically all of these kinds of variants on the ReLU. And so you can see that,
all of these it’s kind of, you can argue that each one
may have certain benefits, certain drawbacks in practice. People just want to run
experiments all of them, and see empirically what works better, try and justify it, and
come up with new ones, but they’re all different things that are being experimented with. And so let’s just mention one more. This is Maxout Neuron. So, this one looks a little bit different in that it doesn’t have the
same form as the others did of taking your basic dot product, and then putting this element-wise nonlinearity in front of it. Instead, it looks like this, this max of W dot product of X plus B, and a second set of weights, W2 dot product with X plus B2. And so what does this,
is this is taking the max of these two functions in a sense. And so what it does is
it generalizes the ReLU and the leaky ReLu, because you’re just you’re
taking the max over these two, two linear functions. And so what this give us, it’s again you’re operating
in a linear regime. It doesn’t saturate and it doesn’t die. The problem is that here, you are doubling the number
of parameters per neuron. So, each neuron now has this
original set of weights, W, but it now has W1 and W2,
so you have twice these. So in practice, when we look at all of
these activation functions, kind of a good general
rule of thumb is use ReLU. This is the most standard one that generally just works well. And you know you do want
to be careful in general with your learning rates
to adjust them based, see how things do. We’ll talk more about
adjusting learning rates later in this lecture, but you can also try out some of these fancier activation functions, the leaky ReLU, Maxout, ELU, but these are generally, they’re still kind of more experimental. So, you can see how they
work for your problem. You can also try out tanh, but probably some of these ReLU and ReLU variants are going to be better. And in general don’t use sigmoid. This is one of the earliest
original activation functions, and ReLU and these other variants have generally worked better since then. Okay, so now let’s talk a little bit about data preprocessing. Right, so the activation function, we design this is part of our network. Now we want to train the network, and we have our input data that we want to start training from. So, generally we want to
always preprocess the data, and this is something that
you’ve probably seen before in machine learning
classes if you taken those. And some standard types
of preprocessing are, you take your original data and you want to zero mean them, and then you probably want
to also normalize that, so normalized by the standard deviation, And so why do we want to do this? For zero centering, you
can remember earlier that we talked about when all the inputs are positive, for example, then we get all of our gradients on the weights to be positive, and we get this basically
suboptimal optimization. And in general even if it’s
not all zero or all negative, any sort of bias will still
cause this type of problem. And so then in terms of
normalizing the data, this is basically you
want to normalize data typically in the machine
learning problems, so that all features
are in the same range, and so that they contribute equally. In practice, since for
images, which is what we’re dealing with in this
course here for the most part, we do do the zero centering, but in practice we
don’t actually normalize the pixel value so much,
because generally for images right at each location you already have relatively comparable
scale and distribution, and so we don’t really
need to normalize so much, compared to more general
machine learning problems, where you might have different features that are very different and
of very different scales. And in machine learning, you might also see a
more complicated things, like PCA or whitening,
but again with images, we typically just stick
with the zero mean, and we don’t do the normalization, and we also don’t do some of these more complicated pre-processing. And one reason for this
is generally with images we don’t really want to
take all of our input, let’s say pixel values and project this onto a lower dimensional space of new kinds of features
that we’re dealing with. We typically just want to apply convolutional networks spatially and have our spatial structure
over the original image. Yeah, question. [student speaking off mic] So the question is we
do this pre-processing in a training phase, do we also do the same kind
of thing in the test phase, and the answer is yes. So, let me just move
to the next slide here. So, in general on the training phase is where we determine our let’s say, mean, and then we apply this exact
same mean to the test data. So, we’ll normalize by the same empirical mean from the training data. Okay, so to summarize
basically for images, we typically just do the
zero mean pre-processing and we can subtract either
the entire mean image. So, from the training data, you compute the mean image, which will be the same size
as your, as each image. So, for example 32 by 32 by three, you’ll get this array of numbers, and then you subtract that from each image that you’re about to
pass through the network, and you’ll do the same thing at test time for this array that you
determined at training time. In practice, we can
also for some networks, we also do this by just of subtracting a per-channel mean, and so instead of having
an entire mean image that were going to zero-center by, we just take the mean by channel, and this is just because it turns out that it was similar enough
across the whole image, it didn’t make such a big difference to subtract the mean image versus just a per-channel value. And this is easier to just
pass around and deal with. So, you’ll see this as well
for example, in a VGG Network, which is a network that
came after AlexNet, and we’ll talk about that later. Question. [student speaking off mic] Okay, so there are two questions. The first is what’s a
channel, in this case, when we are subtracting
a per-channel mean? And this is RGB, so our array, our images are typically for
example, 32 by 32 by three. So, width, height, each are 32, and our depth, we have three channels RGB, and so we’ll have one
mean for the red channel, one mean for a green, one for blue. And then the second, what
was your second question? [student speaking off mic] Oh. Okay, so the question is when we’re subtracting the mean image, what is the mean taken over? And the mean is taking over
all of your training images. So, you’ll take all of
your training images and just compute the mean of all of those. Does that make sense? [student speaking off mic] Yeah the question is, we do this for the entire training set, once before we start training. We don’t do this per batch, and yeah, that’s exactly correct. So we just want to have a good sample, an empirical mean that we have. And so if you take it per batch, if you’re sampling reasonable batches, it should be basically, you should be getting the same values anyways for the mean, and so it’s more efficient and easier just do this once at the beginning. You might not even have to really take it over the entire training data. You could also just sample
enough training images to get a good estimate of your mean. Okay, so any other questions
about data preprocessing? Yes. [student speaking off mic] So, the question is does
the data preprocessing solve the sigmoid problem? So the data preprocessing
is doing zero mean right? And we talked about how sigmoid, we want to have zero mean. And so it does solve
this for the first layer that we pass it through. So, now our inputs to the first layer of our network is going to be zero mean, but we’ll see later on that
we’re actually going to have this problem come up in
much worse and greater form, as we have deep networks. You’re going to get a lot of nonzero mean problems later on. And so in this case, this is
not going to be sufficient. So this only helps at the
first layer of your network. Okay, so now let’s talk
about how do we want to initialize the weights of our network? So, we have let’s say our standard two layer neural network and we have all of these
weights that we want to learn, but we have to start them
with some value, right? And then we’re going to update them using our gradient updates from there. So first question. What happens when we use an initialization of W equals zero? We just set all of the
parameters to be zero. What’s the problem with this? [student speaking off mic] So sorry, say that again. So I heard all the neurons
are going to be dead. No updates ever. So not exactly. So, part of that is correct in that all the neurons will do the same thing. So, they might not all be dead. Depending on your input value, I mean, you could be in any
regime of your neurons, so they might not be dead, but the key thing is that they
will all do the same thing. So, since your weights are zero, given an input, every
neuron is going to be, have the same operation
basically on top of your inputs. And so, since they’re all
going to output the same thing, they’re also all going
to get the same gradient. And so, because of that, they’re all going to
update in the same way. And now you’re just going to get all neurons that are exactly the same, which is not what you want. You want the neurons to
learn different things. And so, that’s the problem when you initialize everything equally and there’s basically no
symmetry breaking here. So, what’s the first, yeah question? [student speaking off mic] So the question is, because that, because the gradient
also depends on our loss, won’t one backprop differently
compared to the other? So in the last layer, like yes, you do have basically some of this, the gradients will get the same, sorry, will get different
loss for each specific neuron based on which class it was connected to, but if you look at all the neurons generally throughout your
network, like you’re going to, you basically have a lot of these neurons that are connected in
exactly the same way. They had the same updates and it’s basically
going to be the problem. Okay, so the first idea that we can have to try and improve upon this is to set all of the weights
to be small random numbers that we can sample from a distribution. So, in this case, we’re
going to sample from basically a standard gaussian, but we’re going to scale it so that the standard deviation is actually one E negative two, 0.01. And so, just give this
many small random weights. And so, this does work
okay for small networks, now we’ve broken the symmetry, but there’s going to be
problems with deeper networks. And so, let’s take a look
at why this is the case. So, here this is basically
an experiment that we can do where let’s take a deeper network. So in this case, let’s initialize
a 10 layer neural network to have 500 neurons in
each of these 10 layers. Okay, we’ll use tanh
nonlinearities in this case and we’ll initialize it
with small random numbers as we described in the last slide. So here, we’re going to basically just initialize this network. We have random data
that we’re going to take, and now let’s just pass it
through the entire network, and at each layer, look at the statistics of the activations that
come out of that layer. And so, what we’ll see this is probably a little bit hard to read up top, but if we compute the mean and the standard deviations at each layer, well see that at the first layer this is, the means are always around zero. There’s a funny sound in here. Interesting, okay well that was fixed. So, if we look at, if we look
at the outputs from here, the mean is always
going to be around zero, which makes sense. So, if we look here, let’s see, if we take this, we looked at
the dot product of X with W, and then we took the tanh on linearity, and then we store these values and so, because it tanh is centered around zero, this will make sense, and then the standard
deviation however shrinks, and it quickly collapses to zero. So, if we’re plotting this, here this second row of plots here is showing the mean
and standard deviations over time per layer
and then in the bottom, the sequence of plots is
showing for each of our layers. What’s the distribution of
the activations that we have? And so, we can see that
at the first layer, we still have a reasonable
gaussian looking thing. It’s a nice distribution. But the problem is that as we multiply by this W, these small numbers at each layer, this quickly shrinks and
collapses all of these values, as we multiply this over and over again. And so, by the end, we
get all of these zeros, which is not what we want. So we get all the activations become zero. And so now let’s think
about the backwards pass. So, if we do a backward pass, now assuming this was our forward pass and now we want to compute our gradients. So first, what does the gradients look like on the weights? Does anyone have a guess? So, if we think about this, we have our input values are very small at each layer right, because they’ve all
collapsed at this near zero, and then now each layer, we have our upstream
gradient flowing down, and then in order to get
the gradient on the weights remember it’s our upstream gradient
times our local gradient, which for this this dot
product were doing W times X. It’s just basically going to
be X, which is our inputs. So, it’s again a similar kind of problem that we saw earlier, where now since, so
here because X is small, our weights are getting
a very small gradient, and they’re basically not updating. So, this is a way that you can
basically try and think about the effect of gradient
flows through your networks. You can always think about
what the forward pass is doing, and then think about what’s happening as you have gradient flows coming down, and different types of inputs, what the effect of this
actually is on our weights and the gradients on them. And so also, if now if we think about what’s the gradient that’s
going to be flowing back from each layer as we’re
chaining all these gradients. Alright, so this is going to be the flip thing where we have now the gradient flowing back
is our upstream gradient times in this case the local
gradient is W on our input X. And so again, because
this is the dot product, and so now, actually going
backwards at each layer, we’re basically doing a multiplication of the upstream gradient by our weights in order to get the next
gradient flowing downwards. And so because here, we’re multiplying by
W over and over again. You’re getting basically
the same phenomenon as we had in the forward pass where everything is getting
smaller and smaller. And now the gradient, upstream gradients are collapsing to zero as well. Question? [student speaking off mic] Yes, I guess upstream and downstream is, can be interpreted differently, depending on if you’re
going forward and backward, but in this case we’re going, we’re doing, we’re going backwards, right? We’re doing back propagation. And so upstream is the gradient flowing, you can think of a flow from your loss, all the way back to your input. And so upstream is what came from what you’ve already done, flowing you know, down
into your current node. Right, so we’re for flowing downwards, and what we get coming into
the node through backprop is coming from upstream. Okay, so now let’s think
about what happens when, you know we saw that this was a problem when our weights were pretty small, right? So, we can think about well, what if we just try and solve this by making our weights big? So, let’s sample from
this standard gaussian, now with standard deviation one instead of 0.01. So what’s the problem here? Does anyone have a guess? If our weights are now all
big, and we’re passing them, and we’re taking these
outputs of W times X, and passing them through
tanh nonlinearities, remember we were talking
about what happens at different values of inputs to tanh, so what’s the problem? Okay, so yeah I heard that
it’s going to be saturated, so that’s right. Basically now, because our
weights are going to be big, we’re going to always
be at saturated regimes of either very negative or
very positive of the tanh. And so in practice, what
you’re going to get here is now if we look at the
distribution of the activations at each of the layers here on the bottom, they’re going to be all basically
negative one or plus one. Right, and so this will have the problem that we talked about with the tanh earlier,
when they’re saturated, that all the gradients will be zero, and our weights are not updating. So basically, it’s really hard to get your weight initialization right. When it’s too small they all collapse. When it’s too large they saturate. So, there’s been some work
in trying to figure out well, what’s the proper way
to initialize these weights. And so, one kind of good rule
of thumb that you can use is the Xavier initialization. And so this is from this
paper by Glorot in 2010. And so what this formula is, is if we look at W up here, we can see that we want to
initialize them to these, we sample from our standard gaussian, and then we’re going to scale by the number of inputs that we have. And you can go through the math, and you can see in the lecture notes as well as in this paper of exactly how this works out, but basically the way
we do it is we specify that we want the variance of the input to be the same as a
variance of the output, and then if you derive
what the weight should be you’ll get this formula, and intuitively with this
kind of means is that if you have a small
number of inputs right, then we’re going to divide
by the smaller number and get larger weights,
and we need larger weights, because with small inputs, and you’re multiplying
each of these by weight, you need a larger weights to get the same larger variance at output, and kind of vice versa for
if we have many inputs, then we want smaller
weights in order to get the same spread at the output. So, you can look at the notes
for more details about this. And so basically now, if we
want to have a unit gaussian, right as input to each layer, we can use this kind of initialization to at training time, to be
able to initialize this, so that there is approximately a unit gaussian at each layer. Okay, and so one thing
is does assume though is that it is assumed that
there’s linear activations. and so it assumes that
we are in the activation, in the active region of
the tanh, for example. And so again, you can look at the notes to really try and
understand its derivation, but the problem is that this breaks when now you use something like a ReLU. Right, and so with the
ReLU what happens is that, because it’s killing half of your units, it’s setting approximately half of them to zero at each time, it’s actually halving the
variance that you get out of this. And so now, if you just
make the same assumptions as your derivation
earlier you won’t actually get the right variance coming out, it’s going to be too small. And so what you see is again
this kind of phenomenon, as the distributions starts collapsing. In this case you get more
and more peaked toward zero, and more units deactivated. And the way to address this with something that has been
pointed out in some papers, which is that you can you
can try to account for this with an extra, divided by two. So, now you’re basically
adjusting for the fact that half the neurons get killed. And so you’re kind of equivalent input has actually half this number of input, and so you just add this
divided by two factor in, this works much better, and you can see that the
distributions are pretty good throughout all layers of the network. And so in practice this is
been really important actually, for training these types of little things, to a really pay attention
to how your weights are, make a big difference. And so for example,
you’ll see in some papers that this actually is the
difference between the network even training at all and performing well versus nothing happening. So, proper initialization is still an active area of research. And so if you’re interested in this, you can look at a lot of
these papers and resources. A good general rule of thumb is basically use the Xavier
Initialization to start with, and then you can also think about some of these other kinds of methods. And so now we’re going to talk
about a related idea to this, so this idea of wanting
to keep activations in a gaussian range that we want. Right, and so this idea behind what we’re going to call
batch normalization is, okay we want unit gaussian activations. Let’s just make them that way. Let’s just force them to be that way. And so how does this work? So, let’s consider a batch
of activations at some layer. And so now we have all of
our activations coming out. If we want to make this unit gaussian, we actually can just do
this empirically, right. We can take the mean of the
batch that we have so far of the current batch, and we
can just and the variance, and we can just normalize by this. Right, and so basically, instead of with weight initialization, we’re setting this at
the start of training so that we try and get it into a good spot that we can have unit
gaussians at every layer, and hopefully during training
this will preserve this. Now we’re going to
explicitly make that happen on every forward pass through the network. We’re going to make this
happen functionally, and basically by normalizing by the mean and the variance of each neuron, we look at all of the
inputs coming into it and calculate the mean and
variance for that batch and normalize it by it. And the thing is that this is a, this is just a differentiable
function right? If we have our mean and
our variance as constants, this is just a sequence of
computational operations that we can differentiate and
do back prop through this. Okay, so just as I was
saying earlier right, if we look at our input
data, and we think of this as we have N training examples
in our current batch, and then each batch has dimension D, we’re going to the
compute the empirical mean and variance independently
for each dimension, so each basically feature element, and we compute this across our batch, our current mini-batch that we have and we normalize by this. And so this is usually
inserted after fully connected or convolutional layers. We saw that would we were multiplying by W in these layers, which
we do over and over again, then we can have this bad
scaling effect with each one. And so this basically is
able to undo this effect. Right, and since we’re basically
just scaling by the inputs connected to each neuron, each activation, we can apply this the same
way to fully connected convolutional layers, and
the only difference is that, with convolutional layers,
we want to normalize not just across all the training examples, and independently for each
each feature dimension, but we actually want to normalize jointly across both all the feature dimensions, all the spatial locations that we have in our activation map, as well as all of the training examples. And we do this, because we want to obey
the convolutional property, and we want nearby locations to be normalized the same way, right? And so with a convolutional layer, we’re basically going to have a one mean and one standard deviation, per activation map that that we have, and we’re going to normalize by this across all of the examples in the batch. And so this is something that you guys are going to implement
in your next homework. And so, all of these details
are explained very clearly in this paper from 2015. And so on this is a very useful, useful technique that you
want to use a lot in practice. You want to have these
batch normalization layers. And so you should read this paper. Go through all of the derivations, and then also go through the derivations of how to compute the
gradients with given these, this normalization operation. Okay, so one thing that I just
want to point out is that, it’s not clear that, you know, we’re doing this batch normalization after every fully connected layer, and it’s not clear that
we necessarily want a unit gaussian input to
these tanh nonlinearities, because what this is doing
is this is constraining you to the linear regime of this nonlinearity, and we’re not actually, you’re
trying to basically say, let’s not have any of this saturation, but maybe a little bit
of this is good, right? You you want to be able to control what’s, how much saturation that you want to have. And so what, the way that we address this when we’re doing batch normalization is that we have our normalization operation, but then after that we
have this additional squashing and scaling operation. So, we do our normalization. Then we’re going to scale
by some constant gamma, and then shift by another factor of beta. Right, and so what this
actually does is that this allows you to be able to
recover the identity function if you wanted to. So, if the network wanted to, it could learn your scaling factor gamma to be just your variance. It could learn your beta to be your mean, and in this case you can
recover the identity mapping, as if you didn’t have batch normalization. And so now you have the flexibility of doing kind of everything in between and making your the network learning how to make your tanh
more or less saturated, and how much to do so in order to have, to have good training. Okay, so just to sort of summarize the batch normalization idea. Right, so given our inputs, we’re going to compute
our mini-batch mean. So, we do this for every
mini-batch that’s coming in. We compute our variance. We normalize by the mean and variance, and we have this additional
scaling and shifting factor. And so this improves gradient
flow through the network. it’s also more robust as a result. It works for more range of learning rates, and different kinds of initialization, so people have seen that once you put batch normalization in, and it’s just easier to train, and so that’s why you should do this. And then also when one thing
that I just want to point out is that you can also
think of this as in a way also doing some regularization. Right and so, because now
at the output of each layer, each of these activations,
each of these outputs, is an output of both your input X, as well as the other examples in the batch that it happens to be sampled with, right, because you’re going to
normalize each input data by the empirical mean over that batch. So because of that,
it’s no longer producing deterministic values for
a given training example, and it’s tying all of these
inputs in a batch together. And so this basically, because
it’s no longer deterministic, kind of jitters your
representation of X a little bit, and in a sense, gives some
sort of regularization effect. Yeah, question? [student speaking off camera] The question is gamma and
beta are learned parameters, and yes that’s the case. [student speaking off mic] Yeah, so the question is
why do we want to learn this gamma and beta to be able to learn the identity function back, and the reason is because you want to give it the flexibility. Right, so what batch
normalization is doing, is it’s forcing our data to
become this unit gaussian, our inputs to be unit gaussian, but even though in general
this is a good idea, it’s not always that this is
exactly the best thing to do. And we saw in particular
for something like a tanh, you might want to control some degree of saturation that you have. And so what this does is it
gives you the flexibility of doing this exact like
unit gaussian normalization, if it wants to, but
also learning that maybe in this particular part of the network, maybe that’s not the best thing to do. Maybe we want something
still in this general idea, but slightly different right,
slightly scaled or shifted. And so these parameters just
give it that extra flexibility to learn that if it wants to. And then yeah, if the the best thing to do is just batch normalization
then it’ll learn the right parameters for that. Yeah? [student speaking off mic] Yeah, so basically each neuron output. So, we have output of a
fully connected layer. We have W times X. and so we have the values
of each of these outputs, and then we’re going to apply batch normalization separately
to each of these neurons. Question? [student speaking off mic] Yeah, so the question is that for things like reinforcement learning, you might have a really small batch size. How do you deal with this? So in practice, I guess batch
normalization has been used a lot for like for standard
convolutional neural networks, and there’s actually papers
on how do we want to do normalization for different
kinds of recurrent networks, or you know some of these networks that might also be in
reinforcement learning. And so there’s different considerations that you might want to think of there. And this is still an
active area of research. There’s papers on this and we might also talk about some of this more later, but for a typical
convolutional neural network this generally works fine. And then if you have a smaller batch size, maybe this becomes a
little bit less accurate, but you still get kind of the same effect. And you know it’s possible also that you could design
your mean and variance to be computed maybe over
more examples, right, and I think in practice
usually it’s just okay, so you don’t see this too much, but this is something
that maybe could help if that was a problem. Yeah, question? [student speaking off mic] So the question, so the question is, if we force
the inputs to be gaussian, do we lose the structure? So, no in a sense that
you can think of like, if you had all your features distributed as a gaussian for example, even if you were just
doing data pre-processing, this gaussian is not
losing you any structure. All the, it’s just shifting and scaling your data into a regime, that works well for the operations that you’re going to perform on it. In convolutional layers,
you do have some structure, that you want to preserve
spatially, right. You want, like if you look
at your activation maps, you want them to relatively
all make sense to each other. So, in this case you do want to take that into consideration. And so now, we’re going to normalize, find one mean for the
entire activation map, so we only find the
empirical mean and variance over training examples. And so that’s something that you’ll be doing in your homework, and also explained in the paper as well. So, you should refer to that. Yes. [student speaking off mic] So the question is, are
we normalizing the weight so that they become gaussian. So, if I understand
your question correctly, then the answer is, we’re normalizing the
inputs to each layer, so we’re not changing the
weights in this process. [student speaking off mic] Yeah, so the question is,
once we subtract by the mean and divide by the standard deviation, does this become gaussian,
and the answer is yes. So, if you think about the
operations that are happening, basically you’re shifting
by the mean, right. And so this shift up to be zero-centered, and then you’re scaling
by the standard deviation. This now transforms this
into a unit gaussian. And so if you want to look more into that, I think you can look at, there’s a lot of machine
learning explanations that go into exactly what this, visualizing with this operation is doing, but yeah this basically takes your data and turns it into a gaussian distribution. Okay, so yeah question? [student speaking off mic] Uh-huh. So the question is, if we’re going to be
doing the shift and scale, and learning these is the
batch normalization redundant, because you could recover
the identity mapping? So in the case that the network learns that identity mapping is always the best, and it learns these parameters, the yeah, there would be no
point for batch normalization, but in practice this doesn’t happen. So in practice, we will
learn this gamma and beta. That’s not the same as a identity mapping. So, it will shift and
scale by some amount, but not the amount that’s going to give
you an identity mapping. And so what you get is you still get this batch
normalization effect. Right, so having this
identity mapping there, I’m only putting this here
to say that in the extreme, it could learn the identity mapping, but in practice it doesn’t. Yeah, question. [student speaking off mic] Yeah. [student speaking off mic] Oh, right, right. Yeah, yeah sorry, I was
not clear about this, but yeah I think this is related to the other question earlier, that yeah when we’re doing this we’re actually getting zero
mean and unit gaussian, which put this into a nice shape, but it doesn’t have to
actually be a gaussian. So yeah, I mean ideally, if we’re looking at like
inputs coming in, as you know, sort of approximately gaussian, we would like it to have
this kind of effect, but yeah, in practice
it doesn’t have to be. Okay, so … Okay, so the last thing I just
want to mention about this is that, so at test time, the
batch normalization layer, we now take the empirical mean and variance from the training data. So, we don’t re-compute this at test time. We just estimate this at training time, for example using running averages, and then we’re going to
use this as at test time. So, we’re just going to scale by that. Okay, so now I’m going to move on to babysitting the learning process. Right, so now we’ve defined
our network architecture, and we’ll talk about how
do we monitor training, and how do we adjust
hyperparameters as we go, to get good learning results? So as always, so the
first step we want to do, is we want to pre-process the data. Right, so we want to zero mean the data as we talked about earlier. Then we want to choose the architecture, and so here we are starting
with one hidden layer of 50 neurons, for example, but we’ve basically we
can pick any architecture that we want to start with. And then the first
thing that we want to do is we initialize our network. We do a forward pass through it, and we want to make sure
that our loss is reasonable. So, we talked about this
several lectures ago, where we have a basically a, let’s say we have a Softmax
classifier that we have here. We know what our loss should be, when our weights are small, and we have generally
a diffuse distribution. Then we want it to be, the
Softmax classifier loss is going to be your
negative log likelihood, which if we have 10 classes, it’ll be something like
negative log of one over 10, which here is around 2.3,
and so we want to make sure that our loss is what we expect it to be. So, this is a good
sanity check that we want to always, always do. So, now once we’ve seen that
our original loss is good, now we want to, so first we want to do this
having zero regularization, right. So, when we disable the regularization, now our only loss term is this data loss, which is going to give 2.3 here. And so here, now we want to
crank up the regularization, and when we do that, we want
to see that our loss goes up, because we’ve added this
additional regularization term. So, this is a good next step that you can do for your sanity check. And then, now we can start training. So, now we start trying to train. What we do is, a good way to do this is to start up with a
very small amount of data, because if you have just
a very small training set, you should be able to
over fit this very well and get very good training loss on here. And so in this case we want to turn off our regularization again, and just see if we can make
the loss go down to zero. And so we can see how
our loss is changing, as we have all these epochs. We compute our loss at each epoch, and we want to see this go
all the way down to zero. Right, and here we can see that also our training accuracy is going all the way up to one,
and this makes sense right. If you have a very small number of data, you should be able to
over fit this perfectly. Okay, so now once you’ve done that, these are all sanity checks. Now you can start really trying to train. So, now you can take
your full training data, and now start with a small
amount of regularization, and let’s first figure out
what’s a good learning rate. So, learning rate is one of the most important hyperparameters, and it’s something that
you want to adjust first. So, you want to try some
value of learning rate. and here I’ve tried one E negative six, and you can see that the
loss is barely changing. Right, and so the reason
this is barely changing is usually because your
learning rate is too small. So when it’s too small, your gradient updates are not big enough, and your cost is basically about the same. Okay, so, one thing that I want to point out here, is that we can notice that even though our loss with barely changing, the training and the validation accuracy jumped up to 20% very quickly. And so does anyone have any idea for why this might be the case? Why, so remember we
have a Softmax function, and our loss didn’t really change, but our accuracy improved a lot. Okay, so the reason for this is that here the probabilities
are still pretty diffuse, so our loss term is still pretty similar, but when we shift all
of these probabilities slightly in the right direction, because we’re learning right? Our weights are changing
the right direction. Now the accuracy all of a sudden can jump, because we’re taking the
maximum correct value, and so we’re going to get
a big jump in accuracy, even though our loss is
still relatively diffuse. Okay, so now if we try
another learning rate, now here I’m jumping in the other extreme, picking a very big learning
rate, one E negative six. What’s happening is that our
cost is now giving us NaNs. And, when you have NaNs,
what this usually means is that basically your cost exploded. And so, the reason for
that is typically that your learning rate was too high. So, then you can adjust your
learning rate down again. Here I can see that we’re trying three E to the negative three. The cost is still exploding. So, usually this, the rough
range for learning rates that we want to look at is between one E negative
three, and one E negative five. And, this is the rough
range that we want to be cross-validating in between. So, you want to try out
values in this range, and depending on whether
your loss is too slow, or too small, or whether it’s too large, adjust it based on this. And so how do we exactly
pick these hyperparameters? Do hyperparameter optimization, and pick the best values of
all of these hyperparameters? So, the strategy that we’re going to use is for any hyperparameter
for example learning rate, is to do cross-validation. So, cross-validation is
training on your training set, and then evaluating on a validation set. How well do this hyperparameter do? Something that you guys have already done in your assignment. And so typically we want
to do this in stages. And so, we can do first of course stage, where we pick values
pretty spread out apart, and then we learn for only a few epochs. And with only a few epochs. you can already get a pretty good sense of which hyperparameters, which values are good or not, right. You can quickly see that it’s a NaN, or you can see that nothing is happening, and you can adjust accordingly. So, typically once you do that, then you can see what’s
sort of a pretty good range, and the range that you want to now do finer sampling of values in. And so, this is the second stage, where now you might want to
run this for a longer time, and do a finer search over that region. And one tip for detecting
explosions like NaNs, you can have in your training loop, right sample some hyperparameter, start training, and then look at your cost at every iteration or every epoch. And if you ever get a cost that’s much larger than
your original cost, so for example, something like
three times original cost, then you know that this is not heading in the right direction. Right, it’s getting
very big, very quickly, and you can just break out of your loop, stop this this hyperparameter choice and pick something else. Alright, so an example of this, let’s say here we want to run now course search for five epochs. This is a similar network that we were talking about earlier, and what we can do is
we can see all of these validation accuracy that we’re getting. And I’ve put in, highlighted in red the ones that gives better values. And so these are going to be regions that we’re going to look
into in more detail. And one thing to note is that it’s usually better to
optimize in log space. And so here instead of
sampling, I’d say uniformly between you know one E to
the negative 0.01 and 100, you’re going to actually do
10 to the power of some range. Right, and this is because the learning rate is multiplying
your gradient update. And so it has these
multiplicative effects, and so it makes more sense to consider a range of learning
rates that are multiplied or divided by some value,
rather than uniformly sampled. So, you want to be dealing with orders of some magnitude here. Okay, so once you find that, you can then adjust your range. Right get in this case, we
have a range of you know, maybe of 10 to the negative four, right, to 10 to the zero power. This is a good range that
we want to narrow down into. And so we can do this again, and here we can see that we’re getting a relatively good accuracy of 53%. And so this means we’re
headed in the right direction. The one thing that I
want to point out is that here we actually have a problem. And so the problem is that we can see that our best accuracy here has a learning rate that’s about, you know, all of our good learning rates are in this E to the negative four range. Right, and since the learning
rate that we specified was going from 10 to the
negative four to 10 to the zero, that means that all the
good learning rates, were at the edge of the
range that we were sampling. And so this is bad, because this means that
we might not have explored our space sufficiently, right. We might actually want to go
to 10 to the negative five, or 10 to the negative six. There might be still better ranges if we continue shifting down. So, you want to make sure that your range kind of has the good values
somewhere in the middle, or somewhere where you get
a sense that you’ve hit, you’ve explored your range fully. Okay, and so another thing is that we can sample all of our
different hyperparameters, using a kind of grid search, right. We can sample for a fixed
set of combinations, a fixed set of values
for each hyperparameter. Sample in a grid manner
over all of these values, but in practice it’s
actually better to sample from a random layout,
so sampling random value of each hyperparameter in a range. And so what you’ll get instead is we’ll have these two
hyper parameters here that we want to sample from. You’ll get samples that look
like this right side instead. And the reason for this is
that if a function is really sort of more of a function
of one variable than another, which is usually true. Usually we have little bit more, a lower effective dimensionality
than we actually have. Then you’re going to get many more samples of the important variable that you have. You’re going to be able to see this shape in this green function
that I’ve drawn on top, showing where the good values are, compared to if you just did a grid layout where we were only able to
sample three values here, and you’ve missed where
were the good regions. Right, and so basically
we’ll get much more useful signal overall
since we have more samples of different values of
the important variable. And so, hyperparameters to play with, we’ve talked about learning rate, things like different
types of decay schedules, update types, regularization, also your network architecture, so the number of hidden units, the depth, all of these are hyperparameters
that you can optimize over. And we’ve talked about some of these, but we’ll keep talking about more of these in the next lecture. And so you can think of
this as kind of, you know, if you’re basically tuning
all the knobs right, of some turntable where you’re, you’re a neural networks practitioner. You can think of the music that’s output is the loss function that you want, and you want to adjust
everything appropriately to get the kind of output that you want. Alright, so it’s really kind
of an art that you’re doing. And in practice, you’re going to do a lot of hyperparameter optimization, a lot of cross validation. And so you know, in order to get numbers, people will run cross validation over tons of hyperparameters,
monitor all of them, see which ones are doing better, which ones are doing worse. Here we have all these loss curves. Pick the right ones, readjust, and keep going through this process. And so as I mentioned earlier, as you’re monitoring each
of these loss curves, learning rate is an important one, but you’ll get a sense for
how different learning rates, which learning rates are good and bad. So you’ll see that if you have
a very high exploding one, right, this is your loss explodes, then your learning rate is too high. If it’s too kind of linear and too flat, you’ll see that it’s too low,
it’s not changing enough. And if you get something that looks like there’s a steep
change, but then a plateau, this is also an indicator
of it being maybe too high, because in this case, you’re
taking too large jumps, and you’re not able to settle
well into your local optimum. And so a good learning rate usually ends up looking
something like this, where you have a relatively steep curve, but then it’s continuing to go down, and then you might keep adjusting your learning rate from there. And so this is something that
you’ll see through practice. Okay and just, I think
we’re very close to the end, so just one last thing
that I want to point out is than in case you ever see
learning rate loss curves, where it’s … So if you ever see loss curves
where it’s flat for a while, and then starts training all of a sudden, a potential reason could
be bad initialization. So in this case, your gradients
are not really flowing too well the beginning, so
nothing’s really learning, and then at some point, it just happens to
adjust in the right way, such that it tips over and
things just start training right? And so there’s a lot of
experience at looking at these and see what’s wrong that
you’ll get over time. And so you’ll usually want to monitor and visualize your accuracy. If you have a big gap between
your training accuracy and your validation accuracy, it usually means that you
might have overfitting and you might want to increase your regularization strength. If you have no gap, you might want to increase
your model capacity, because you haven’t overfit yet. You could potentially increase it more. And in general, we also
want to track the updates, the ratio of our weight updates
to our weight magnitudes. We can just take the norm of our parameters that we have to get a sense for how large they are, and when we have our update size, we can also take the norm of that, get a sense for how large that is, and we want this ratio to
be somewhere around 0.001. There’s a lot of variance in this range, so you don’t have to be exactly on this, but it’s just this sense of you don’t want your updates to be too
large compared to your value or too small, right? You don’t want to dominate
or to have no effect. And so this is just
something that can help debug what might be a problem. Okay, so in summary, today we’ve looked at
activation functions, data preprocessing, weight initialization, batch norm, babysitting
the learning process, and hyperparameter optimization. These are the kind of
the takeaways for each that you guys should keep in mind. Use ReLUs, subtract the mean,
use Xavier Initialization, use batch norm, and sample
hyperparameters randomly. And next time we’ll continue to talk about the training neural networks with all these different topics. Thanks.

Leave a Reply

Your email address will not be published. Required fields are marked *