Lecture 10 | Recurrent Neural Networks

– Okay. Can everyone hear me? Okay. Sorry for the delay. I had a bit of technical difficulty. Today was the first time
I was trying to use my new Touch Bar MacBook Pro for presenting, and none of the adapters were working. So, I had to switch
laptops at the last minute. So, thanks. Sorry about that. So, today is lecture 10. We’re talking about
recurrent neural networks. So, as usual, a couple of administrative notes. We’re working hard on assignment one grading. Those grades will probably be out sometime later today. Hopefully, they can get out before the A2 deadline. That’s what I’m hoping for. On a related note, Assignment
two is due today at 11:59 p.m. So, who’s done with that already? About half of you. So, you remember, I did warn you when the assignment went out that it was quite long, and to start early. So, you were warned about that. But, hopefully, you guys
have some late days left. Also, as another reminder, the midterm will be in class on Tuesday. If you kind of look
around the lecture hall, there are not enough seats in this room to seat all the enrolled
students in the class. So, we’ll actually be having the midterm in several other lecture
halls across campus. And we’ll be sending out some more details on exactly where to go in
the next couple of days. So, another bit of an announcement. We’ve been working on this sort of fun extra credit thing for you to play with that we’re calling the training game. This is this cool browser-based experience, where you can go in and interactively train neural networks and tweak the hyperparameters during training. And this should be a
really cool interactive way for you to practice some of these hyperparameter tuning skills that we’ve been talking about
the last couple of lectures. So this is not required, but this, I think, will be
a really useful experience to gain a little bit more intuition into how some of these
hyperparameters work for different types of
data sets in practice. So we’re still working on getting all the bugs worked out of this setup, and we’ll probably send out some more instructions on exactly how this will work in the next couple of days. But again, not required. But please do check it out. I think it’ll be really fun and a really cool thing
for you to play with. And we’ll give you a bit of extra credit if you end up working with this and doing a couple of runs with it. So, we’ll again send out
some more details about this soon once we get all the bugs worked out. As a reminder, last time we were talking
about CNN architectures. We kind of walked through the timeline of some of the various winners of the ImageNet classification challenge. The breakthrough result, as we saw, was the AlexNet architecture in 2012, which was an eight layer convolutional network. It did amazingly well, and it sort of kick-started this whole deep learning
revolution in computer vision, and kind of brought a lot of these models into the mainstream. Then we skipped ahead a couple of years and saw that in the 2014 ImageNet challenge, we had these two really interesting models, VGG and GoogLeNet, which were much deeper. So VGG had 16 and 19 layer models, and GoogLeNet was, I believe, a 22 layer model. Although one thing that
is kind of interesting about these models is that the 2014 ImageNet challenge was right before batch
normalization was invented. So at this time, before the invention
of batch normalization, training these relatively deep models of roughly twenty layers
was very challenging. So, in fact, both of these two models had to resort to a little bit of hackery in order to get their
deep models to converge. So for VGG, they had the
16 and 19 layer models, but actually they first
trained an 11 layer model, because that was what they
could get to converge. And then they added some extra randomly initialized layers in the middle and continued training, to end up training the 16 and 19 layer models. So, managing this training process was very challenging in 2014 before the invention
of batch normalization. Similarly, for GoogLeNet, we saw that GoogLeNet has
these auxiliary classifiers that were stuck into lower
layers of the network. And these were not really
needed for the class to, to get good classification performance. This was just sort of a way to cause extra gradient to be injected directly into the lower
layers of the network. And this sort of, this again was before the
invention of batch normalization and now once you have these networks with batch normalization, then you no longer need
these slightly ugly hacks in order to get these
deeper models to converge. Then we also saw in the
2015 image net challenge was this really cool model called ResNet, these residual networks that now have these shortcut connections that actually have these
little residual blocks where we’re going to take our input, pass it through some convolutional layers, and then add the input of the block to the output of those convolutional layers. This is kind of a funny architecture, but it actually has two really nice properties. One is that if we just set all the weights in this residual block to zero, then this block is computing the identity. So in some way, it’s relatively easy for this model to learn not to use the
layers that it doesn’t need. In addition, it kind of adds this nice interpretation to L2 regularization in the context of these neural networks, because once you put L2 regularization on the weights of your network, that’s going to drive all the parameters towards zero. In a standard convolutional architecture, driving all the weights towards zero maybe doesn’t make sense. But in the context of a residual network, if you drive all the parameters towards zero, that’s kind of encouraging the model to not use layers that it doesn’t need, because it will just drive those residual blocks towards the identity when they’re not needed for classification. The other really useful property
of these residual networks has to do with the gradient
flow in the backward pass. If you remember what happens
at these addition gates in the backward pass, when upstream gradient is coming in through an addition gate, then it will split and fork along these two different paths. So then, when upstream gradient comes in, it’ll take one path through
these convolutional blocks, but it will also have a direct
connection of the gradient through this residual connection. So then, when you imagine stacking many of these residual blocks on top of each other, and our network ends up with potentially hundreds of layers, these residual connections give a sort of gradient superhighway for gradients to flow backward through the entire network. And this allows it to train much easier and much faster, and actually allows
these things to converge reasonably well, even when the model is potentially
hundreds of layers deep. And this idea of managing
gradient flow in your models is actually super important
everywhere in machine learning. And super prevalent in
recurrent networks as well. So we’ll definitely revisit
this idea of gradient flow later in today’s lecture. So then, we kind of also saw
a couple other more exotic, more recent CNN architectures last time, including DenseNet and FractalNet, and once you think about
these architectures in terms of gradient flow, they make a little bit more sense. These things like DenseNet and FractalNet are adding these additional shortcut or identity connections inside the model. And if you think about what happens in the backwards pass for these models, these additional funny topologies are basically providing direct paths for gradients to flow from the loss at the end of the network more easily into all the
different layers of the network. So I think that, again, this idea of managing
gradient flow properly in your CNN architectures is something that we’ve really seen a lot more of in the last couple of years, and we will probably see more moving forward as more exotic architectures are invented. We also saw this kind of nice plot, plotting the performance versus the number of flops versus the number of parameters versus the runtime of these various models. And there’s some
interesting characteristics that you can dive in
and see from this plot. One idea is that VGG and AlexNet have a huge number of parameters, and these parameters actually come almost entirely from the
fully connected layers of the models. So AlexNet has something like
roughly 62 million parameters, and if you look at that
last fully connected layer, you see that the final fully connected layer in AlexNet is going from an activation volume of six by six by 256 into this fully connected vector of size 4096. So if you imagine what the weight matrix needs to look like at that layer, the weight matrix is gigantic. Its number of entries is six times six times 256 times 4096, and if you multiply that out, you see that that single layer has about 38 million parameters. So more than half of the parameters of the entire AlexNet model are just sitting in that
last fully connected layer. And if you add up all the parameters in just the fully connected
layers of AlexNet, including these other fully connected layers, you see that something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers.
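As a quick sanity check on those numbers, here is a rough back-of-the-envelope count in Python, assuming the standard AlexNet layer sizes and counting weights only (biases are ignored):

```python
# Rough parameter count for AlexNet's fully connected layers (weights only, no biases).
fc6 = 6 * 6 * 256 * 4096    # first FC layer: 6x6x256 activation volume -> 4096, about 37.7M weights
fc7 = 4096 * 4096           # second FC layer, about 16.8M weights
fc8 = 4096 * 1000           # final classifier over the 1000 ImageNet classes, about 4.1M weights
print(fc6, fc6 + fc7 + fc8) # roughly 37.7 million and 58.6 million, matching the figures above
```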
So then, when we move to other architectures like GoogLeNet and ResNet, they do away with a lot of these large fully connected layers in favor of global average pooling at the end of the network. And this allows these
networks to really cut down the parameter count in these architectures. So that was kind of our brief recap of the CNN architectures
that we saw last lecture, and then today, we’re going to move to one of my favorite topics to talk about, which is recurrent neural networks. So, so far in this class, we’ve seen, what I like to think of as kind of a vanilla feed forward network, all of our network
architectures have this flavor, where we receive some input and that input is a fixed size object, like an image or vector. That input is fed through
some set of hidden layers and produces a single output, like a set of classification scores over a set of categories. But in some contexts in machine learning, we want to have more flexibility in the types of data that
our models can process. So once we move to this idea
of recurrent neural networks, we have a lot more opportunities to play around with the types
of input and output data that our networks can handle. So once we have recurrent neural networks, we can do what we call
these one to many models, where maybe our input is
some object of fixed size, like an image, but now our output is a
sequence of variable length, such as a caption. Where different captions might have different numbers of words, so our output needs to
be variable in length. We also might have many to one models, where our input could be variably sized. This might be something
like a piece of text, and we want to say what is
the sentiment of that text, whether it’s positive or
negative in sentiment. Or in a computer vision context, you might imagine taking as input a video, and that video might have a
variable number of frames. And now we want to read this entire video of potentially variable length. And then at the end, make a classification decision about maybe what kind
of activity or action is going on in that video. We might also have problems where we want both the input and the output to be variable in length. We might see something like this in machine translation, where our input is some,
maybe, sentence in English, which could have a variable length, and our output is maybe
some sentence in French, which also could have a variable length. And crucially, the length
of the English sentence might be different from the
length of the French sentence. So we need some models
that have the capacity to accept both variable length sequences on the input and on the output. Finally, we might also
consider problems where our input is variable in length, like a video sequence with a variable number of frames, and now we want to make a decision for each element of that input sequence. So in the context of videos, that might be making some
classification decision along every frame of the video. And recurrent neural networks are this kind of general paradigm for handling variable sized sequence data that allow us to pretty naturally capture all of these different types
of setups in our models. Even for some problems that have a fixed size input and a fixed size output, recurrent neural networks can still be pretty useful. So in this example, we might want to do sequential processing of our input. So here, we’re receiving
a fixed size input like an image, and we want to make a
classification decision about, like, what number is
being shown in this image? But now, rather than just doing
a single feed forward pass and making the decision all at once, this network is actually
looking around the image and taking various glimpses of
different parts of the image. And then after making
some series of glimpses, then it makes its final decision as to what kind of number is present. So here, even though our input was an image and our output was a classification decision, even in this context, this idea of being able to handle variable length processing
with recurrent neural networks can lead to some really
interesting types of models. There’s a really cool paper that I like that applied this same type of idea to generating new images. Where now, we want the model
to synthesize brand new images that look kind of like the
images it saw in training, and we can use a recurrent
neural network architecture to actually paint these output images sort of one piece at a time in the output. You can see that, even though our output
is this fixed size image, we can have these models
that are working over time to compute parts of the output
one at a time sequentially. And we can use recurrent neural networks for that type of setup as well. So after this sort of cool pitch about all these cool
things that RNNs can do, you might wonder, like what
exactly are these things? So in general, a recurrent neural network has this little recurrent core cell, and it will take some input x, feed that input into the RNN, and that RNN has some
internal hidden state, and that internal hidden
state will be updated every time that the RNN reads a new input. And that internal hidden state will then be fed back into the model the next time it reads an input. And frequently, we will want our RNNs to also produce some
output at every time step, so we’ll have this pattern
where it will read an input, update its hidden state, and then produce an output. So then the question is what is the functional form
of this recurrence relation that we’re computing? So inside this little green RNN block, we’re computing some recurrence relation, with a function f. So this function f will
depend on some weights, W. It will accept the previous hidden state, h_{t-1}, as well as the input at the current time step, x_t, and this will output the next hidden state, or the updated hidden state, that we call h_t. And then, as we read the next input, this new hidden state, h_t, will just be passed into the same function as we read the next input, x_{t+1}. And now, if we wanted
to produce some output at every time step of this network, we might attach some additional
fully connected layers that read in this h_t at every time step, and make that decision based on the hidden state at every time step. And one thing to note is that we use the same function, f_W, and the same weights, W, at every time step of the computation. So then, kind of the simplest functional form that you can imagine is what we call this vanilla
recurrent neural network. So here, we have this same functional form from the previous slide, where we’re taking in
our previous hidden state and our current input and we need to produce
the next hidden state. And the kind of simplest
thing you might imagine is that we have some weight matrix, W_xh, that we multiply against the input, x_t, as well as another weight matrix, W_hh, that we multiply against the previous hidden state. So we make these two multiplications, add the results together, and squash them through a tanh, so we get some kind of non
linearity in the system. You might be wondering
why we use a tanh here and not some other type of non-linearity, after all the negative things we’ve said about tanh in previous lectures, and I think we’ll return a little bit to that later on when we talk about more advanced architectures, like the LSTM. In addition, in this architecture, if we wanted to produce some y_t at every time step, you might have another weight matrix, W_hy, that accepts this hidden state and transforms it into some y, to produce maybe some class score predictions at every time step.
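Just to make that concrete, here is a minimal numpy sketch of a single step of this vanilla RNN. The function and variable names here (rnn_step, Wxh, Whh, Why) are just illustrative, not code from the lecture:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    """One step of a vanilla RNN: update the hidden state, then produce an output."""
    h = np.tanh(Wxh @ x + Whh @ h_prev)  # combine current input and previous hidden state
    y = Why @ h                          # e.g. unnormalized class scores at this time step
    return h, y
```

The same weights Wxh, Whh, Why would be reused at every time step, which is exactly the weight sharing discussed below.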
When I think about recurrent neural networks, you can kind of think of them in two ways. One is this concept of having a hidden state that feeds back into itself, recurrently. But I find that picture
a little bit confusing. And sometimes, I find it clearer to think about unrolling
this computational graph for multiple time steps. And this makes the data
flow of the hidden states and the inputs and the
outputs and the weights maybe a little bit more clear. So then at the first time step, we’ll have some initial
hidden state, h_0. This is usually initialized to zeros in most contexts, and then we’ll have some first input, x_1. This initial hidden state, h_0, and our first input, x_1, will go into our f_W function, and this will produce our next hidden state, h_1. And then we’ll repeat this process when we receive the next input: our h_1 and the next input, x_2, will go into that same f_W to produce the next hidden state, h_2. And this process will repeat over and over again as we consume all of the inputs, x_t, in our sequence of inputs. And now, one thing to note is that we can actually make
this even more explicit and write the W matrix in our computational graph. And here you can see that we’re re-using the same W matrix at every time step of the computation. So now, every time that we have this little f_W block, it’s receiving a unique h and a unique x, but all of these blocks are taking the same W. And if you remember, we talked about how gradient
flows in back propagation: when you re-use the same node multiple times in a computational graph, then during the backward pass, you end up summing the gradients into the W matrix when you’re computing dLoss/dW. So, if you think about the back propagation for this model, you’ll have a separate gradient for W flowing from each of those time steps, and the final gradient for W will be the sum of all of those individual per-time-step gradients. We can also write this y_t explicitly in this computational graph. So then, this output,
h t, at every time step might feed into some other
little neural network that can produce a y t, which might be some class
scores, or something like that, at every time step. We can also make the loss more explicit. So in many cases, you
might imagine producing, you might imagine that you
have some ground truth label at every time step of your sequence, and then you’ll compute some loss, some individual loss, at every time step of these outputs, y t’s. And this loss might, it will frequently be
something like soft max loss, in the case where you have, maybe, a ground truth label at every
time step of the sequence. And now the final loss for the entire, for this entire training stop, will be the sum of
these individual losses. So now, we had a scaler
loss at every time step? And we just summed them
up to get our final scaler loss at the top of the network. And now, if you think about, again, back propagation
through this thing, we need, in order to train the model, we need to compute the gradient of the loss with respect to w. So, we’ll have loss flowing
from that final loss into each of these time steps. And then each of those time steps will compute a local
gradient on the weights, w, which will all then be
summed to give us our final gradient for the weights, w. Now if we have a, sort of,
this many to one situation, where maybe we want to do
something like sentiment analysis, then we would typically make that decision based on the final hidden
state of this network. Because this final hidden state kind of summarizes all of the context from the entire sequence. Also, if we have a kind of
a one to many situation, where we want to receive a fix sized input and then produce a variably sized output. Then you’ll commonly use
that fixed size input to initialize, somehow, the initial hidden state of the model, and now the recurrent network will tick for each cell in the output. And now, as you produce
your variably sized output, you’ll unroll the graph for
each element in the output. So this, when we talk about
the sequence to sequence models where you might do something
like machine translation, where you take a variably sized input and a variably sized output. You can think of this as a combination of the many to one, plus a one to many. So, we’ll kind of proceed in two stages, what we call an encoder and a decoder. So if you’re the encoder, we’ll receive the variably sized input, which might be your sentence in English, and then summarize that entire sentence using the final hidden state
of the encoder network. And now we’re in this
many to one situation where we’ve summarized this
entire variably sized input in this single vector, and now, we have a second decoder network, which is a one to many situation, which will input that single vector summarizing the input sentence and now produce this
variably sized output, which might be your sentence
in another language. And now in this variably sized output, we might make some predictions
at every time step, maybe about what word to use. And you can imagine kind of
training this entire thing by unrolling this computational graph summing the losses at the output sequence and just performing back
propagation, as usual. So as a bit of a concrete example, one thing that we frequently use recurrent neural networks for is this problem called language modeling. In the language modeling problem, we want to have our network, sort of, understand how to produce natural language. This might
happen at the character level where our model will produce
characters one at a time. This might also happen at the word level where our model will
produce words one at a time. But in a very simple example, you can imagine this
character level language model where we want, where the network will read
some sequence of characters and then it needs to predict, what will the next character
be in this stream of text? So in this example, we have this very small
vocabulary of four letters, h, e, l, and o, and we have
this example training sequence of the word hello, h, e, l, l, o. So during training, when we’re training this language model, we will feed the characters
of this training sequence as inputs; these will be the x_t's that we feed in as the inputs to our recurrent neural network. Each of these inputs is a letter, and we need to figure out a way to represent letters in our network. So what we’ll typically do is figure out what our total vocabulary is. In this case, our vocabulary
has four elements. And each letter will be
represented by a vector that has zeros in every slot but one, and a one for the slot in the vocabulary corresponding to that letter. In this little example, since our vocab has the
four letters, h, e, l, o, then our input sequence, the h is represented by
a four element vector with a one in the first slot and zeros in the other three slots. And we use the same sort of pattern to represent all the different letters in the input sequence. Now, during the forward pass of this network, at the first time step, it will receive the input letter h. That will go into the first RNN cell, and then we’ll produce this output, y_t, which is the network making predictions, for each letter in the vocabulary, about which letter it thinks is most likely going to come next. In this example, the correct output letter was e, because our training sequence was hello, but the model is actually predicting, I think, o as the most likely letter. So in this case, this prediction was wrong, and we would use softmax loss to quantify our unhappiness
with these predictions. The next time step, we would feed in the second letter in the training sequence, e, and this process will repeat. We’ll now represent e as a vector. Use that input vector together with the previous hidden state to produce a new hidden state and now use the second hidden state to, again, make predictions over every letter in the vocabulary. In this case, because our
training sequence was hello, after the letter e, we want our model to predict l. In this case, our model may have very low predictions for the letter l, so we
would incur high loss. And you kind of repeat
this process over and over, and if you train this model
with many different sequences, then eventually it should learn how to predict the next
character in a sequence based on the context of
all the previous characters that it’s seen before. And now, if you think about
what happens at test time, after we train this model, one thing that we might want to do with it is a sample from the model, and actually use this
trained neural network model to synthesize new text that kind of looks similar in spirit to the text that it was trained on. The way that this will work is we’ll typically see the model with some input prefix of text. In this case, the prefix is
just the single letter h, and now we’ll feed that letter h through the first time step of
our recurrent neural network. It will produce this distribution of scores over all the characters in the vocabulary. Now, at test time, we’ll use these scores to actually sample from it. So we’ll use a softmax function to convert those scores into
a probability distribution and then we will sample from
that probability distribution to actually synthesize the
second letter in the sequence. And in this case, even though
the scores were pretty bad, maybe we got lucky and
sampled the letter e from this probability distribution. And now, we’ll take this letter e that was sampled from this distribution and feed it back as input into the network at the next time step. Now, we’ll take this e,
pull it down from the top, feed it back into the network as one of these, sort of, one
hot vector representations, and then repeat the process in order to synthesize the next letter in the output. And we can repeat this process over and over again to synthesize a new sequence using this trained model, where we’re synthesizing the sequence one character at a time using the predicted probability distribution at each time step.
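As a rough sketch of what that test-time sampling loop might look like, assuming the hypothetical rnn_step from earlier plus an ix_to_char mapping from vocabulary indices back to characters (both just illustrative assumptions):

```python
def sample(h, seed_ix, n, Wxh, Whh, Why, ix_to_char):
    """Sample n characters from the trained model, starting from the character with index seed_ix."""
    vocab_size = Why.shape[0]
    x = np.zeros(vocab_size)
    x[seed_ix] = 1                                   # one-hot encode the seed character
    out = []
    for _ in range(n):
        h, scores = rnn_step(x, h, Wxh, Whh, Why)
        p = np.exp(scores) / np.sum(np.exp(scores))  # softmax: scores -> probability distribution
        ix = np.random.choice(vocab_size, p=p)       # sample (rather than argmax) for diversity
        x = np.zeros(vocab_size)
        x[ix] = 1                                    # feed the sample back in as a one-hot vector
        out.append(ix_to_char[ix])
    return "".join(out)
```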
Question? Yeah, that’s a great question. So the question is, why might we sample instead of just taking the character with the largest score? In this case, because of the probability
distribution that we had, it was impossible to
get the right character, so we had to sample so that the example could work out and make sense. But in practice,
sometimes you’ll see both. So sometimes you’ll just
take the argmax probability, and that will sometimes be
a little bit more stable, but one advantage of sampling, in general, is that it lets you get
diversity from your models. Sometimes you might have the same input, maybe the same prefix, or in the case of image captioning, maybe the same image. But then if you sample rather
than taking the argmax, then you’ll see that
sometimes these trained models are actually able to produce
multiple different types of reasonable output sequences, depending on the kind, depending on which samples they take at the first time steps. It’s actually kind of a benefit cause we can get now more
diversity in our outputs. Another question? Could we feed in the softmax vector instead of the one element vector? You mean at test time? Yeah yeah, so the
question is, at test time, could we feed in this whole softmax vector rather than a one hot vector? There’s kind of two problems with that. One is that that’s very different from the data that it
saw at training time. In general, if you ask your model to do something at test time, which is different from training time, then it’ll usually blow up. It’ll usually give you garbage and you’ll usually be sad. The other problem is that in practice, our vocabularies might be very large. So maybe, in this simple example, our vocabulary is only four elements, so it’s not a big problem. But if you’re thinking about
generating words one at a time, now your vocabulary is every
word in the English language, which could be something like
tens of thousands of elements. So in practice, this first operation that’s taking in this one hot vector is often performed using sparse vector operations rather than dense vectors. It would be, sort of, computationally really bad if you wanted to feed in this whole 10,000 element softmax vector. So that’s usually why we
use a one hot instead, even at test time. This idea that we have a sequence and we produce an output at
every time step of the sequence and then finally compute some loss, this is sometimes called
backpropagation through time because you’re imagining
that in the forward pass, you’re kind of stepping
forward through time and then during the backward pass, you’re sort of going
backwards through time to compute all your gradients. This can actually be kind of problematic if you want to train on sequences
that are very, very long. So if you imagine that we
were kind of trying to train a neural network language model on maybe the entire text of Wikipedia, which is, by the way, something that people
do pretty frequently, this would be super slow, and every time we made a gradient step, we would have to make a forward pass through the entire text
of all of Wikipedia, and then make a backward pass through all of Wikipedia, and then make a single gradient update. And that would be super slow. Your model would never converge. It would also take a
ridiculous amount of memory so this would be just really bad. In practice, what people
do is this, sort of, approximation called truncated
backpropagation through time. Here, the idea is that, even though our input
sequence is very, very long, and even potentially infinite, what we’ll do when we’re training the model is step forward for some number of steps. Maybe a hundred is kind of a ballpark number that people frequently use, so we’ll step forward for maybe a hundred steps, compute a loss only over this sub-sequence of the data, then backpropagate through this sub-sequence and make a gradient step. Now, when we repeat, we still have these hidden states that we computed from the first batch, and now, when we compute
this next batch of data, we will carry those hidden
states forward in time, so the forward pass will
be exactly the same. But now when we compute a gradient step for this next batch of data, we will only backpropagate
again through this second batch. Now, we’ll make a gradient step based on this truncated
backpropagation through time. This process will continue, where now when we make the next batch, we’ll again copy these
hidden states forward, but then step forward and then step backward, but only for some small number of time steps.
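A structural sketch of that training loop might look something like this. Everything here, including data, params, hidden_size, learning_rate, and the rnn_forward_backward helper, is hypothetical; only the chunking pattern is the point:

```python
seq_length = 100                      # unroll length for truncated backprop, a common ballpark
h = np.zeros(hidden_size)             # hidden state carried forward across chunks
for start in range(0, len(data) - seq_length, seq_length):
    inputs  = data[start : start + seq_length]
    targets = data[start + 1 : start + seq_length + 1]   # next-character targets
    # forward and backward only within this chunk; gradients do not flow across chunk boundaries
    loss, grads, h = rnn_forward_backward(inputs, targets, h, params)
    for p, g in zip(params, grads):
        p -= learning_rate * g        # plain SGD step on the shared weights
```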
You can kind of think of this as being analogous to stochastic gradient descent in the case of sequences. Remember, when we talked
about training our models on large data sets, it would be super expensive to compute the gradients over every element in the data set. So instead, we take small mini batches and use mini batches of data to compute gradient steps, like in the image classification case. Question? The question is, is this kind of making
the Markov assumption? No, not really, because we’re carrying this hidden state forward in time forever. It is making a Markovian assumption in the sense that, conditioned on the hidden state, the hidden state is all that we need to predict the entire future of the sequence. But that assumption is kind of built into the recurrent neural network formula from the start, and it’s not really particular to back propagation through time. Truncated back propagation through time is just a way to approximate these gradients without making a backwards pass through your potentially
very large sequence of data. This all sounds very
complicated and confusing, and it sounds like a lot of code to write, but in fact, this can actually be pretty concise. Andrej has this example of what he calls min-char-rnn, which does all of this stuff in just about 112 lines of Python. It handles building the vocabulary. It trains the model with truncated back
propagation through time. And then, it can actually
sample from that model in actually not too much code. So even though this sounds like kind of a big, scary process, it’s actually not too difficult. I’d encourage you, if you’re confused, to maybe go check this out and step through the
code on your own time, and see, kind of, all
of these concrete steps happening in code. So this is all in just a single file, all using numpy with no dependencies. This was relatively easy to read. So then, once we have this idea of training a recurrent
neural network language model, we can actually have a
lot of fun with this. And we can take in, sort
of, any text that we want. Take in, like, whatever
random text you can think of from the internet, train our recurrent neural
network language model on this text, and then generate new text. So in this example, we
took this entire text of all of Shakespeare’s works, and then used that to train a recurrent neural network language model on all of Shakespeare. And you can see that the
beginning of training, it’s kind of producing maybe
random gibberish garbage, but throughout the course of training, it ends up producing things
that seem relatively reasonable. And after you’ve, after this model has
been trained pretty well, then it produces text that seems, kind of, Shakespeare-esque to me. “Why do what that day,” replied, whatever, right, you can read this. Like, it kind of looks
kind of like Shakespeare. And if you actually train
this model even more, and let it converge even further, and then sample these
even longer sequences, you can see that it learns
all kinds of crazy cool stuff that really looks like a Shakespeare play. It knows that it uses,
maybe, these headings to say who’s speaking. Then it produces these bits of text that have crazy dialogue that sounds kind of Shakespeare-esque. It knows to put line breaks in between these different things. And this is all, like, really cool, all just sort of learned from
the structure of the data. We can actually get
even crazier than this. This was one of my favorite examples. I found online, there’s this. Is anyone a mathematician in this room? Has anyone taken an algebraic
topology course by any chance? Wow, a couple, that’s impressive. So you probably know more
algebraic topology than me, but I found this open source algebraic topology textbook online. It’s just a whole bunch of tech files that are like this
super dense mathematics. And LaTeX is sort of this thing that lets you write equations and diagrams and everything just using plain text. We can actually train our recurrent neural network language model on the raw LaTeX source code of this algebraic topology textbook. And if we do that, then after
we sample from the model, then we get something that seems like, kind of like algebraic topology. So it knows to like put equations. It puts all kinds of crazy stuff. It’s like, to prove study, we see that F sub U is
a covering of x prime, blah, blah, blah, blah, blah. It knows where to put unions. It knows to put squares
at the end of proofs. It makes lemmas. It makes references to previous lemmas. Right, like we hear, like. It’s namely a bi-lemma question. We see that R is geometrically something. So it’s actually pretty crazy. It also sometimes tries to make diagrams. For those of you that have
taken algebraic topology, you know that these commutative diagrams are kind of a thing
that you work with a lot So it kind of got the general gist of how to make those diagrams, but they actually don’t make any sense. And actually, one of my favorite examples here is that it sometimes omits proofs. So it’ll sometimes say, it’ll sometimes say something like theorem, blah, blah, blah,
blah, blah, proof omitted. This thing kind of has gotten the gist of how some of these
math textbooks look like. We can have a lot of fun with this. So we also tried training
one of these models on the entire source
code of the Linux kernel, because again, this is character level stuff that we can train on. And then, when we sample this, it actually again looks
like C source code. It knows how to write if statements. It has, like, pretty good
code formatting skills. It knows to indent after
these if statements. It knows to put curly braces. It actually even makes
comments about some things that are usually nonsense. One problem with this model is that it knows how to declare variables. But it doesn’t always use the
variables that it declares. And sometimes it tries to use variables that
haven’t been declared. This wouldn’t compile. I would not recommend sending this as a pull request to Linux. This thing also figures
out how to recite the GNU, this GNU license character by character. It kind of knows that you
need to recite the GNU license and after the license comes some includes, then some other includes,
then source code. This thing has actually
learned quite a lot about the general structure of the data. Where, again, during training, all we asked this model to do was try to predict the next
character in the sequence. We didn’t tell it any of this structure, but somehow, just through the course of this training process, it learned a lot about
the latent structure in the sequential data. Yeah, so it knows how to write code. It does a lot of cool stuff. I had this paper with
Andrej a couple of years ago where we trained a bunch of these models, and then we wanted to try to poke into the brains of these models and figure out what they are doing and why they are working. So, these recurrent neural networks have this hidden vector which is some vector that’s updated at every time step. And then what we wanted
to try to figure out is whether we could find some elements of this vector that have some semantically
interpretable meaning. So what we did is we trained a neural
network language model, one of these character level models on one of these data sets, and then we picked one of the
elements in that hidden vector and now we look at what is the
value of that hidden vector over the course of a sequence to try to get some sense of maybe what these different hidden
states are looking for. When you do this, a lot
of them end up looking kind of like random gibberish garbage. So here again, what we’ve done, is we’ve picked one
element of that vector, and now we run the sequence forward through the trained model, and now the color of each character corresponds to the
magnitude of that single scalar element of the hidden
vector at every time step when it’s reading the sequence. So you can see that a lot of the vectors in these hidden states are kind of not very interpretable. It seems like they’re
kind of doing some of this low level language modeling to figure out what
character should come next. But some of them end up quite nice. So here we found this vector
that is looking for quotes. You can see that there’s
this one hidden element, this one element in the vector, that is off, off, off, off, off blue and then once it hits a quote, it turns on and remains
on for the duration of this quote. And now when we hit the
second quotation mark, then that cell turns off. So somehow, even though
this model was only trained to predict the next
character in a sequence, it somehow learned that a useful thing, in order to do this, might be to have some cell
that’s trying to detect quotes. We also found this other cell that is, looks like it’s
counting the number of characters since a line break. So you can see that at the
beginning of each line, this element starts off at zero. Throughout the course of the line, it’s gradually more red, so that value increases. And then after the new line character, it resets to zero. So you can imagine that maybe this cell is letting the network keep track of when it needs to write to produce these new line characters. We also found some that, when we trained on the linux source code, we found some examples that are turning on inside the conditions of if statements. So this maybe allows the network to differentiate whether
it’s outside an if statement or inside that condition, which might help it model
these sequences better. We also found some that
turn on in comments, or some that seem like they’re counting the number of indentation levels. This is all just really cool stuff because it’s saying that even though we are only
trying to train this model to predict next characters, it somehow ends up learning a lot of useful structure about the input data. So far this has not really been computer vision, and we need to pull this back to computer vision since this is a vision class. We’ve alluded many times to this image captioning model where we want to build
models that can input an image and then output a
caption in natural language. There were a bunch of
papers a couple years ago that all had relatively
similar approaches. But I’m showing the figure
from the paper from our lab in a totally unbiased way. But the idea here is that the caption is this variable length sequence, which might have different numbers of words for different captions. So this is a totally natural fit for a recurrent
network language model. So then what this model looks like is we have some convolutional network which will input the, which will take as input the image, and we’ve seen a lot about how convolution networks work at this point, and that convolutional
network will produce a summary vector of the image which will then feed
into the first time step of one of these recurrent
neural network language models which will then produce words
of the caption one at a time. So the way that this kind
of works at test time after the model is trained looks almost exactly the same as these character level language models that we saw a little bit ago. We’ll take our input image, feed it through our convolutional network. But now instead of
taking the softmax scores from an ImageNet model, we’ll instead take this
4,096 dimensional vector from the end of the model, and we’ll take that vector and use it to summarize the whole
content of the image. Now, remember when we talked
about RNN language models, we said that we need to
seed the language model with that first initial input to tell it to start generating text. So in this case, we’ll give
it some special start token, which is just saying, hey, this
is the start of a sentence. Please start generating some text conditioned on this image information. So now previously, we saw that
in this RNN language model, we had these matrices that
were taking the input at the current time step and the hidden state of
the previous time step and combining those to
get the next hidden state. Well now, we also need to add
in this image information. People play around with exactly how to incorporate this image information, but one simple way is just to add a third weight matrix that adds in this image information at every time step when computing the next hidden state.
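One simple version of that recurrence, written as a sketch: Wih and v here are just illustrative names for the image-to-hidden weights and the CNN's image feature vector, and the exact formulation varies from paper to paper:

```python
def caption_rnn_step(x, h_prev, v, Wxh, Whh, Wih, Why):
    """Vanilla RNN step that also mixes in an image feature vector v at every time step."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v)  # the third weight matrix injects the image information
    scores = Why @ h                               # unnormalized scores over the word vocabulary
    return h, scores
```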
So now, we’ll compute a distribution over all the scores in our vocabulary, and here our vocabulary is something like all English words, so it could be pretty large. We’ll sample from that distribution and pass that word back as input at the next time step. That will then feed that word in, again get a distribution
over all words in the vocab, and again sample to produce the next word. So then, after that is all done, we’ll generate this complete sentence. We stop generation once we sample the special end token, which kind of corresponds to the period at the end of the sentence. Then once the network samples this end token, we stop generation, we’re done, and we’ve gotten our
caption for this image. And now, during training, we trained this thing to generate, like we put an end token at the end of every caption during training so that the network kind
of learned during training that end tokens come at
the end of sequences. So then, during test time, it tends to sample these end tokens once it’s done generating. So we trained this model in kind of a completely supervised way. You can find data sets that have images together with
natural language captions. Microsoft COCO is probably the biggest and most widely used for this task. But you can just train this model in a purely supervised way. And then backpropagate
through to jointly train both this recurrent neural
network language model and then also pass gradients back into this final layer of this the CNN and additionally update the weights of the CNN to jointly tune
all parts of the model to perform this task. Once you train these models, they actually do some
pretty reasonable things. These are some real results from a model, from one of these trained models, and it says things like a cat sitting on a suitcase on the floor, which is pretty impressive. It knows about cats
sitting on a tree branch, which is also pretty cool. It knows about two people walking on the beach with surfboards. So these models are
actually pretty powerful and can produce relatively
complex captions to describe the image. But that being said, these models are really not perfect. They’re not magical. Just like any machine learning model, if you try to run them on data that was very different
from the training data, they don’t work very well. So for example, this example, it says a woman is
holding a cat in her hand. There’s clearly no cat in the image. But she is wearing a fur coat, and maybe the texture of that coat kind of looked like a cat to the model. Over here, we see a
woman standing on a beach holding a surfboard. Well, she’s definitely
not holding a surfboard and she’s doing a handstand, which is maybe the interesting
part of that image, and the model totally missed that. Also, over here, we see this example where there’s this picture of a spider web in the tree branch, and it totally, and it says something like a bird sitting on a tree branch. So it totally missed the spider, but during training, it never really saw examples of spiders. It just knows that birds sit on tree branches during training. So it kind of makes these
reasonable mistakes. Or here at the bottom, it can’t really tell the difference between this guy throwing
and catching the ball, but it does know that
it’s a baseball player and there’s balls and things involved. So again, just want to
say that these models are not perfect. They work pretty well when
you ask them to caption images that were similar to the training data, but they definitely have a hard time generalizing far beyond that. So another thing you’ll sometimes see is this slightly more advanced
model called Attention, where now when we’re generating
the words of this caption, we can allow the model
to steer it’s attention to different parts of the image. And I don’t want to spend
too much time on this. But the general way
that this works is that now our convolutional network, rather than producing a single vector summarizing the entire image, now it produces some grid of vectors that summarize the, that give maybe one vector
for each spatial location in the image. And now, when we, when this model runs forward, in addition to sampling the
vocabulary at every time step, it also produces a distribution over the locations in the image where it wants to look. And now this distribution
over image locations can be seen as a kind of attention over where the model should
look during training. So now that first hidden state computes this distribution
over image locations, which then goes back to the set of vectors to give a single summary vector that maybe focuses the attention
on one part of that image. And now that summary vector gets fed, as an additional input, at the next time step
of the neural network. And now again, it will
produce two outputs. One is our distribution
over vocabulary words. And the other is a distribution
over image locations. This whole process will continue, and it will sort of do
these two different things at every time step. And after you train the model, then you can see that it kind of will shift it’s attention around the image for every word that it
generates in the caption. Here you can see that it produced the caption,
a bird is flying over, I can’t see that far. But you can see that its attention is shifting around
different parts of the image for each word in the
caption that it generates. There’s this notion of hard attention versus soft attention, which I don’t really want
to get into too much, but with this idea of soft attention, we’re kind of taking
a weighted combination of all features from all image locations, whereas in the hard attention case, we’re forcing the model to
select exactly one location to look at in the image at each time step. So the hard attention case where we’re selecting
exactly one image location is a little bit tricky because that is not really
a differentiable function, so you need to do
something slightly fancier than vanilla backpropagation in order to just train the
model in that scenario. And I think we’ll talk about
that a little bit later in the lecture on reinforcement learning. Now, when you look at after you train one of these attention models and then run it on to generate captions, you can see that it tends
to focus its attention on maybe the salient or semantically meaningful part of the image when generating captions. You can see that the caption was a woman is throwing a frisbee in a park, and you can see from this attention mask that when the model generated the word frisbee, it was focusing its
attention on this image region that actually contains the frisbee. This is actually really cool. We did not tell the model
where it should be looking at every time step. It sort of figured all that out for itself during the training process. Because somehow, it
figured out that looking at that image region was
the right thing to do for this image. And because everything in
this model is differentiable, because we can backpropagate through all these soft attention steps, all of this soft attention stuff just comes out through
the training process. So that’s really, really cool. By the way, this idea of
recurrent neural networks and attention actually
gets used in other tasks beyond image captioning. One recent example is this idea of visual question answering. So here, our model is going
to take two things as input. It’s going to take an image and it will also take a
natural language question that’s asking some
question about the image. Here, we might see this image on the left and we might ask the question, what endangered animal
is featured on the truck? And now the model needs to select from one of these four natural language answers about which of these answers
correctly answers that question in the context of the image. So you can imagine kind of
stitching this model together using CNNs and RNNs in
kind of a natural way. Now, we’re in this many to one scenario, where now our model needs to take as input this natural language sequence, so we can imagine running
a recurrent neural network over each element of that input question, to now summarize the input
question in a single vector. And then we can have a CNN
to again summarize the image, and now combine both
the vector from the CNN and the vector from the
question and coding RNN to then predict a
distribution over answers. We also sometimes, you’ll also sometimes see this idea of soft spacial attention
being incorporated into things like visual
question answering. So you can see that here, this model is also having
the spatial attention over the image when it’s trying to determine answers to the questions. Just to, yeah, question? So the question is How are the different inputs combined? Do you mean like the
encoded question vector and the encoded image vector? Yeah, so the question is how are the encoded image and the encoded question vector combined? Kind of the simplest thing to do is just to concatenate them and stick them into
fully connected layers. That’s probably the most common and that’s probably
the first thing to try. Sometimes people do
slightly fancier things where they might try to have
multiplicative interactions between those two vectors to allow a more powerful function. But generally, concatenation
is kind of a good first thing to try. Okay, so now we’ve talked
about a bunch of scenarios where RNNs are used for
different kinds of problems. And I think it’s super cool because it allows you to start tackling really complicated problems combining images and computer vision with natural language processing. And you can see that we
can kind of stitch together these models like Lego blocks and attack really complicated things, like image captioning or
visual question answering just by stitching together
these relatively simple types of neural network modules. But I’d also like to mention that so far, we’ve talked about this idea of a single recurrent network layer, where we have sort of one hidden state, and another thing that
you’ll see pretty commonly is this idea of a multilayer
recurrent neural network. Here, this is a three layer
recurrent neural network, so now our input goes in and produces
a sequence of hidden states from the first recurrent
neural network layer. And now, after we run kind of one recurrent neural network layer, then we have this whole
sequence of hidden states. And now, we can use the
sequence of hidden states as an input sequence to another recurrent neural network layer. And then you can just imagine, which will then produce another
sequence of hidden states from the second RNN layer. And then you can just imagine stacking these things
on top of each other, cause we know that we’ve
seen in other contexts that deeper models tend to perform better for various problems. And the same kind of
holds in RNNs as well. For many problems, you’ll see maybe a two or three layer recurrent
neural network model is pretty commonly used. You typically don’t see
super deep models in RNNs. So generally, like two,
three, four layer RNNs is maybe as deep as you’ll typically go. Then, I think it’s also really
interesting and important to think about, now we’ve seen kind of what kinds of problems
these RNNs can be used for, but then you need to think
a little bit more carefully about exactly what happens to these models when we try to train them. So here, I’ve drawn this
little vanilla RNN cell that we’ve talked about so far. So here, we’re taking
our current input, x t, and our previous hidden
state, h t minus one. Those are two vectors, so we can just stack them together, and then perform this matrix multiplication with our weight matrix, and then squash that output through a tanh, and that will give us our next hidden state. And that's kind of the basic functional form of this vanilla recurrent neural network.
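In code, that functional form is just a couple of lines; here is a sketch of the "stack, multiply by w, squash through a tanh" step, with shapes assumed for illustration.

```python
import numpy as np

def vanilla_rnn_cell(x_t, h_prev, W, b):
    """One vanilla RNN step: h_t = tanh(W [h_{t-1}; x_t] + b).
    W has shape (H, H + D) if h is H-dimensional and x is D-dimensional."""
    stacked = np.concatenate([h_prev, x_t])
    return np.tanh(W @ stacked + b)
```

But then, we need to think about what happens in this architecture during the backward pass when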
we try to compute gradients? So during the backward pass, we'll receive the derivative of the loss with respect to h t. And during the backward pass through the cell, we'll need to compute the derivative of the loss with respect to h t minus one. Then, when we compute this backward pass, we see that the gradient flows backward through this red path. So first, that gradient
will flow backwards through this tanh gate, and then it will flow backwards through this matrix multiplication gate. And then, as we’ve seen in the homework and when implementing these
matrix multiplication layers, when you backpropagate through this matrix multiplication gate, you end up multiplying by the transpose of that weight matrix. So that means that every
time we backpropagate through one of these vanilla RNN cells, we end up multiplying by some
part of the weight matrix. So now if you imagine
that we are sticking many of these recurrent neural
network cells in sequence, because again this is an RNN: we want to model sequences. Now if you imagine what
happens to the gradient flow through a sequence of these layers, then something kind of
fishy starts to happen. Because now, when we want to compute the gradient of the loss
with respect to h zero, we need to backpropagate through every one of these RNN cells. And every time you
backpropagate through one cell, you’ll pick up one of
these w transpose factors. So that means that the final expression for the gradient on h zero will involve many, many factors of this weight matrix, which could be kind of bad. Don't think about the matrix case for a moment; imagine a scalar case. If we have some scalar and we multiply by that same number over and over and over again, maybe not for just four time steps as in this example, but for something like a hundred or several hundred time steps, then multiplying by the same number over and over again is really bad. In the scalar case, it's either going to explode in the case that that number is greater than one, or it's going to vanish towards zero in the case that that number is less than one in absolute value. And the only way in which this will not happen is if that number is exactly one, which is actually very rare to happen in practice. That same intuition extends to the matrix case, but now, rather than the absolute value of a scalar number, you instead need to look at the largest singular value of this weight matrix. Now if that largest singular value is greater than one, then during this backward pass, when we multiply by the weight matrix over and over, that gradient on h zero will become very, very large. And that's something we call
the exploding gradient problem, where now this gradient will explode exponentially with the number of time steps that we backpropagate through. And if the largest singular
value is less than one, then we get the opposite problem, where now our gradients will shrink and shrink and shrink exponentially, as we backpropagate and pick
up more and more factors of this weight matrix. That’s called the
vanishing gradient problem.
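Here is a tiny scalar version of that argument, with purely illustrative numbers.

```python
# Multiply a "gradient" of 1.0 by the same factor for 100 time steps.
for w in [0.9, 1.0, 1.1]:
    g = 1.0
    for _ in range(100):
        g *= w
    print(w, g)   # 0.9 -> ~2.7e-5 (vanishes), 1.0 -> 1.0, 1.1 -> ~1.4e4 (explodes)
```

There's a bit of a hack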
that people sometimes do to fix the exploding gradient problem called gradient clipping, which is just this simple heuristic saying that after we compute our gradient, if its L2 norm is above some threshold, then just scale it down so that it has this maximum norm. This is kind of a nasty hack, but it actually gets used
in practice quite a lot when training recurrent neural networks. And it’s a relatively useful tool for attacking this
exploding gradient problem.
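A sketch of that clipping heuristic, with an arbitrary threshold value:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

But now for the vanishing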
gradient problem, what we typically do is we might need to move to a more complicated RNN architecture. So that motivates this idea of an LSTM. An LSTM, which stands for
Long Short Term Memory, is this slightly fancier
recurrence relation for these recurrent neural networks. It’s really designed to help alleviate this problem of vanishing
and exploding gradients. So that rather than kind
of hacking on top of it, we just kind of design the architecture to have better gradient flow properties. Kind of an analogy to those
fancier CNN architectures that we saw at the top of the lecture. Another thing to point out is that the LSTM cell
actually comes from 1997. So this idea of an LSTM has been around for quite a while, and these folks were
working on these ideas way back in the 90s, and were definitely ahead of the curve, because these models are kind of used everywhere now, 20 years later. And LSTMs kind of have
this funny functional form. So remember when we had this vanilla recurrent neural network, it had this hidden state. And we used this recurrence relation to update the hidden
state at every time step. Well, now in an LSTM, we actually maintain two hidden states at every time step. One is this h t, which is called the hidden state, which is kind of an analogy to the hidden state that we had in the vanilla RNN. But an LSTM also maintains a second vector, c t, called the cell state. And the cell state is this vector which is kind of internal, kept inside the LSTM, and it does not really get exposed to the outside world. And you can kind of see that through this update equation: when we compute these, we take our two inputs and use them to compute these four gates called i, f, o, and g. We use those gates to
update our cell states, c t, and then we expose part of our cell state as the hidden state at the next time step. This is kind of a funny functional form, and I want to walk through
for a couple slides exactly why we use this architecture and why it makes sense, especially in the context of vanishing or exploding gradients. The first thing that we do in an LSTM is that we're given this previous hidden state, h t minus one, and we're given our current input vector, x t, just like in the vanilla RNN. In the vanilla RNN, remember, we took those two input vectors. We concatenated them. Then we did a matrix multiply to directly compute the next
hidden state in the RNN. Now, the LSTM does something
a little bit different. We’re going to take our
previous hidden state and our current input, stack them, and now multiply by a
very big weight matrix, w, to compute four different gates, which all have the same
size as the hidden state. Sometimes, you’ll see this
written in different ways. Some authors will write
a different weight matrix for each gate. Some authors will combine them all into one big weight matrix. But it's all really the same thing. The idea is that we
take our hidden state, our current input, and then we use those to
compute these four gates. You often see these four gates written
as i, f, o, g, ifog, which makes it pretty easy
to remember what they are. I is the input gate. It says how much do we want
to input into our cell. F is the forget gate. How much do we want to
forget the cell memory from the previous time step. O is the output gate, which is how much do we want to reveal our cell to the outside world. And G really doesn't have a nice name, so I usually call it the gate gate. It tells us how much do we want to write into our cell. And then you notice that
each of these four gates are using a different non linearity. The input, forget and output gate are all using sigmoids, which means that their values
will be between zero and one, whereas the gate gate uses a tanh, which means its output will be between minus one and one. So, these are kind of weird, but it makes a little bit more sense if you imagine them all as binary values; like, what happens at the extremes of these two values? If you look at this next equation after we compute these gates, you can see that our cell state from the previous time step is being multiplied element wise by this forget gate. And now if this forget gate, you can think of it as being
a vector of zeros and ones, that's telling us for each element in the cell state, do we want to forget that element of the cell in the case that the forget gate is zero, or do we want to remember that element of the cell in the case that the forget gate is one. Now, once we've used the forget gate to gate off part of the cell state, then we have the second term, which is the element wise product of i and g. So now, i is this vector of zeros and ones, because it's coming through a sigmoid, telling us for each
element of the cell state, do we want to write to that
element of the cell state in the case that i is one, or do we not want to write to
that element of the cell state at this time step in the case that i is zero. And now the gate gate, because it's coming through a tanh, will be between minus one and one, roughly either one or minus one in the binary intuition. So that is the candidate value that we might consider writing to each element of the cell state at this time step. Then if you look at the cell state equation, you can see that at every time step, the cell state has these different, independent scalar values, and they're all being incremented or decremented by up to one. So inside the cell state,
we can either remember or forget our previous state, and then we can either
increment or decrement each element of that cell state by up to one at each time step. So you can kind of think of
these elements of the cell state as being little scalar integer counters that can be incremented and decremented at each time step. And now, after we've
computed our cell state, then we use our now updated cell state to compute a hidden state, which we will reveal to the outside world. So because this cell state
has this interpretation of being counters, sort of counting up by one or minus one at each time step, we want to squash that counter value into a nice minus one to one range using a tanh. And now, we multiply element wise by this output gate. And the output gate is again
coming through a sigmoid, so you can think of it as
being mostly zeros and ones, and the output gate tells us for each element of our cell state, do we want to reveal or not reveal that element of our cell state when we’re computing the
external hidden state for this time step.
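Putting those update equations together, a single LSTM step might look like the following sketch, using one big weight matrix w of shape four h by (h plus d) as described above; the helper names and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + D); b has shape (4H,)."""
    H = h_prev.shape[0]
    gates = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(gates[0:H])        # input gate:  write to the cell?
    f = sigmoid(gates[H:2*H])      # forget gate: keep the old cell state?
    o = sigmoid(gates[2*H:3*H])    # output gate: reveal the cell?
    g = np.tanh(gates[3*H:4*H])    # candidate values to write
    c_t = f * c_prev + i * g       # element wise update of the cell state
    h_t = o * np.tanh(c_t)         # expose part of the cell as the hidden state
    return h_t, c_t
```

And then, I think there's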
kind of a tradition among people trying to explain LSTMs, that everyone needs to come up with their own potentially
confusing LSTM diagram. So here’s my attempt. Here, we can see what’s going
on inside this LSTM cell is that we're taking as input on the left our previous cell state and our previous hidden state, as well as our current input, x t. Now we're going to take our previous hidden state, as well as our current input, stack them, and then multiply with
this weight matrix, w, to produce our four gates. And here, I’ve left
out the non linearities because we saw those on a previous slide. And now the forget gate
multiplies element wise with the cell state. The input and gate gate
are multiplied element wise and added to the cell state. And that gives us our next cell state. The next cell state gets
squashed through a tanh, and multiplied element
wise with this output gate to produce our next hidden state. Question? No, so they're coming from different parts of this weight matrix. So if our x and our h both have this dimension h, then after we stack them, they'll be a vector of size two h, and now our weight matrix will be this matrix of size four h by two h. So you can think of that as sort of having four chunks of this weight matrix. And each of these four
chunks of the weight matrix is going to compute a
different one of these gates. You’ll often see this written for clarity, kind of combining all
four of those different weight matrices into a
single large matrix, w, just for notational convenience. But they’re all computed using different parts
of the weight matrix. But you’re correct in
that they’re all computed using the same functional form of just stacking the two things and taking the matrix multiplication. Now that we have this picture, we can think about what
happens to an LSTM cell during the backwards pass? We saw, in the context of the vanilla recurrent neural network, that some bad things happened during the backwards pass, where we were continually multiplying by that weight matrix, w. But now, the situation looks quite a bit different in the LSTM. If you imagine this path backwards of computing the gradients
of the cell state, we get quite a nice picture. Now, when we have our upstream gradient from the cell coming in, then once we backpropagate backwards through this addition operation, remember that this addition just copies that upstream gradient
into the two branches, so our upstream gradient
gets copied directly and passed backwards into this element wise multiply. So then our upstream
gradient ends up getting multiplied element wise
by the forget gate. As we backpropagate backwards
through this cell state, the only thing that happens to our upstream cell state gradient is that it ends up getting
multiplied element wise by the forget gate. This is really a lot nicer than the vanilla RNN for two reasons. One is that this forget gate is now an element wise multiplication rather than a full matrix multiplication. So element wise multiplication is going to be a little bit nicer than full matrix multiplication. Second is that element wise multiplication will potentially be
multiplying by a different forget gate at every time step. So remember, in the vanilla RNN, we were continually multiplying
by that same weight matrix over and over again, which led very explicitly to these exploding or vanishing gradients. But now in the LSTM case, this forget gate can
vary from each time step. Now, it’s much easier for the model to avoid these problems of exploding and vanishing gradients. Finally, because this forget gate is coming out from a sigmoid, this element wise multiply is guaranteed to be between zero and one, which again, leads to sort
of nicer numerical properties if you imagine multiplying by these things over and over again. Another thing to notice
is that in the context of the vanilla recurrent neural network, we saw that during the backward pass, our gradients were also flowing through a tanh at every time step. But now, in an LSTM, our hidden state is used to compute those outputs, y t, so now, if you imagine backpropagating
from the final hidden state back to the first cell state, then through that backward path, we only backpropagate through
a single tanh non linearity rather than through a separate
tanh at every time step. So kind of when you put
all these things together, you can see this backwards pass backpropagating through the cell state is kind of a gradient super highway that lets gradients pass
relatively unimpeded from the loss at the very end of the model all the way back to the initial cell state at the beginning of the model. Was there a question? Yeah, what about the
gradient with respect to w? 'Cause that's ultimately the thing that we care about. So, the gradient with respect to w will come through at every time step: we'll take our current cell state as well as our current hidden state, and that will give us our local gradient on w for that time step. And just as in the vanilla RNN case, we'll end up adding those per-time-step w gradients to compute our final gradient on w. But now, imagine the situation where we have a very long sequence, and we're only getting a gradient at the very end of the sequence. Now, as you backpropagate through, we'll get a local gradient on w for each time step, and that local gradient on w will be coming through
these gradients on c and h. So because we’re maintaining
the gradients on c much more nicely in the LSTM case, those local gradients
on w at each time step will also be carried forward and backward through time much more cleanly. Another question? Yeah, so the question is due to the non linearities, could this still be susceptible
to vanishing gradients? And that could be the case. Actually, one problem you might imagine is that if these forget gates are always much less than one, you might get vanishing gradients as you continually go
through these forget gates. Well, one sort of trick
that people do in practice is that they will, sometimes, initialize the biases of the forget gate to be somewhat positive. So that at the beginning of training, those forget gates are
always very close to one. So that at least at the
beginning of training, we have relatively clean gradient flow through these forget gates, since they're all
initialized to be near one. And then throughout
the course of training, then the model can learn those biases and kind of learn to
forget where it needs to.
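A sketch of that initialization trick, following the i, f, o, g ordering from the LSTM sketch above; the particular bias value is just a common choice, not a rule.

```python
import numpy as np

H, D = 128, 64                              # hypothetical sizes
W = np.random.randn(4 * H, H + D) * 0.01    # small random weights
b = np.zeros(4 * H)
b[H:2*H] = 1.0   # forget-gate biases start positive, so f = sigmoid(1) ~ 0.73,
                 # keeping the cell-state gradient path open early in training
```

You're right that there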
still could be some potential for vanishing gradients here. But it’s much less extreme than the vanilla RNN case, both because those fs can
vary at each time step, and also because we’re doing this element wise multiplication rather than a full matrix multiplication. So you can see that this LSTM actually looks quite similar to ResNet. In this residual network, we had this path of identity connections going backward through the network and that gave, sort of
a gradient super highway for gradients to flow backward in ResNet. And now it’s kind of the
same intuition in LSTM where these additive and element wise multiplicative interactions
of the cell state can give a similar gradient super highway for gradients to flow backwards
through the cell state in an LSTM. And by the way, there’s this
other kind of nice paper called highway networks, which is kind of in between this idea of this LSTM cell and these residual networks. So these highway networks actually came before residual networks, and they had this idea where at every layer of the highway network, we're going to compute sort of a candidate activation, as well as a gating function that tells us how to interpolate between our previous input at that layer and that candidate activation that came through our
convolutions or what not. So there’s actually a lot of
architectural similarities between these things, and people take a lot of inspiration from training very deep CNNs and very deep RNNs and there’s a lot of crossover here. Very briefly, you’ll see a
lot of other types of variants of recurrent neural network
architectures out there in the wild. Probably the most common,
apart from the LSTM, is this GRU, called the
gated recurrent unit. And you can see those
update equations here, and it kind of has this
similar flavor to the LSTM, where it uses these multiplicative element wise gates together with these additive interactions to avoid this vanishing gradient problem.
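For reference, here is a sketch of one GRU step, written with separate weight matrices for each gate; the names and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, Wr, Wz, Wh, br, bz, bh):
    """One GRU step; each W has shape (H, H + D)."""
    stacked = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ stacked + br)    # reset gate
    z = sigmoid(Wz @ stacked + bz)    # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate
    return (1 - z) * h_prev + z * h_tilde   # interpolate old vs new hidden state
```

There's also this cool paper called LSTM: A Search Space Odyssey, very inventive title, where they tried to play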
around with the LSTM equations and swap out the non
linearities at one point, like do we really need that tanh for exposing the output gate, and they tried to answer a lot
of these different questions about each of those non linearities, each of those pieces of
the LSTM update equations. What happens if we change the model and tweak those LSTM
equations a little bit. And kind of the conclusion is that they all work about the same. Some of them work a little bit better than others for one problem or another, but generally, none of the tweaks of the LSTM that they tried were significantly better than the original LSTM for all problems. So that gives you a little bit more faith that, even though the LSTM update equations seem kind of magical, they're useful anyway. You should probably consider
them for your problem. There’s also this cool paper
from Google a couple years ago where they did kind of an evolutionary search over a very large number of random RNN architectures: they kind of randomly permute these update equations and try putting the additions and the multiplications and the gates and the non linearities in different kinds of combinations. They blasted this out over their huge Google cluster and just tried a whole bunch of these different update equations in various flavors. And again, it was the same story: they didn't really find anything that was significantly better than these existing GRU or LSTM styles, although there were some
variations that worked maybe slightly better or
worse for certain problems. But kind of the takeaway is that, when using an LSTM or GRU, there's probably not so much magic in those exact equations; rather, this idea of managing gradient flow properly through these additive connections and these multiplicative gates is what's super useful. So yeah, the summary is
that RNNs are super cool. They can allow you to attack
tons of new types of problems. They sometimes are
susceptible to vanishing or exploding gradients. But we can address that
with gradient clipping and with fancier architectures. And there's a lot of cool overlap between CNN architectures
and RNN architectures. So next time, you’ll
be taking the midterm. But after that, we’ll
have a, sorry, a question? The midterm is after this lecture, so anything up to this point is fair game. And so you guys, good luck
on the midterm on Tuesday.
