Inside a Neural Network – Computerphile

Let's start talking about convolutional neural networks again. A few people have been asking what the convolutional layers actually look like: what transformations are happening to these input images that let us do something interesting in terms of machine learning. So what I've done is train a pretty basic network to do digit recognition, so we can see what the convolutions are doing, what the intermediate layers look like, and then hopefully what the classification is at the end, so people can get a good idea.

Let's start with MNIST. MNIST is a dataset that has been around for a few years. It was produced by Yann LeCun, who is, I think, currently at Facebook; he's big in deep learning, with loads of good papers. One of his early efforts in convolutional networks was LeNet, as we call it, which was a small, five-layer-ish convolutional neural network aimed at this MNIST dataset. It basically said: look, this convolutional neural network is going to be really good at digit recognition; the current state of the art is all classical machine learning techniques, and now we're even better than that. So what I've done is tweak the LeNet model a bit to make it a little more interesting, and then I've printed out all the intermediate layers for a few digits, so that you can see how it works.

So these are numbers, when you say digits? That's right, just the digits 0 to 9.
In fact, they're small, 28 pixel by 28 pixel images of handwritten digits 0 to 9. There are, I think, about 90,000 or so of them in the dataset: about 10,000 for testing and about 80,000 for training.
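If you want to poke at the same data yourself, here is a minimal loading sketch using torchvision. This is not the setup from the video, which uses Caffe, and note that torchvision's standard split is 60,000 training and 10,000 test images rather than the figures quoted above.

```python
# Minimal MNIST loading sketch (torchvision, not the video's Caffe setup).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()   # 28x28 grayscale -> 1x28x28 float tensor
train = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
test = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
print(len(train), len(test))   # 60000 10000
img, label = train[0]
print(img.shape, label)        # torch.Size([1, 28, 28]) 5
```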
The normal LeNet network is, I think, a convolutional layer, followed by a pooling layer (that's the spatial downsampling layer), followed by another convolution, followed by another spatial downsampling, and so on. Now, spatial downsampling is useful in some situations, but it's not very interesting to look at, because you end up looking at images which are very, very small. So I've done away with that and just put in lots more convolutional layers. My network, if I refer back to my piece of paper (1, 2, 3, 4, 5, 6), is six convolutional layers, apparently, and then two fully connected artificial neural network layers at the end. So I'm going to write these down: we've got six convolutional layers, all of which have 5 by 5 kernels. This isn't the standard network; it's just one I came up with myself. On this digit recognition task any reasonable network will probably do a pretty good job, if you've thought a little bit about it, because digit recognition just isn't as hard as character recognition, which in turn isn't as hard as other problems, and so on.

Is that purely because there are fewer digits than there are characters? Yeah, there's basically less variation between images. If you're taking lots of pictures of cats and lots of pictures of dogs, there's going to be more variation over the images, and more pixels to deal with, than in a 28 by 28 picture that may have a 9 in it, or a slightly differently shaped 9.

Then I've got my fully connected layers, so I'll call this one FC1 and this one FC2. FC1 has an output size of 500, according to my piece of paper (I've forgotten already), and FC2 has an output size of 10, which is the digits. If you think back to our last video on convolutional networks, after we've done all of the convolutions, we have some fully connected layers which actually perform the classification. The last fully connected layer, in this case, is going to be as long as we have different classes; we have 10 classes, 0 all the way to 9, so 10 outputs. When output index 2 lights up, or produces a high value, that means the digit 2 has been recognized (the outputs are 0-indexed, which is slightly confusing to talk about, but output i corresponds to digit i).

Okay, I won't go into too much detail about what effect each of these layers has on the input image, but you can imagine that if you've got a 28 by 28 image as input, this first convolutional layer, which is 5 by 5, is going to reduce the width and height of that image by 4, because the kernel can't go right up to the edges and we're not doing any padding. So you're then going to have a 24 by 24 image, the next layer will take it down to 20 by 20, and the next layer will take it down further than that.
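As a quick sanity check of those numbers: with no padding and a stride of 1, a k by k kernel turns an n by n image into an (n - k + 1) by (n - k + 1) one, so each 5 by 5 layer shaves 4 pixels off each dimension.

```python
# Valid (unpadded, stride-1) convolution: an n-by-n input with a k-by-k
# kernel gives an (n - k + 1)-by-(n - k + 1) output, so each 5x5 removes 4.
n, k = 28, 5
for layer in range(1, 7):
    n = n - k + 1
    print(f"after conv{layer}: {n}x{n}")   # 24, 20, 16, 12, 8, 4
```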
So let's talk about how many different kernels there are in each of these layers. In the first layers I've got 20 different kernels per layer, and then, if I refer back to my model file...

Is that the code for it? It's actually a text document in Caffe, because that's the framework I've been using to do this. It basically describes the network: this is where you detail what size of kernel, how many kernels, how many layers you have, which one connects to which, and so on. This is also slightly different from the stock MNIST model file, but it's similar stuff. So, for example, if we pick a layer at random, you can see this layer here: we've got 20 outputs, a kernel size of 5, a stride of 1, and then this tells you how you're going to randomize your weights when you begin training. We just describe all of these things to the network, and it goes off and does most of the work for us.

So my first two layers have 20 kernels; then, because I decided this was a good idea, I increase this number to 50 kernels, and I have four layers with 50 kernels. You can imagine that your input image to begin with is 28 by 28, and it's 1 deep, because we have a grayscale image; these are grayscale digits we're looking at here. After the first 5 by 5 convolution this image is going to shrink to 24 by 24, but because there are 20 different kernels, each producing their own image output, it's going to be 24 by 24 by 20, and so on. I won't draw the whole thing out, because we did that in the last video, but these convolutions, because I'm not using any padding, will slowly decrease the size of the image, and when we get to this point here, at the first fully connected layer, the image is 4 by 4 by 50 deep, which is 800 different values. Often we would have gone down to 1 by 1 by then, but that isn't really necessary in this case, because there isn't too much data. So this first fully connected layer is 500 neurons long, all of which connect to all 800 of those values, and then the final 10 outputs come from these 500.
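To make the wiring concrete, here is the same network sketched in PyTorch rather than a Caffe prototxt. The layer sizes follow the description above; the ReLU activations after each layer are an assumption (a standard choice, not something stated in the video).

```python
# Sketch of the described network: six 5x5 convolutions, then two
# fully connected layers. ReLUs are assumed, not confirmed by the video.
import torch
import torch.nn as nn

class DigitNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.ReLU(),    # 28x28x1  -> 24x24x20
            nn.Conv2d(20, 20, 5), nn.ReLU(),   # 24x24x20 -> 20x20x20
            nn.Conv2d(20, 50, 5), nn.ReLU(),   # 20x20x20 -> 16x16x50
            nn.Conv2d(50, 50, 5), nn.ReLU(),   # 16x16x50 -> 12x12x50
            nn.Conv2d(50, 50, 5), nn.ReLU(),   # 12x12x50 -> 8x8x50
            nn.Conv2d(50, 50, 5), nn.ReLU(),   # 8x8x50   -> 4x4x50
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 4*4*50 = 800 values
            nn.Linear(800, 500), nn.ReLU(),    # FC1
            nn.Linear(500, 10),                # FC2: one output per digit
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = DigitNet()
print(net(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```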
So let's look at some pictures, and then maybe it will be clearer. It's not a very complicated network; modern networks get much bigger than this, but it shows you the kind of things they're doing. So I've printed out some examples of the kinds of things these networks will do. Let's just use the number 2 as an example. This is a picture of a 2!
It's very exciting for people watching. It's a 28 by 28 picture of a 2, which has been normalized so that the background is basically black and the foreground is white. You get slightly better results if you normalize: if someone pressed lightly with their pen and didn't make a very firm stroke, you increase the contrast a little bit. So first we're going to do a 5 by 5 convolution over this, and we've got 20 of those, so unsurprisingly, if we move this away, we see a number of kernel convolutions. These are performing low-level image processing tasks, just like the kind of Sobel edge detectors I talked about in previous videos. This one, for example, is a kind of diagonal gradient: you can see that the edges running diagonally are quite highlighted. And there are different orientations, so this one is horizontal, and there's a vertical one over here, and so on.
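The learned kernels are not literally Sobel filters, but first-layer kernels often end up looking like oriented gradient detectors of that kind. A small illustrative sketch, applying a hand-written horizontal-edge kernel (the input here is just random stand-in data):

```python
# Apply a Sobel-like horizontal-edge kernel, similar in spirit to what the
# first convolutional layer learns. The "image" is random stand-in data.
import torch
import torch.nn.functional as F

sobel_y = torch.tensor([[-1., -2., -1.],
                        [ 0.,  0.,  0.],
                        [ 1.,  2.,  1.]]).view(1, 1, 3, 3)
img = torch.rand(1, 1, 28, 28)     # stand-in for a normalized digit image
edges = F.conv2d(img, sobel_y)     # 1x1x26x26 map of horizontal edges
print(edges.shape)
```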
We can't do a lot of interesting things with the image after just this one set of convolutions, but we're getting there. This one is starting to be transformed; some of them are noisier than others. That's partly down to my not having trained it very long, and partly because maybe that's useful. So we're going to do another set of convolutions on all of these outputs, so these are now convolutions of convolutions. They start getting a little bit smoother, because we're shrinking our image down, and we're slowly starting to find higher-level features. So now you can see that the loop at the top of the 2 has been highlighted here, and this one has highlighted only the horizontal bit at the bottom of the 2, if that makes sense. So different areas of the image are now starting to be highlighted; we're bringing in different information. And as we keep going you can see we go even further: we've increased the number of features, and you can maybe still see there used to be a 2 there, but we've abstracted away the actual 2 now and we're looking just at features. So there are lots of diagonals here which have been highlighted, and they're highlighted in very specific ways, because some of these kernels are looking for some things and some are looking for others. Just by looking at these pictures it's hard to know exactly what each one is looking for, because they would respond differently if you had a different number in there.

We keep going, and they get smaller and smaller and more and more abstract. You're still seeing the tip of a 2 here, and the highlighted parts are things the computer thinks are useful to learn about what it is to be a 2, which is a kind of weird concept. And we keep going, and the images continue to get smaller and smaller until we get to our final 4 by 4 images. I'll put up a comparison of two different digits in a minute, but you can see that we're obviously getting very general shapes now. There's no concept of a 2 anymore; that's been completely abstracted away into which of these are lit up and where. We connect that to the first fully connected layer, which I've tried to print out, though it looks kind of odd: it's just a bunch of activations spread out over these 500 neurons. Not all of them are activated, as you can see: the white ones are very strong activations, the grey ones are in the middle, and the black ones are very low activations. So in some sense it's learned that when there's a 2 in the image, these two light up, and so does this one, and so on. Basically I've said, here's a picture of a 2, I'd like you to output the number 2, and it said, well, okay, if I set these like this, then that will work. It's just following a mathematical process.

So even for different original 2s, would those be the same? They would be subtly different, yeah. If you've got a really well trained network, they start to look very similar, but there's a lot of nuance here, and there are only 10 classes, so there's going to be more information in here, in some sense, than you need. And then at the end, and this is possibly easier to understand, these are our final 10 outputs. (These two marks aren't real; they were just left in by mistake when this was printed.) So you can see 0, 1, 2: the white one, the one that's lit up, is number 2, so this has correctly identified the 2. Now obviously, in my program, I would read this value out and do something useful with it; I wouldn't just print it as a block, but you get the idea.
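Reading the result out programmatically is just a matter of taking the largest of the 10 outputs. A tiny sketch, reusing the DigitNet from earlier (the softmax is my own addition; the video doesn't say what, if anything, follows the final layer):

```python
# Hypothetical read-out of the final layer, reusing the DigitNet sketch.
# The softmax is an assumption; argmax over the raw outputs picks the same class.
import torch

logits = net(torch.randn(1, 1, 28, 28))   # 10 raw outputs, one per digit
probs = torch.softmax(logits, dim=1)      # normalize into probabilities
digit = int(probs.argmax(dim=1))          # the output that "lights up"
print(digit, float(probs.max()))
```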
Let's have a look at the 2 versus, let's say, some other number. A 4 is kind of like a 2, in the sense that it has some sort of horizontal bits in it... I should have said any digit, because really a 4 is nothing like a 2; I'm talking nonsense. The first layer looks much like the one for the 2, as you'd expect, because we've only done one set of convolutions and they all do much the same thing. So you can see, for example, that in this one here it's mostly these kinds of corners that are highlighted, and that's true of the 4 as well. As we progress, and I'll skip a few layers, let's see if I can find the matching layer for the 4. You can see that some elements are the same and some are different. So this new one here is darker, but it's got a white region that isn't in the 2, so this one is starting to pick up differences between these two images now. And if you keep doing this for a while, you can see there are some other differences. Again, I'm showing you this because it's interesting to see what a convolutional neural network does, but it's very difficult to look at this and go, "oh, of course, this one is finding where the corners of the 4 are", and so on.
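If you want to produce printouts like these yourself, most frameworks let you capture the intermediate feature maps. A hypothetical sketch using a PyTorch forward hook on the earlier DigitNet (the choice of layer is arbitrary):

```python
# Grab intermediate activations (the feature maps shown in the printouts)
# with a forward hook; reuses the DigitNet sketch, layer choice is arbitrary.
import torch

activations = {}
def grab(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

net.features[4].register_forward_hook(grab("conv3"))   # third conv layer
net(torch.randn(1, 1, 28, 28))
print(activations["conv3"].shape)   # torch.Size([1, 50, 16, 16])
```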
You have to study it for quite a long time to work out what each one is doing. There are people doing that sort of analysis, but to be honest, most people just go, "oh, that's nice, it works", and that's what's important. So we've progressed a bit further, and this is the last convolutional output before the fully connected layers. Now you can see that actually there are some quite big differences. This one, for example, is bright in the top left and dark in the top right for the 4, whereas it's dark in the top left and bright in the bottom right for the 2. At this point we've abstracted away anything that said exactly what the image looked like, and now we're just looking at features: things the computer finds useful, and now they're completely different. And as we now look at the fully connected layers, completely different neurons have been activated. These two are now dark, and there are some black ones for the 4 that aren't there for the 2, and so on. So what it's done is transform the image, using the convolutional layers, into something that looks different to the computer by the time it reaches the fully connected layers, and that's really useful. And then finally, unsurprisingly, number 4 has lit up. So it's successfully worked. And that's basically what it looks like. Now obviously, if you have a much deeper convolutional network with many more classes, it's going to be doing many more hierarchical, complex operations, but this is basically the gist of what a convolutional network does.

And how long did it take you to do that? Oh, well, building the model took a few minutes, and then training it took a few hours, because I added a few extra convolutional layers. It takes, you know, 40 minutes at most to train the standard small networks, which are still highly accurate on these digits.

And how much harder is it to do, say, letters? It's a bit harder, because you've got 26 classes instead of 10, or even more classes if you include capitals as well. But on the other hand, if you're providing images like this, which are very controlled, it's not very difficult. If you have to handle any possible "a", then it's more challenging, but convolutional networks can still do these tasks quite easily. You just increase the number of convolutional layers and the number of kernels per layer, to increase the amount it can do, then leave it to train a little bit longer, and it seems to work.

It seems to me that all these things we see on websites these days, where it says "are you a human, click this box", are kind of a thing of the past nowadays? Captchas, yeah. In some sense the old captcha style we had, where you'd see five or six letters and have to type them in, is defeated by convolutional neural networks, if someone has bothered to train a network to defeat that task. One important thing to remember is that I have trained this network on a very specific set of digits. If I give it some kind of captcha with digits in, particularly if there's more than one digit per image, it's not going to understand, because I've been giving it 28 by 28 images with just one digit in them. So to get it to work on a specific captcha system, you'd have to train it on that specific captcha system.
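For completeness, here is roughly what that training might look like for the DigitNet sketch, reusing the MNIST `train` dataset loaded earlier. The video drives all of this through Caffe's solver, so the optimizer and loss below (plain SGD with momentum, cross-entropy) are assumed, standard choices rather than his actual configuration:

```python
# Hypothetical training loop; assumes `net` and the MNIST `train` dataset
# from the earlier sketches. SGD + cross-entropy are standard assumptions.
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train, batch_size=64, shuffle=True)
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):                       # a couple of passes, for illustration
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(net(images), labels)  # images: Nx1x28x28, labels: N
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```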
Now, one of the nice things captcha systems do, from the point of view of someone trying to crack them, is generate a lot of images: you just go to their API and generate dataset after dataset. So in some sense, image-based captchas are starting to look a bit weak. On the other hand, as a researcher, I'm not heavily interested in breaking captchas; I think they probably serve quite a useful purpose. So, if a spammer is trying to do this, you start to look into more complex captcha systems, for example Google reCAPTCHA, which won't necessarily show you numbers but will ask "can you see all the biscuits in this image?", and you'll see a 3 by 3 grid of pictures. It's slightly more complicated for a bot to interact with that HTML, and it's a slightly harder problem, particularly if you don't know what it's going to ask until it does. So I guess the idea is to keep changing your captcha system with enough frequency that if anyone has trained a network to solve it, it becomes redundant and they can't solve the next one.

100 thoughts on Inside a Neural Network – Computerphile

  1. It’s like a David Lynch movie to me: I almost think I understood it and then everything just becomes a convoluted mess and I feel dumber than before…

  2. what a fantastic explanation, I loved the digits convolution representation
    hope to see more videos about this!
    (RNNs)

  3. It would be nice, if you talked a bit about how much data is needed for a CNN to be any kind of useful. The datasets in this video seem extremely big. Specifically it would be nice to have an idea on how well it works on many "categories" with a low amount of data.

  4. What if you teach it a face at all 180 degrees (of the front). Then reverse the process and ask it what it thinks the face looks like at a particular angle? (I just described computer graphics in 2030, btw.)

  5. A thought that I've gotten when thinking about this and the previous episode, would it be possible to "reverse" the order of the convolutional neural network, getting a sort of idealized result, probably not extremely useful in most cases, but likely somewhat usable for seeing what extra data can be used to train it for more accurate results or perhaps some sort of data generation.
    Doing the same for a standard neural network would not result in any usable data, I know, but it seems like it might be possible with the convolutional one.

  6. I am one of those strange people who draws a horizontal bar through the number 7. How would you deal with that? Would you need a separate set of 7+bar training digits (in effect an 11th character) and then map both 7 and 7+bar back to 7?

  7. With all the edge detection going on, would it be harder to recognize a 4 if some versions had the top parts join at an angle, like the 4 in this font, versus the open version as in the video? Likewise a 7 with or without the strike through it? I mean, does it remember some kind of average of all the objects in a class or all of them / all of the sufficiently different ones (which might be hard for a large database)?

  8. I am curious, would it be possible to run this sort of neural network in reverse in order to produce the sort of "Deep Dream" images that you can see on the Internet? For instance, instead of asking the network 'what digit does this image resemble?', ask 'what does a 2 look like?'

  9. I don't care about the frames-per-second count of this video, unlike a few threads here. It is as off-topic as it gets!
    Let's talk about neural networks. So, I'd like to know how a neural network measures the confidence level of the final detection. Sure, it has a certain accuracy (though a human must check all detections manually, I suppose), but how can it measure the confidence of a specific detection result? I have seen that feature in some OCR software.

  10. So useful. As a CS student, this was more helpful than a ton of other DLNN stuff I've seen online. Thank you!

  11. Seventeen sausage and forty four. leaves five men on the moon and convoluted xrays at the speed of light adding to the mainframe network before the neural transmission eats relatively slowly in comparison

  12. It would have been way more interesting to see different examples of the same number and how it translates into the same output.

  13. Shouldn't the network really have 11 outputs?

    10 for each of the digits and another one to say "not a digit / not sure".

    By only having 10 outputs you're forcing it to give an answer, even when there isn't a sensible answer for it to give – as you've shown it the letter "X" or a picture of Mickey Mouse instead of a digit – or, indeed, when you do give it a digit but it's not able to definitively say whether it's a 7 or a 9 because it's not a particularly clear image.

    The network ought to be given a "none of the above" alternative output because "not a digit" is a valid and intelligent answer, and this also gives you a kind of confidence test. If it really can't make it out, then it can reply "I don't know" which is a more sensible answer than it just giving an answer for the sake of giving an answer, as it can't not give some sort of answer.

    Sometimes, "I don't know" is exactly 100% the correct response to give.

  14. How do you decide what the convolution kernels should be? Is that important, or could they be defined randomly at the beginning?

  15. How are the outputs of the multiple kernels at each layer managed? Are they somehow merged so that the kernels of the next layer all process the same input? Or do the 20 kernels of layer 2 operate on the 20 outputs of the layer 1 kernels respectively? And if the latter, then what happens when moving from a 20 kernel layer to a 50 kernel layer? Would some of the 20 kernels of the previous layer be duplicated twice, and others duplicated three times to make up the inputs to the 50 kernels in the new layer?

  16. I hadn't even noticed the video was at 50fps. Probably because the video is, for the most part, out of focus.

  17. Would I be wrong in thinking that if you gave a convolutional neural network the ability to control where to click and what to type, and gave it enough convolutions and kernels (perhaps beyond what current computers can handle) and trained it enough, then it would be able to solve any captcha, even a new one with a different interface that still used the same basic principles?

  18. Actually, the correct pronunciation of Le-Net is Lo-Net. "Le" in French is like "The" in English, but just for masculine.

  19. Shouldn't one be able to generate characters(letters, whatever) by going the other way around? I'm thinking what if you tell it to generate a picture from a fully connected layer?

  20. people interested in this experiment, you can actually do it in the Machine Learning course (Stanford) on Coursera

  21. Decided to learn Convolution Neural Networks after watching this video, and I have started my own self driving car project (it's in a very basic stage at the moment). Thank you 🙂

  22. Seeing that a lot of people are confused by this video being 50fps, I'd want to clear that up. 50fps is a standard frame rate for television and video in general. 60fps is a standard for animated and generated images, like animations, or games. Sure, you can do either with both, but it's generally so that high-frame-rate TV broadcast are always 50, not 60 fps.
    The scale for TV: 25, 50, 100, 200 Hz
    The scale for Computers: 30, 60, 120 Hz
    (Hz = fps)

  23. It's a really good lecture to understand what is going on inside a NN. I am using a NN for target classification in thermal images. Is a NN a good approach for that? Or should I go for another option?

  24. Dr Pound is the best lecturer here. Very clear, intelligently funny, interesting topics.
    Would deserve his own channel <– that good!

  25. If the first convolution layer has 20 filters and the second one has 20, does this mean that each C2 filter processes all 20 images from C1? That would make 400 images for C2 output

  26. After watching this, the one thing I don't feel is completely explained is where the convolution kernel values come from. At first he says they are things "like Sobel edge detectors", but later says they are not manually entered, but rather learned values. That leaves the obvious question of how are they initialized? Do they start as just matrices with random entries? During the training, how are they adjusted? Is the "training" some kind of iterative search for kernel values that give the strongest response (e.g. the values that most consistently uniquely identify the one digit being learned and most strongly reject the other 8 digits?) I could use a bit more explanation on what the training process looks like and how it adjusts all the kernels.

  27. Why are we shrinking the images by ignoring edges? Can't we just deal with the edges without centering into the image? It just seems like an arbitrary limitation, maybe what's most significant about a "2" is around the edges!

  28. Can someone explain how the final convolutional layer is 4x4x50? My understanding based on the previous Neural Network video is that the first convolution will produce an output of 24x24x20, but then wouldn't the next convolution, which has 20 kernels, produce 20 images of the first image layer of the 20 produced from the first convolution, and then another 20 on the second image layer of the 20 produced from the first convolution, such that at the end of the second layer you'd have a 20x20x400 output, and so forth until at the end you'd have 4x4x(some large number) not 4x4x50?

  29. Maybe I didn't get the idea, but why are there 10, and not 11, classes for numbers?
    Because if I will give an image of "A" to this network, it will probably say to me "this is 1" or "this is 4" instead of giving the negative answer like "its neither of 10 numbers".

  30. at one point binary flashed on the screen as a background, here's a part of the translation: E¤ªÉü

  31. How do you replicate the learned connections to other systems? How is the "knowledge" abstracted for transport, backup, and further improvements?

    With discrete programming, the instructions are compact and finite and are easily copied.

  32. 3:43 but wouldn't it mean that the digit's 2? because we're starting at index 0, and index 0 is 0, so index 2 is 2.

  33. I realize this isn't likely to get a reply this late, but I'm trying to replicate the configuration of this network. What activation function are you using for the first fully connected layer? Is it dotplus with a renormalization? I'm assuming FC2 is a softmax layer, so maybe they are both softmax.

  34. Computerphile, you single handedly helped me regain my interest with computer science.
    Thank you very much for all your videos (:

  35. Thank you Mike, and thank you Shaun, this video is really helping me in my quest! I'm making a small game in which I'm trying to make an AI using the tensorflow library.

  36. This is the best explanation of what is going on inside a neural net! Now I can imagine it more clearly
    Thanks a lot!

  37. Very cool!
    @5:24 Grayscale is quite a few bits deep, 1-bit depth would be Black & White ( which is not the case in your images, looks like you have at least 16-bit images – if not 256-bit standard grayscale – )

  38. Let's see if someone can help me out here. The first layer here outputs 20 24×24 images (or a 20 channel image) after performing all the convolutions. The second layer will output 20 20×20 images. But how are they constructed? How do they combine the 20 channels from the previous layer? I mean, they are not applying all 20 filters to each of the 20 channels, that'd be a 400 channel output. Do they simple add the convolutions for each channel up? So channel 1 of layer 2 is the sum of the convolution between kernel 1 of layer 2 with each of the 20 channels of layer 1?

  39. I don't know man..but I can't understand how a neural network works..I've watched a ton of videos..maybe I am just dumb..it's like I know but I don't know.

  40. You said a 5×5 convolution layer will bring the 28×28 image down by 4 to 24×24. I'm not getting that math lol. 28 / 5 = 5 with remainder 3 no? And then you go to 20×20. What am I missing? Thanks.
