Let’s start talking about convolutional neural networks again. A few people have been asking what the convolutional layers actually look like; you know, what transformations are happening to these input images that mean we can do something interesting in terms of machine learning. So what I’ve done is train a pretty basic network to do digit recognition, so we can see what the convolutions are doing, what the intermediate layers look like, and then hopefully what the classification is at the end, so people can get a good idea. Let’s start with MNIST. MNIST is a dataset that has been around for a good few years, produced by Yann LeCun, who is, I think, currently at Facebook; he’s big in deep learning, with loads of good papers. One of his early efforts in convolutional networks was LeNet, as it’s called, which was a small, five-layer-ish convolutional neural network aimed at this MNIST dataset. It basically said: look, this convolutional neural network is going to be really good at digit recognition; the current state of the art is all these other machine learning techniques, and now we’re even better than that. So what I’ve done is tweak the LeNet model a bit to make it a little more interesting, and then I’ve printed out all the intermediate layers so we can see them on a few digits, so you can see how it works. So these are numbers, when you say digits? That’s right, just the digits 0 to 9.
In fact, they’re small, 28 by 28 pixel images of handwritten digits, 0 to 9. And there are about 70,000 of them in the dataset: about 10,000 for testing and about 60,000 for training. The normal LeNet network is, I think, a convolutional layer, followed by a pooling
— so that’s the spatial downsampling layer — followed by another convolution, followed by another spatial downsampling, and so on. Now, spatial downsampling is useful in some situations, but it’s not very interesting to look at, because we’d then be looking at images which are very, very small. So what I’ve done is do away with that, and I’m just putting in lots more convolutional layers. So my network is (if I refer back to my notes: 1, 2, 3, 4, 5, 6) six convolutional layers, apparently, according to my piece of paper, and then two fully connected artificial neural network layers at the end. So I’m going to write these down: we’ve got six convolutional layers, all of which have 5 by 5 kernels. This isn’t the standard network, this is just one that I came up with myself. On this digit recognition task, any reasonable network will probably do a pretty good job if you thought about it a little bit, because digit recognition is just not quite as hard as character recognition, which in turn is not as hard as other problems, and so on. Is that purely because there are fewer digits than there are characters? Yeah, there’s basically less variation between images. If you’re taking lots of pictures of cats and lots of pictures of dogs, there’s going to be more variation over the images, and more pixels to deal with, than there is in a 28 by 28 picture that may have a 9 in it, or may have a slightly differently shaped 9, you know. And then I’ve got my fully connected layers, so I’m going to say FC1, and I’ve got another fully connected layer here, FC2. FC1 has an output size of 500, according to my piece of paper (I’ve forgotten already), and FC2 has an output size of 10, which is the digits.
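As a rough sketch, the layer sizes described here can be traced in a few lines of Python. The stride of 1, the lack of padding, and the kernel counts of 20 and 50 per layer are all taken from elsewhere in this video; nothing here is the actual Caffe model, just the arithmetic it implies:

```python
# Trace the spatial size through six 5x5 "valid" convolutions
# (stride 1, no padding): each layer shrinks width and height by 4.
kernels_per_layer = [20, 20, 50, 50, 50, 50]  # counts as given in the video
size, kernel = 28, 5

for n_kernels in kernels_per_layer:
    size -= kernel - 1                 # 28 -> 24 -> 20 -> 16 -> 12 -> 8 -> 4
    print(f"{size}x{size}x{n_kernels}")

flattened = size * size * kernels_per_layer[-1]
print(flattened)                       # 4 * 4 * 50 = 800 values feeding FC1
print(500)                             # FC1 output size
print(10)                              # FC2 output size: one per digit
```

Running this confirms the 4 by 4 by 50 volume, and the 800 values, that come up again when the fully connected layers are discussed below.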
Now, if you think back to our last video on convolutional networks, after we’ve done all the convolutions, we have some fully connected layers which actually perform the classification. And the last fully connected layer in this case is going to be however many different classes we have; so we have 10 classes, 0 all the way to 9, so 10 outputs. When output number 2 lights up, or produces a high value, that means digit 1 has been recognized, which is slightly confusing because it’s zero-indexed. Okay. So I won’t go into too much detail about what effect each of these will have on the input image, but you can imagine that if you’ve got a 28 by 28 image input, then this first convolutional layer, which is 5 by 5, is going to reduce the width and the height of that image by 4, because it’s not going to go over the edges; we’re not doing any padding. So you’re then going to have a 24 by 24 image
and the next layer will take it down to 20 by 20
and the next layer will take it down further than that. So let’s talk about how many different kernels there are in each of these layers. The first layers have 20 different kernels each, and then, if I refer back to my model file... Is that the code for it? It’s actually a text document in Caffe
— because this is the library I’ve been using to do this —
which basically explains... it’s the way you detail what size of kernel, how many kernels, how many layers you have, which ones connect to which, and so on. Okay. This is also slightly different from the stock MNIST model file, but it’s, you know, similar stuff. So, for example, if we pick a layer at random: can you see this layer here? We’ve got 20 outputs, a kernel size of 5, a stride of 1, and then this tells you how you’re going to randomize your weights when you begin training. So we just describe all of these things to the network, and it goes off and does most of the work for us. So my first two layers have 20 kernels; then, because I decided this was a good idea, I increased this number to 50 kernels, and I have four layers with 50 kernels. You can imagine that your input image to begin with is 28 by 28, and it’s 1 deep, because we have a grayscale image; these are grayscale digits we’re looking at here. After the first 5 by 5 convolution this image is going to shrink to 24 by 24, but because there are 20 different kernels, all producing their own image output, it’s going to be 24 by 24 by 20, and so on. I won’t draw the whole thing out, because we did that in the last video, but these convolutions, because I’m not using any padding, will slowly decrease the size of the image, and when we get to this point here, at the first fully connected layer, the image should be 4 by 4 by 50 deep, which is 800 different values. Now, often we would go right down to 1 by 1, but that’s not really necessary in this case, because there isn’t too much data. So this first fully connected layer is 500 neurons long,
all of which connect to all of the possible 800 values, and then the final 10 come from these 500. So let’s look at some pictures, and then maybe it will be clearer. It’s not a very complicated network; modern networks get much bigger than this, but it shows you the kind of things that they’re doing. So I’ve printed out some examples of the kinds of things that these networks will do. Let’s just use the number “2” as an example. This is a picture of a “2”!
It’s very exciting for people watching. It’s a 28 by 28 picture of a “2”, which has been normalized so that the background is basically black and the foreground is white. You get slightly better results if you normalize, because if someone pressed lightly with their pen and hasn’t drawn a very firm stroke, then you sort of increase the contrast a little bit. So first we’re going to do a 5 by 5 convolution over this, and we’re going to do 20 of those. So, unsurprisingly, if we move this away, we see a number of kernel convolutions. These are performing low-level image processing tasks, just like the kind of Sobel edge detectors that I talked about in previous videos. This one, for example, is a kind of diagonal gradient: you can see that the edges that run diagonally are quite highlighted. And there are different orientations, so this one is horizontal, and there’s a vertical one over here, and so on. We can’t do a lot of interesting things with this image after just one set of convolutions, but we’re getting there. So this one is starting to be transformed;
some of them are noisier than others. That’s partly due to my not having trained it very long, and partly because maybe that’s useful. So we’re going to do another set of convolutions on all of these inputs; these are now going to be convolutions of convolutions. They start getting a little bit smoother, because we’re shrinking our image down and slowly starting to find higher-level features. So now you can see that the loop at the top of the “2” has been highlighted here, and this one has highlighted only the horizontal bit at the bottom of the “2”, if that kind of makes sense. So different areas of the image are now starting to be highlighted; we’re bringing in different information. And as we keep going you can see we go even further, so we’ve increased the number of features, and you can kind of see that maybe there used to be a “2” there, but we’ve abstracted away the actual “2” now and we’re looking just at features. So there are lots of diagonals here which have been highlighted, and they’re highlighted in very specific ways, because some of them are looking for some things and some of them are looking for other things. Just by looking at these pictures it’s hard to know exactly what each of these is looking for, because they would be looking for something different if you had a different number in there. As we keep going they get smaller and smaller, and more and more abstract. You’re still seeing the tip of a “2” here, and what’s highlighted in this image is the things the computer thinks are useful to learn about what it is to be a “2”, which is a kind of weird concept. And we keep going, and the images continue to get smaller and smaller until we get to our final 4 by 4 images. Now, I’ll put up a comparison of two different digits on these in a minute, but you can see that obviously we’re getting very general shapes now.
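Those first-layer filters do the same job as the hand-designed Sobel-style edge detectors mentioned above. Here is a minimal sketch of a “valid” (no-padding) 2-D convolution in plain Python; the tiny 6 by 6 test image is made up purely for illustration:

```python
def convolve_valid(image, kernel):
    """Slide the kernel over the image with no padding, so an
    H x W image and a k x k kernel give an (H-k+1) x (W-k+1) output."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

# A Sobel-style horizontal-gradient kernel: responds to vertical edges.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 6x6 "image": dark on the left half, bright on the right half.
img = [[0, 0, 0, 9, 9, 9]] * 6

edges = convolve_valid(img, sobel_x)
print(len(edges), len(edges[0]))   # 4 4 -- shrunk by k-1, just like 28 -> 24
print(edges[0])                    # [0, 36, 36, 0]
```

The flat regions give zero and the column containing the edge lights up strongly, which is exactly the “edges are quite highlighted” behaviour described above, and the output shrinks by the kernel size minus one, just as the 28 by 28 digit shrinks to 24 by 24.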
There’s no concept of a “2” anymore; this has been completely abstracted away into which of these light up, and where. And we connect that to the first fully connected layer, which I’ve tried to print out, though it looks kind of odd: it’s just a bunch of activations spread out over these 500 neurons. Not all of them are activated, as you can see; the white ones are very strong activations, the grey ones are in the middle, and the black ones are very low activations. So in some sense it’s learned that when there’s a “2” in this image, these two light up, and so does this one, and so on. Basically I’ve said: here is a picture of a “2”, I’d like you to output the number “2”. And it said: well, okay, if I make these like this, then that would work. So it’s just following a mathematical process. So even for different original “2”s, would those two...? They would be subtly different, yeah. I mean, if you’ve got a really well trained algorithm, they would start to look very similar. But there’s a lot of nuance here, and there are only 10 classes, so there’s going to be more information in here, in some ways, than you need. And then at the end, and this will possibly be easier to understand, these are our final 10 outputs. These two aren’t real; they were just left in by mistake when they were printed. So you can see 0, 1, 2: the white one, the one that’s lit up, is number 2, so this has essentially correctly identified the “2”. Now, obviously, in my program I would read this value out and do something useful with it, I wouldn’t just print it as a block, but you get the idea. Okay, let’s have a look at a “2” versus, let’s say, some other number. A “4” is kind of like a “2”, in the sense that it has some horizontal bits in it. Actually, looking at the digits, really a “4” is nothing like a “2”, I’m talking nonsense.
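Reading that value out in a program, rather than printing it as a block, is just taking the index of the largest of the ten FC2 outputs. The activation values below are invented for illustration, with output index 2 as the bright one:

```python
# Hypothetical FC2 activations for the ten classes, digits 0 to 9.
# White (strong) activation at index 2, everything else near zero.
outputs = [0.01, 0.02, 0.91, 0.01, 0.00, 0.01, 0.02, 0.00, 0.01, 0.01]

# The recognised digit is the index of the largest activation.
predicted_digit = max(range(10), key=lambda i: outputs[i])
print(predicted_digit)  # 2
```

With a softmax on the end, as most classification networks have, these ten values would sum to one, but the argmax readout is the same either way.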
The first layer looks much like the one for the “2”, as you’d expect, because we’ve only done one set of convolutions, and they all do much the same thing. So you can see, for example, that in this one here it’s mostly these kinds of corners that are highlighted, and that’s true of the “4” as well. As we progress, and I’ll skip a few layers, let’s see if I can find the matching layer for the “4”. You can see that some elements are the same and some are different. So this one here, this new one, is darker, but it’s got a white region that isn’t there for the “2”, so this is starting to pick up differences between these two images now. And, you know, if you kept doing this for a while you’d see there are some other differences. Again, I’m showing you this because it’s interesting to see what a convolutional neural network does, but it’s very difficult to look at this and go,
“oh, of course, this is finding where the corners of the ‘4’ are”, and so on. You have to study it for quite a long time to work out what each one is doing. There are people doing that sort of analysis, but to be honest most people just go, “oh, that’s nice, it works, and that’s what’s important”. So we’ve progressed a bit further; this is the last convolutional output before the fully connected layers. Now you can see that actually there are some quite big differences. So this one, for example, is bright in the top left and dark in the top right for the “4”, whereas it’s dark in the top left and bright in the bottom right for the “2”. At this point we’ve abstracted away anything that said exactly what the image looked like, and now we’re just looking at features. So these are basically things that the computer finds useful, and now they’re completely different. And as we now look at the fully connected layers, completely different neurons have been activated. These two are now dark, and there are some black ones for this “4” that aren’t there for the “2”, and so on. So what it’s done is transform the image, using the convolutional layers, into something that looks different to the computer by the time it gets to the fully connected layers, and that’s really useful. And then finally, unsurprisingly, number “4” has lit up; so it’s successfully worked. And that’s basically what it looks like. Now, obviously, if you have a much deeper convolutional network with many more classes, it’s going to be doing lots more hierarchical, complex operations, but this is basically the gist of the work a convolutional network does. And how long did it take you to do that? Oh, well, building the model took a few minutes, and then training it was a few hours, because I added a few convolutional layers. It takes maybe 40 minutes to train at most if you’re using a standard small network, which is still about 98% accurate on these digits.
And how much harder is it to do for, say, letters? It’s a bit harder, because you’ve got 26 classes instead of 10, or even more if you include capitals as well. But on the other hand, if you’re providing images like this, which are very controlled, it’s not very difficult. If you’re handling any possible “a”, then it’s going to be more challenging, but convolutional networks can still do these tasks quite easily. You just increase the number of convolutional layers and the number of kernels you have per layer, to increase the amount that it can do, and then you leave it to train a little bit longer, and it seems to work. It seems to me that all these things we see on websites these days, where it says “are you a human, click this box”, are kind of a thing of the past nowadays? Captchas, yeah. In some sense the old captcha style that we had, where you’d see five or six letters and have to type them in, is defeated by convolutional neural networks, if someone has bothered to train a convolutional network to defeat that task. One important thing to remember is that I have trained this network on a very specific set of digits. If I give it some kind of captcha with digits in, particularly if there’s more than one digit per image, it’s not going to understand, because I’ve been giving it 28 by 28 images with just one digit in. So to get it to work on a specific captcha system, you’re going to have to train it on that specific captcha system. Now, one of the nice things that captcha systems do,
from the point of view of someone trying to crack them, is generate a lot of images: you just use their API and you can generate dataset after dataset. So in some sense, image-based captchas are starting to look a bit weak. On the other hand, as a researcher who isn’t really heavily interested in breaking captchas, I think it probably still serves quite a useful purpose. So, you know, if a spammer is trying to do this, you start to look into more complex captcha systems, for example Google reCAPTCHA, which won’t necessarily show you numbers but will ask you, “can you see all the biscuits in this image?”, and you’ll see a grid of biscuits. Then it’s slightly more complicated for a bot to interact with that HTML, and it’s a slightly more complicated problem, particularly if you don’t know what it’s going to ask until it does. So I guess the idea of it is to keep changing your captcha system with enough frequency that if anyone had trained a network to solve it, that network becomes redundant and can’t solve the next one. The problem is that if I obtain a cookie of yours which is supposed to be secure, then I can send that to, let’s say, Amazon, or to a shop, and say, “I’m Shawn, please, what’s in the shopping basket? What’s his address? What’s his credit card details?”