How Deep Neural Networks Work

Neural networks are good for learning lots of different types of patterns. To give an example of how this works, imagine you had a four pixel camera. Not four megapixels, just four pixels, and only black and white. You want to go around taking pictures of things and automatically determine whether each picture is solid (all white or all dark), a vertical line, a diagonal line, or a horizontal line. This is tricky because you can’t do it with simple rules about the brightness of the pixels. Both of these are horizontal lines, but if you tried to make a rule about which pixel is bright and which is dark, you wouldn’t be able to do it. So to do this with a neural network, you start by taking all of your inputs, in this case our four pixels, and breaking them out into input neurons. You assign a number to each of these depending on the brightness or darkness of the pixel: +1 is all the way white, -1 is all the way black, and gray is zero, right in the middle. These values, once you have them broken out and listed like this on the input neurons, are also called the input vector or array. It’s just a list of numbers that represents your inputs right now. It’s a useful notion to think about the receptive field of a neuron. All this means is: what set of inputs makes the value of this neuron as high as it can possibly be? For input neurons this is pretty easy. Each one is associated with just one pixel, and when that pixel is all the way white, the value of that input neuron is as high as it can go. The black and white checkered areas show pixels that an input neuron doesn’t care about: whether they’re all the way white or all the way black, they don’t affect the value of that input neuron at all. Now, to build a neural network, we create a neuron. The first thing this does is add up all of the values of the input neurons.
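
Concretely, the input vector for this four-pixel camera is nothing more than a list of four numbers. Here is a tiny Python sketch (my own illustration, with made-up pixel values, not code from the video):

```python
# A hypothetical four-pixel reading: +1.0 is fully white, -1.0 is fully black,
# 0.0 is mid gray.
image = [1.0, -1.0, 1.0, -1.0]

# The input layer is just this list of numbers: the input vector.
input_vector = list(image)
print(input_vector)   # [1.0, -1.0, 1.0, -1.0]
```
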
So in this case, if we add up all of those values, we get 0.50. Now, to complicate things just a little bit, each of the connections is weighted, meaning it’s multiplied by a number. That number can be 1 or -1 or anything in between. So for instance, if something has a weight of -1, it’s multiplied by -1, you get its negative, and that’s added in. If something has a weight of zero, it’s effectively ignored. Here’s what those weighted connections might look like. You’ll notice that after the values of the input neurons are weighted and added, the final value is completely different. Graphically, it’s convenient to represent these weights as white links for positive weights and black links for negative weights, with the thickness of the line roughly proportional to the magnitude of the weight. Then, after you add the weighted input neurons, they get squashed, and I’ll show you what that means. You have a sigmoid squashing function. Sigmoid just means S-shaped. (An S-curve that runs from -1 to +1, like this one, is the hyperbolic tangent, tanh.) What this does is: you put a value in, say 0.5, run a vertical line up to your sigmoid, then a horizontal line over from where it crosses, and where that hits the y-axis is the output of your function. So in this case, slightly less than 0.5; it’s pretty close. As your input number gets larger, the output number also gets larger, but more slowly, and no matter how big the number you put in, the answer is always less than one. Similarly, when you go negative, the answer is always greater than negative one. This ensures that a neuron’s value never gets outside the range of +1 to -1, which is helpful for keeping the computations in the neural network bounded and stable. So after you sum the weighted values of the neurons and squash the result, you get the output: in this case, 0.746. That is a neuron. We can collapse all of that down: this is a neuron that does a weighted sum and squashes the result.
And now, instead of just one of those, assume you have a whole bunch. There are four shown here, but there could be 400 or 4 million. To keep our picture clear, we’ll assume for now that the weights are either +1 (white lines), -1 (black lines), or 0, in which case they’re missing entirely. But in actuality, each of the neurons we created is attached to all of the input neurons, and each has some weight between -1 and +1. When we create this first layer of our neural network, the receptive fields get more complex. For instance, each of these ends up combining two of our input neurons, so the receptive field (the pixel values that make that first-layer neuron as large as it can possibly be) now looks like a pair of pixels, either all white or a mixture of white and black, depending on the weights. So for instance, this neuron here is attached to this upper-left input pixel and this lower-left input pixel, and both of those weights are positive, so it combines the two of them: its receptive field is the receptive field of one plus the receptive field of the other. However, this other neuron combines the upper-right pixel and the lower-right pixel. It has a weight of -1 for the lower-right pixel, so it’s most active when that pixel is black; here is its receptive field. Now, because we were careful about how we created that first layer, its values look a lot like input values, and we can turn right around and create another layer on top of it the exact same way, with the output of one layer being the input to the next. And we can repeat this three times or seven times or 700 times for additional layers. Each time, the receptive fields get even more complex. You can see here, using the same logic, that they now cover all of the pixels and a more specific arrangement of which are black and which are white.
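
Stacking layers can be sketched the same way: a layer is just a list of such neurons, and the output of one layer feeds the next. The weights and input below are hypothetical, chosen only to show the mechanics:

```python
import math

def layer(inputs, weight_matrix):
    """One fully connected layer: each row of weights defines one neuron."""
    return [math.tanh(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_matrix]

# A made-up 4-pixel input: white on the left, black on the right.
x = [1.0, 1.0, -1.0, -1.0]

# Two small layers; zero weights stand in for the "missing" connections.
w1 = [[1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]
w2 = [[1.0, -1.0]]

hidden = layer(x, w1)      # first-layer values
output = layer(hidden, w2) # next layer takes the first layer's output as input
print(hidden, output)
```
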
We can create another layer. Again, all of the neurons in one layer are connected to all of the neurons in the previous layer, but we’re assuming here that most of those weights are zero and not shown; that’s not generally the case. So just to mix things up, we’ll create a new layer, but notice that our squashing function isn’t there anymore. We have something new called a rectified linear unit. This is another popular neuron type. You do your weighted sum of all your inputs, but instead of squashing, you rectify: if the sum is negative, you make the value zero; if it’s positive, you keep the value. This is obviously very easy to compute, and in practice it turns out to have very nice stability properties for neural networks as well. After we do this, because some of the weights connecting to those rectified linear units are positive and some are negative, we get receptive fields and their opposites; look at the patterns there. And then finally, when we’ve created as many layers with as many neurons as we want, we create an output layer. Here we have four outputs that we’re interested in: is the image solid, vertical, diagonal, or horizontal? To walk through an example of how this would work, let’s say we start with this input image shown on the left: dark pixels on top, white on the bottom. As we propagate that to our input layer, this is what those values would look like: the top pixels, the bottom pixels. As we move to our first layer, we can see that the combination of a dark pixel and a light pixel summed together gets us zero: gray. Whereas down here we have the combination of a dark pixel plus a light pixel with a negative weight, so that gets us a value of negative one here. Which makes sense, because if we look at the receptive field here, upper-left pixel white, lower-left pixel black, it’s the exact opposite of the input that we’re getting, so we would expect its value to be as low as possible: minus one.
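
A rectified linear unit is even simpler to sketch than the squashing function:

```python
def relu(x):
    """Rectified linear unit: negatives become zero, positives pass through."""
    return x if x > 0 else 0.0

print(relu(-0.7))   # 0.0
print(relu(0.3))    # 0.3
```
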
As we move to the next layer, we see the same types of things: combining zeros to get zeros; combining a negative and a negative with a negative weight, which makes a positive, to get a zero; and here, combining two negatives to get a negative. Again, you’ll notice the receptive field of this one is exactly the inverse of our input, so it makes sense that its value would be negative. As we move to the next layer, all of these zeros, of course, propagate forward. This one has a negative value and a positive weight, so it just moves straight through, but because we have a rectified linear unit, negative values become zero, so it becomes zero again too. This one, though, gets rectified and becomes positive: a negative weight times a negative value is positive. And so when we finally get to the output, we can see they’re all zero except for this horizontal one, which is positive, and that’s the answer: our neural network said this is an image of a horizontal line. Now, neural networks usually aren’t that good, not that clean. So there’s a notion of, given an input, what is truth? In this case, the truth is that this image has a zero for all of these values but a one for horizontal: it’s not solid, it’s not vertical, it’s not diagonal; yes, it is horizontal. An arbitrary neural network will give answers that are not exactly truth; it might be off by a little or a lot. The error is the magnitude of the difference between the truth and the answer given, and you can add all of these up to get the total error for the neural network. The whole idea with learning and training is to adjust the weights to make the error as low as possible. The way this is done: put an image in, calculate the error at the end, then look at how adjusting those weights higher or lower would make that error go up or down, and of course adjust the weights in the way that makes the error go down.
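
The error calculation described above can be sketched like this; the output values for the four categories are made up for illustration:

```python
def total_error(outputs, truth):
    """Sum of absolute differences between the network's answers and truth."""
    return sum(abs(o - t) for o, t in zip(outputs, truth))

# Hypothetical outputs for [solid, vertical, diagonal, horizontal],
# for an image whose truth is "horizontal".
outputs = [0.1, 0.0, 0.2, 0.8]
truth   = [0.0, 0.0, 0.0, 1.0]
print(total_error(outputs, truth))   # 0.1 + 0.0 + 0.2 + 0.2 ≈ 0.5
```

A perfect network would produce outputs equal to truth, for a total error of zero; training nudges the weights to shrink this number.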
Now, the problem with doing this is that each time we go back and calculate the error, we have to multiply all of those weights by all of the neuron values at each layer, and we have to do that again and again, once for each weight. This takes forever in computing terms, so it’s not a practical way to train a big neural network. You can imagine that instead of just rolling down to the bottom of a simple valley, we have a very high-dimensional valley and have to find our way down, and because there are so many dimensions, one for each of these weights, the computation becomes prohibitively expensive. Luckily, there was an insight that lets us do this in a very reasonable time, and that’s that if we’re careful about how we design our neural network, we can calculate this slope, the gradient, directly. We can figure out the direction we need to adjust the weight without going all the way back through our neural network and recalculating. Just to review: the slope we’re talking about is that when we make a change in a weight, the error will change a little bit, and that relation of the change in error to the change in weight is the slope. Mathematically, there are several ways to write this; I’ll favor the one on the bottom, as it’s technically the most correct. We’ll call it dE/dw for shorthand. Every time you see it, just think: “the change in error when I change a weight”, or the change in the thing on top when I change the thing on the bottom. This does get us into a little bit of calculus; we do take derivatives, which is how we calculate slope. If it’s new to you, I strongly recommend a good semester of calculus, just because the concepts are so universal and a lot of them have very nice physical interpretations, which I find very appealing. But don’t worry; otherwise, just gloss over this and pay attention to the rest, and you’ll get a general sense of how this works.
So in this case, if we change the weight by +1, the error changes by -2, which gives us a slope of -2. That tells us the direction we should adjust our weight and how much we should adjust it to bring the error down. Now, to do this, you have to know what your error function is. Assume we had an error function that was the square of the weight, and you can see that our weight is right at -1. The first thing we do is take the derivative, the change in error divided by the change in weight, dE/dw. The derivative of weight squared is two times the weight, so we plug in our weight of -1 and get a slope dE/dw of -2. Now, the other trick that lets us do this with deep neural networks is chaining. To show you how this works, imagine a trivial neural network with just one input layer, one hidden layer, one output layer, and one weight connecting each of them. It’s obvious that the value y is just the value x times the weight connecting them, w1. So if we change w1 a little bit, we just take the derivative of y with respect to w1 and get x: the slope is x. If I change w1 by a little bit, then y will change by x times the size of that adjustment. Similarly for the next step: you can see that e is just the value y times the weight w2, so when we calculate de/dy, it’s just w2. Because this network is so simple, we can calculate from one end to the other: x times w1 times w2 is the error e. So if we want to calculate how much the error will change if I change w1, we just take the derivative of that with respect to w1 and get x times w2. This illustrates that what we just calculated is actually the product of our first derivative, dy/dw1, times the derivative for the next step, de/dy. This is chaining: you can calculate the slope of each tiny step and then multiply all of those together to get the slope, the derivative, of the full chain.
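
Both slopes from this walkthrough can be checked in a few lines. The parabola uses the weight from the text (-1); the x, w1, w2 values in the chaining part are made up, since the text keeps them symbolic:

```python
# The parabola example: error E = w**2, so dE/dw = 2 * w.
w = -1.0
de_dw = 2 * w
print(de_dw)                   # -2.0, the slope at w = -1

# Chaining on the toy network: y = w1 * x, e = w2 * y.
x, w1, w2 = 0.5, 0.3, -0.8     # hypothetical values
dy_dw1 = x                     # slope of y with respect to w1
de_dy = w2                     # slope of e with respect to y
de_dw1 = dy_dw1 * de_dy        # chained slope: x * w2
print(de_dw1)                  # 0.5 * -0.8 = -0.4
```
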
So in a deeper neural network, what this looks like is: if I want to know how much the error will change if I adjust a weight that’s deep in the network, I just calculate the derivative of each tiny little step, all the way back to the weight I’m trying to adjust, and then multiply them all together. Computationally, this is many, many times cheaper than what we had to do before: recalculating the error for the whole neural network for every weight. Now, in the neural network we’ve created, there are several operations we have to back-propagate through, and for each one we have to be able to calculate the slope. The first one is just a weighted connection between two neurons, A and B. Let’s assume we know the change in error with respect to B; we want to know the change in error with respect to A. To get there, we need to know dB/dA. So we just write the relationship between B and A, take the derivative of B with respect to A, and get the weight w. Now we know how to make that step, that little nugget of back-propagation. Another element we’ve seen is sums: all of our neurons sum up a lot of inputs. To take this back-propagation step, we do the same thing: write the expression, then take the derivative of the endpoint Z with respect to the element we’re propagating to, A. dZ/dA in this case is just 1, which makes sense: if we have a sum of a whole bunch of elements and we increase one of those elements by one, we expect the sum to increase by one. That’s the definition of a slope of one, a one-to-one relation. Another element we need to be able to back-propagate is the sigmoid function. This one’s a little more interesting mathematically; I’ll just write it in shorthand like this, the sigma function. It is entirely feasible to go through and take its derivative analytically and calculate it.
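
These little back-propagation nuggets can be verified numerically with finite differences. A sketch with made-up values:

```python
# Hypothetical weight, input value, and a small step size.
w, a, eps = 0.7, 0.3, 1e-6

# Weighted connection: b = w * a, so db/da should equal w.
db_da = (w * (a + eps) - w * a) / eps
print(abs(db_da - w) < 1e-4)     # the numeric slope matches the weight

# Sum: z = a + c, so dz/da should equal 1.
c = 0.2
dz_da = ((a + eps + c) - (a + c)) / eps
print(abs(dz_da - 1.0) < 1e-4)   # the numeric slope is one-to-one
```
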
It just so happens that this function has a nice property: to get its derivative, you just multiply it by one minus itself, so it’s very straightforward to calculate. Another element we’ve used is the rectified linear unit. Again, to figure out how to back-propagate this, we just write out the relation: B is equal to A if A is positive, otherwise it’s zero. And piecewise, for each of those, we take the derivative: dB/dA is either one, if A is positive, or zero. So with all of these little back-propagation steps and the ability to chain them together, we can calculate the effect of adjusting any given weight on the error, for any given input. And so, to train: we start with a fully connected network. We don’t know what any of these weights should be, so we assign them all random values, creating a completely arbitrary random neural network. We put in an input that we know the answer to: we know whether it’s solid, vertical, diagonal, or horizontal, so we know what truth should be and can calculate the error. Then we run it through, calculate the error, and, using back-propagation, adjust all of those weights a tiny bit in the right direction. Then we do that again with another input, and again with another, many thousands or even millions of times if we can get away with it. Eventually, all of those weights will gravitate, rolling down that many-dimensional valley to a nice low spot in the bottom, where the network performs really well and gets pretty close to truth on most of the images. If we’re really lucky, it will look like what we started with: intuitively understandable receptive fields for those neurons and a relatively sparse representation, meaning that most of the weights are small or close to zero. It doesn’t always turn out that way.
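
Putting the pieces together, here is a minimal, hypothetical training-loop sketch for the trivial chained network from earlier (not the video’s four-pixel network): random starting weights, a forward pass, back-propagation by the chain rule, and a small step downhill, repeated many times. The learning rate and training pair are made up:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The handy identity from above: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
x, eps = 0.5, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x)) / eps
print(abs(numeric - sigmoid(x) * (1 - sigmoid(x))) < 1e-4)

# Training sketch: one input, two weights, squared error, gradient descent.
random.seed(0)
w1 = random.uniform(-1, 1)     # completely arbitrary starting weights
w2 = random.uniform(-1, 1)
rate = 0.1                     # hypothetical learning rate
x_in, target = 0.5, 0.25       # one made-up training pair (input, truth)

for step in range(1000):
    y = w1 * x_in              # forward pass, first connection
    out = w2 * y               # forward pass, second connection
    err = out - target         # signed error
    # Back-propagate by the chain rule (for squared error 0.5 * err**2):
    de_dw2 = err * y
    de_dw1 = err * w2 * x_in
    w2 -= rate * de_dw2        # a tiny step in the direction that lowers error
    w1 -= rate * de_dw1

print(abs(w1 * x_in * w2 - target))   # should be very small after training
```
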
But what we are guaranteed is that it will find a pretty good representation: the best it can do by adjusting those weights to get as close as possible to the right answer for all of the inputs. So what we’ve covered is just a very basic introduction to the principles behind neural networks. I haven’t told you quite enough to be able to go out and build one of your own, but if you’re feeling motivated to do so, I highly encourage it. Here are a few resources you’ll find useful. You’ll want to go and learn about bias neurons. Dropout is a useful training tool. There are several resources available from Andrej Karpathy, who is an expert in neural networks and great at teaching about them. There’s also a fantastic article called “The Black Magic of Deep Learning” that has a bunch of practical, from-the-trenches tips on how to get them working well. If you found this useful, I highly encourage you to visit my blog and check out several other “how it works” style posts. You can also get the slides via the links and use them however you like; there’s a link to them down in the comment section as well. Thanks for listening.

100 thoughts on How Deep Neural Networks Work

  1. Still don't quite get it… Let's say that I have a network design: 2-3-3-2. How do I use the chain rule here if I want to calculate a weight next to the input? Which path should I take, through which neurons… Or maybe I'm totally wrong…

  2. Can I ask a question? You weighted four pixel values and squashed the result to get a value of 0.746 at 5:00 in the video, but how can ONE value represent TWO pixels at 6:30?

  3. Why did you say that you can configure your network with as many neurons per layer as you want? Because in the example of the 4-pixel square, if you added one more neuron in the 2nd layer (e.g.), it would have the same value and weights as some other neuron in the same layer… Isn't there a point where adding neurons is useless?

  4. Isn't that tanh? I've seen 2 types of sigmoids:
    1. tanh(x)
    2. 1/(1+exp(-x))
    which should i use though?

    BTW, what do you mean by error function? Is it used to precalculate the error, or to calculate the current one?
    (I'm assuming it's to precalculate, because you would use |truth - answer| for calculating the current one anyway.)

    Also may you show a video on how you should apply backpropagation to the weights?


  5. A machine learning algorithm demo purely in Python, without any modules other than math and random. Simplified, it still does the trick… results of 1.0 (or slightly above) soon match up.

  6. Excellent explanation! Thank you for your hard work. Really enjoyed it. On another note, has anyone ever told you that you sound like Ryan Reynolds? It was like having Deadpool explain neural networks to me, minus the foul language. 🙂

  7. Any advice on when to use RELU and when to use sigmoid? I wrote one in C++ with all sigmoid… Wondered if you can go into more theory about why and which squashing function to use.

  8. I think at 4:45 the function is tanh rather than the sigmoid function, because the sigmoid function has a limit on the y-axis from 0 to 1 and the tanh function has a limit on the y-axis from -1 to 1

  9. I have one question. In the chaining example, you assumed 'e' as output. But while describing, you interpreted 'e' as error. Why is that? And also, is it possible to know the error function?
    Thanks in advance.

  10. Can someone please make Czech subtitles? I would really like to learn it, but there are no Czech videos. 😦

  11. Excellent explanation with a simple example for understanding NNs!! But one mistake suddenly gets corrected at 9:58

  12. Thank you so much, it's great work you did, sir. I hope you will continue in the right way to teach and share information; you're the best 😉

  13. I really wanted to know what it determined that grayish box to be… I was struggling with that the whole time, and was hoping for a percentage-based definitive assessment from the neural network! Did I miss something?

  14. Holy shit! Now I… I actually get it!
    Thank you!
    Clean, concise, informative, astonishingly helpful, you have my deepest gratitude.

  15. Obviously you'd never use a neural net in practice to do this, because there are literally two possible positive results that are easy to check. Or you could use as a feature whether it's a checkerboard, because it's so easy to compute for 2×2.

  16. I was amazed by the way you talk and explain, slowly and clearly; you stay unhurried until the end and don't rush things. Bravo!

  17. If it's all one gradient descent, isn't a NN prone to getting stuck in a local minimum?

    Rephrased on another level: Does a NN try out different "theories" during training?

  19. Below is an awesome article that takes a brief look at Deep Learning with Java. It also shows how to build a simple neural network using Deeplearning4j (DL4J), an open-source, distributed deep learning library for the JVM. Complete source code is also available on GitHub.
    Hope you find this useful !

  20. Brilliant! I was involved 50 years ago in a very early AI project and was exposed to simple neural nets back then. Of course, having no need for neural nets, I forgot most of what I ever knew about them during the interval. And, wow, has the field expanded since then. You have given a very clear and accessible explanation of deep networks and their workings. Will happily subscribe and hope to find further edification on Reinforcement Learning from you. THANK YOU.

  21. 30 years ago, I studied computer science. We were into pattern recognition and stuff, and I was always interested in learning machines, but couldn't get the underlying principle. Now, I got it. That was simply brilliant. Thanks a lot.

  22. "Now I'll demonstrate you that there's nothing mysterious about neural networks and that it's quite an understandable idea"
    Starts drawing pentagrams.

  23. Thank you Brandon for taking the time to explain the logic behind neural networks. You have given me enough information to take the next steps towards building one of my own… and thank you YouTube algo for bringing this video to my attention.

  24. Brilliant video, I've subscribed already. I have a question though: are the connections of the last neuron of the third layer right (7:40)? If I understood correctly, both connections should be "white".

  25. Watched it many times but still confused. (1) Why does the 4-pixel example start with shaded gray, but then the example (from 09:38) uses black and white? (2) At 03:01, how come the weights change from 1.0s to -0.2, 0.0, 0.8, -0.5? Where are these values from?

  26. You've got a confusing fault in your diagram: in layer 3, the bottom neuron has negative inputs, so the image should be exactly the opposite: black on the top and white on the bottom. However, at 10:09 the inputs for this one neuron magically change to white and the rest of the process goes OK. The explanation of the subject is very nice; congratulations.

  27. I would have liked to see how the network handled an L-shape as input, so that everything is black except the top-right pixel

  28. 23:36 "I haven't told you quite enough to build [a neural network] of your own". Yeah, right. I watched this video, because I wanted to start coding afterwards. You should have told this at the beginning, and not at the end of this video and put the "further reading" section first, not last.

  29. really dumb and super complicated explanation of simple things. This is how this world of morons is working. Any dumb loser wants to make himself/herself/itself smart and others dumb with hope they will never become better than him/her/it.

  30. I believe the bottom scenario in the 2nd layer should be the opposite, as the 2 input weights are both negative, and so the outcome should be flipped. Great video nonetheless, well done

  31. Thanks for your excellent explanation! At 4:05 it seems you really use tanh, not sigmoid/the logistic function, since sigmoid goes from 0.0 to 1.0, but your squashing function goes from -1.0 to 1.0.
    Especially, if you take your definition at 20:26, it confirms you are using a sigmoid/the logistic function going from 0.0 to 1.0, which does not work out with your example network.
    It's a minor detail, but I tripped upon it when I tried to actually compute your example step by step. Maybe worth pointing out with a note in the video? Thanks again!

  32. Thank you for your video. One question: is there also a name for your neural network model in the video? Is it a perceptron model?

  33. Great video. A single clear, concrete example is more useful than 100 articles full of abstract equations and brushed-over details. Speaking as someone who's read 100 articles full of abstract equations and brushed-over details.

Leave a Reply

Your email address will not be published. Required fields are marked *