One of the questions I get frequently is ‘how do you design a neural network?’ or, more specifically, ‘how do you know how many layers you need?’ or ‘how do you know what the right value is for a particular hyperparameter?’ First things first: if you are not familiar with convolutional neural networks, you can find the link to my introductory video in the description below.

When people say the best thing about deep learning is that it requires no hand-designed feature extractors, that everything is learned from data, and that there’s almost no human intervention, that’s not entirely true. Indeed, the features are learned from data, and that’s great. A hierarchy of learned features can provide great representational power. But there’s still a lot of human intervention in the model design, although there have been some efforts to automate the model selection process. I think eventually we will get to a point where no human intervention is required, but it seems like humans will still be in the loop for a while.

You might ask: why not just do a grid search over all hyperparameters and automatically pick the best configuration? Wouldn’t that be more systematic? Well, a complete grid search is usually not a feasible option, since there are too many hyperparameters. Furthermore, model selection in deep models is not just about choosing the number of layers, the number of hidden units, and a few other hyperparameters. Designing the architecture of a model also involves choosing the types of layers and the way they are arranged and connected to each other. So there are infinitely many ways one can design a network.

Designing a good model usually involves a lot of trial and error. It is still more of an art than a science, and people have their own ways of designing models. So the tricks and design patterns that I will be presenting in this video are mostly based on ‘folk wisdom’, my personal experience with designing models, and ideas that come from successful model architectures.

Back to our question: “how do you design a
neural network?” The short answer is: you don’t. The easiest thing you can do is pick something that has been proven to work for a similar problem and train it for your task. You don’t even have to train it from scratch. You can take a model that has already been trained on some large dataset and fine-tune the weights to adapt it to your problem. This is called transfer learning, and we will come back to it later. This approach works in many practical cases, but it isn’t applicable in all of them, especially if you are working on a novel problem or doing bleeding-edge research. Still, if you are working on a novel problem or the existing models don’t meet your needs, that doesn’t mean you need to reinvent the wheel. You can always borrow ideas from successful models to design your own. We will discuss some of these ideas in this video.

Let’s go through frequently asked questions about designing a convolutional neural network.

First question: how do you choose the number of layers and the number of units per layer? My experience is that beginning with a very small model and gradually increasing the model size usually works well. By increasing the model size, I mean adding layers and increasing the number of units per layer. You could also go the other way around and start with a big model and keep shrinking it. The problem with that is that it’s hard to decide how big you should start. If you want to start small, you always have a point zero: linear regression. That doesn’t mean you should always try linear regression first, especially when it’s obvious that the mapping between the inputs and the outputs is not linear. But overall, it’s usually more beneficial to start small and increase the model capacity until the validation error stops improving. Earlier, I made a separate video about how to choose model capacity. You can find it in the Deep Learning Crash Course playlist to learn more.

You might wonder, given the same number of trainable parameters, whether it’s better to have more layers or more units per layer. It’s usually better to go deeper than wider, so I would opt for a deeper model. However, a very tall and skinny network can be hard to optimize. One way to make training deep models easier is to add skip connections that connect non-consecutive layers. A well-known model architecture, called ResNet, uses blocks with this type of shortcut connection. Such connections give the subsequent layers a reference point, so that adding more layers won’t worsen the performance. Skip connections also create an additional path for the gradient to flow back more easily, which makes it easier to optimize the earlier layers.
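To make the ‘reference point’ idea concrete, here is a minimal sketch of a residual (skip) connection in NumPy. I’m using tiny fully connected layers for brevity, not ResNet’s actual convolutional blocks:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """A toy residual block: the input skips over two layers
    and is added back to their output (identity shortcut)."""
    h = relu(x @ W1)          # first transformation
    return relu(x + h @ W2)   # shortcut: output = f(x) + x

rng = np.random.default_rng(0)
x = relu(rng.standard_normal((4, 16)))   # batch of 4, 16 features

# With all-zero weights the block reduces to the identity (for non-negative
# inputs), which is the 'reference point': stacking such blocks cannot make
# the representation worse than the input they receive.
W_zero = np.zeros((16, 16))
out = residual_block(x, W_zero, W_zero)
assert np.allclose(out, x)
```

The gradient benefit comes from the same `x + ...` term: the addition passes gradients straight back to earlier layers, bypassing the transformations in between.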

Using skip connections is a common pattern in neural network design. Different models may use skip connections for different purposes. For example, fully convolutional networks use skip connections to combine information from deep and shallow layers to produce pixel-wise segmentation maps. A paper I published last year proposed using both types of skip connections to segment remotely sensed multispectral imagery. The skip connections on the left help recover the fine spatial information discarded by the coarse layers while preserving coarse structures. The skip connections on the right provide access to previous layer activations at each layer, making it possible to reuse features from previous layers.

Let’s move on to the second question: how do you decide on the size of the kernels in the convolutional layers? Short answer: 3×3 and 1×1 kernels usually work best. They might sound too small, but you can stack 3×3 kernels on top of each other to achieve a larger receptive field, as I mentioned in the previous video. How about 1×1 kernels? Isn’t a 1×1 filter just a scalar? First, a 1×1 filter isn’t really a 1×1 filter. The size of a kernel usually refers to its spatial dimensions, so a 1×1 filter is, in fact, a 1×1×N filter, where N is the number of input channels. You can think of 1×1 filters as channel-wise dense layers that learn cross-channel features. Obviously, they don’t learn spatial features, and stacking 1×1 filters alone wouldn’t increase the receptive field, but combined with 3×3 filters they can help build very efficient models. This pattern is at the heart of many convolutional neural network architectures, including Network in Network, the Inception family of models, and MobileNets.

One advantage of 1×1 convolutions is that they can be used for dimensionality reduction. For example, if the input volume is 32×32×256 and we use 64 1×1 filters, then the output volume will be 32×32×64. Doing so reduces the number of channels before the volume is fed into the next layer. Let’s say the output is fed into a 3×3 convolutional layer with 128 filters, and let’s count the operations needed to compute these convolutions. To compute the output of the 1×1 layer, we need a value for each of the 32×32×64 output pixels, and each value takes 1×1×256 operations, which is the size of the filter. Doing the same for the following 3×3 convolutional layer, the two layers sum up to roughly 92 million operations. Now, if we remove the 1×1 layer and recount, we end up with over 300 million operations. It may sound a little counterintuitive at first, but adding 1×1 convolutions to a network can greatly improve its computational efficiency.
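If you want to double-check those numbers, here is a quick NumPy sketch. A 1×1 convolution is just a per-pixel matrix multiply over the channel axis, and the operation counts are simple arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 256))   # input volume, H x W x C
w = rng.standard_normal((256, 64))       # 64 filters of size 1x1x256

# A 1x1 convolution mixes channels at each pixel independently.
reduced = np.tensordot(x, w, axes=([2], [0]))
print(reduced.shape)                     # (32, 32, 64)

# Multiplies for a conv layer: (number of output pixels) x (filter size).
with_bottleneck = (32 * 32 * 64) * (1 * 1 * 256) \
                + (32 * 32 * 128) * (3 * 3 * 64)    # 1x1 layer, then 3x3 layer
without_bottleneck = (32 * 32 * 128) * (3 * 3 * 256)  # 3x3 directly on 256 channels

print(with_bottleneck)                   # 92,274,688  (~92 million)
print(without_bottleneck)                # 301,989,888 (~302 million)
```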

Another use of pointwise convolutions is to implement a depthwise separable convolution, which reduces the number of parameters. The idea is simple: perform a spatial convolution on each channel of the input volume separately, then use a pointwise convolution to learn cross-channel features. Let’s take the previous example with the traditional convolutional layer. We had 128 units, each with 3×3×256 weights, where 256 is the number of channels in the input volume, so in total the layer had roughly 300,000 parameters. Alternatively, we could use 256 filters, each applied to only one channel, separately. The units in this first layer would then have 3×3×1 weights instead of 3×3×256, since each unit acts on only a single channel. Then, we can use a pointwise layer to learn cross-channel features and get the same output volume. This would lead to about 35,000 trainable parameters, spatial and pointwise layers combined.
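The parameter counts are easy to verify yourself (biases ignored):

```python
# Standard 3x3 convolution: 128 filters, each of size 3x3x256.
standard = 128 * (3 * 3 * 256)      # 294,912 parameters (~300k)

# Depthwise separable: one 3x3x1 filter per input channel,
# followed by 128 pointwise (1x1x256) filters.
depthwise = 256 * (3 * 3 * 1)       # 2,304 parameters
pointwise = 128 * (1 * 1 * 256)     # 32,768 parameters
separable = depthwise + pointwise   # 35,072 parameters (~35k)

print(standard, separable)          # roughly an 8x reduction
```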

This is the main idea behind the recently popularized MobileNet architecture. By stacking depthwise separable convolutional blocks, MobileNet manages to be very small and efficient without sacrificing too much accuracy.

Separable convolution is not a new concept. In image processing, for example, it’s common practice to separate a 2-dimensional filter into 1-dimensional row and column filters and apply them separately to reduce the computational cost. So we could take the depthwise separable convolution idea one step further and stack 1×3, 3×1, and 1×1 filters on top of each other to learn row-wise, column-wise, and cross-channel filters separately. Actually, I tried this several years ago, but it turns out that the savings from the spatially separable filters are not worth the accuracy that is sacrificed, since the filters are already small spatially. So it seems like depthwise-only filter separation is a good compromise.
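If you’re curious, the row/column separation from image processing is easy to verify numerically. This sketch uses a naive loop implementation of ‘valid’ cross-correlation and a rank-1 (Sobel) kernel:

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain 'valid' 2-D cross-correlation (no padding), loop version."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))

col = np.array([[1.0], [2.0], [1.0]])   # 3x1 column filter
row = np.array([[1.0, 0.0, -1.0]])      # 1x3 row filter
k2d = col @ row                         # rank-1 3x3 kernel (the Sobel x-kernel)

# Applying the row filter, then the column filter, equals one 3x3 convolution:
separable = conv2d_valid(conv2d_valid(img, row), col)
full = conv2d_valid(img, k2d)
assert np.allclose(separable, full)
```

The two 1-D passes take 6 multiplies per output pixel instead of 9 for the full 3×3 kernel, which is where the savings come from; as noted above, with filters this small the savings are modest.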

Next question: how do you choose the sliding-window step size, also known as the stride? Choose 1 if you want to preserve the spatial resolution of the activations; choose 2 if you want to downsample and don’t want to use pooling. If you want to upsample the activations, use a fractional stride such as 1/2, which is similar to a stride of 2 but with its input and output reversed. A convolution with a fractional stride is sometimes called a transposed convolution or a deconvolution, although the term ‘deconvolution’ is a little misleading from a mathematical perspective.

How about pooling parameters? Max pooling with ‘same’ padding and a pooling size of 2×2 usually works fine. If you want your model to handle variable-sized inputs and your output is fixed-size, you might consider pooling to a fixed size or using global average pooling. For example, if your inputs are images with different dimensions and your output is a single class label, you can take the mean of the activations before the fully connected layers.
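Global average pooling is a one-liner: whatever the spatial size of the input, the output length depends only on the number of channels.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average over the spatial dimensions (H, W), keeping channels:
    any H x W x C volume becomes a fixed-length C vector."""
    return feature_map.mean(axis=(0, 1))

rng = np.random.default_rng(0)
small = rng.standard_normal((24, 32, 128))   # two inputs with different sizes
large = rng.standard_normal((100, 77, 128))

print(global_average_pool(small).shape)      # (128,)
print(global_average_pool(large).shape)      # (128,) - same fixed-size output
```

Because both outputs have the same length, the same fully connected head works for either input size.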

How do you choose the type of activation function? Short answer: choose ReLU, except for the output layer. Long answer: check out my earlier video on artificial neural networks. It’s actually a short video, so I should have said ‘not-so-long answer’.

What type of regularization techniques should you use? Short answer: use L2 weight decay, and dropout between the fully connected layers if there are any. Not-so-long answer: check out my earlier video on regularization.

What should the batch size be? A batch size of 32 usually works fine for image recognition tasks. If the gradient is too noisy, you might try a bigger batch size. If you feel like the optimization gets stuck in local minima, or if you run out of memory, then a smaller batch size would work better.

These are the hyperparameters and design patterns that I can think of right now. The next video will be about transfer learning. Feel free to ask questions in the comments section, and subscribe to my channel for more videos if you like. As always, thanks for watching, stay tuned, and see you next time.