How to Design a Convolutional Neural Network | Lecture 8

One of the questions that I get frequently is ‘how do you design a neural network?’ or, more specifically, ‘how do you know how many layers you need?’ or ‘how do you know the right value for a particular hyperparameter?’. First things first: if you are not familiar with convolutional neural networks, you can find the link to my introductory video in the description below.

When people say the best thing about deep learning is that it requires no hand-designed feature extractors, that everything is learned from data, and that there is almost no human intervention, that’s not entirely true. Indeed, the features are learned from data, and that’s great: a hierarchy of learned features can provide great representational power. But there is still a lot of human intervention in model design, although there have been some efforts to automate the model selection process. I think we will eventually get to a point where no human intervention is required, but it seems like humans will stay in the loop for a while.

You might ask: why not just do a grid search over all hyperparameters and automatically pick the best configuration? Wouldn’t that be more systematic? Well, a complete grid search is usually not feasible, since there are too many hyperparameters. Furthermore, model selection in deep models is not just about choosing the number of layers, the number of hidden units, and a few other hyperparameters. Designing the architecture of a model also involves choosing the types of layers and the way they are arranged and connected to each other, so there are infinitely many ways to design a network.

Designing a good model usually involves a lot of trial and error. It is still more of an art than a science, and people have their own ways of designing models. So the tricks and design patterns that I will present in this video are mostly based on ‘folk wisdom’, my personal experience with designing models, and ideas borrowed from successful model architectures.
Back to our question: how do you design a neural network? The short answer is: you don’t. The easiest thing you can do is pick something that has been proven to work for a similar problem and train it for your task. You don’t even have to train it from scratch: you can take a model that has already been trained on some large dataset and fine-tune its weights to adapt it to your problem. This is called transfer learning, and we will come back to it later.
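As a quick preview, here is a minimal sketch of what that can look like in PyTorch, assuming a torchvision ResNet-18 pretrained on ImageNet; the 10-class output layer is a placeholder for whatever your own task needs:

```python
import torch.nn as nn
from torchvision import models

# Load a model whose weights were already trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Optionally freeze the pretrained layers so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match the new task
# (10 classes here is just an assumption for illustration).
model.fc = nn.Linear(model.fc.in_features, 10)
```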
This approach works in many practical cases, but it is not applicable in all of them, especially if you are working on a novel problem or doing bleeding-edge research. Even if you are working on a novel problem, or the existing models don’t meet your needs, that doesn’t mean you need to reinvent the wheel: you can always borrow ideas from successful models to design your own. We will discuss some of these ideas in this video.

Let’s go through frequently asked questions about designing a convolutional neural network.
First question: how do you choose the number of layers and the number of units per layer? My experience is that starting with a very small model and gradually increasing the model size usually works well. By increasing the model size, I mean adding layers and increasing the number of units per layer. You could also go the other way around and start with a big model and keep shrinking it, but then it is hard to decide how big to start. If you want to start small, you always have a point zero: linear regression. That doesn’t mean you should always try linear regression first, especially when it is obvious that there is no linear mapping between the inputs and the outputs. But overall, it is usually more beneficial to start small and increase the model capacity until the validation error stops improving. Earlier, I made a separate video about how to choose model capacity; you can find it in the Deep Learning Crash Course playlist to learn more.

You might wonder whether, given the same number of trainable parameters, it is better to have more layers or more units per layer. It is usually better to go deeper than wider, so I would opt for a deeper model. However, a very tall and skinny network can be hard to optimize. One way to make training deep models easier is to add skip connections that connect non-consecutive layers. A well-known model architecture called ResNet uses blocks with this type of shortcut connection. Such connections give the following layers a reference point, so that adding more layers won’t worsen the performance. Skip connections also create an additional path for the gradient to flow back more easily, which makes it easier to optimize the earlier layers.
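To make this concrete, here is a minimal sketch of a ResNet-style residual block in PyTorch; the layer sizes are illustrative rather than the exact ResNet configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions plus a shortcut: output = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection adds the input back

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # same shape as the input: (1, 64, 32, 32)
```

Because each block only has to learn a residual on top of the identity, stacking more of them tends not to hurt performance the way stacking plain layers can.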
Using skip connections is a common pattern in neural network design, and different models use them for different purposes. For example, fully convolutional networks use skip connections to combine information from deep and shallow layers to produce pixel-wise segmentation maps. A paper that I published last year proposed using both types of skip connections to segment remotely sensed multispectral imagery: the skip connections on the left help recover fine spatial information discarded by the coarse layers while preserving coarse structures, and the skip connections on the right provide access to the activations of previous layers at each layer, making it possible to reuse features from previous layers.
Let’s move on to the second question: how do you decide on the size of the kernels in the convolutional layers? Short answer: 3×3 and 1×1 kernels usually work best. They might sound too small, but you can stack 3×3 kernels on top of each other to achieve a larger receptive field, as I mentioned in the previous video.
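To see why stacking pays off, note that two stacked 3×3 convolutions cover the same 5×5 input neighborhood as a single 5×5 convolution, but with fewer weights and an extra nonlinearity in between. A quick back-of-the-envelope check, assuming C input and output channels:

```python
C = 64  # an illustrative channel count

# A single 5x5 convolution: each output sees a 5x5 neighborhood directly.
params_one_5x5 = 5 * 5 * C * C        # 102,400 weights

# Two stacked 3x3 convolutions: the second layer's 3x3 window over the
# first layer's outputs also covers a 5x5 neighborhood of the input.
params_two_3x3 = 2 * (3 * 3 * C * C)  # 73,728 weights

print(params_one_5x5, params_two_3x3)
```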
How about 1×1 kernels? Isn’t a 1×1 filter just a scalar? First, a 1×1 filter isn’t really a 1×1 filter: the size of a kernel usually refers to its spatial dimensions, so a 1×1 filter is, in fact, a 1x1xN filter, where N is the number of input channels. You can think of 1×1 filters as channel-wise dense layers that learn cross-channel features. Obviously, they don’t learn spatial features, and stacking 1×1 filters alone wouldn’t increase the receptive field, but combined with 3×3 filters they can help build very efficient models. This pattern is at the heart of many convolutional neural network architectures, including Network in Network, the Inception family of models, and MobileNets.
One advantage of 1×1 convolutions is that they can be used for dimensionality reduction. For example, if the input volume is 32x32x256 and we use 64 1×1 units, the output volume will be 32x32x64, which reduces the number of channels before the volume is fed into the next layer. Let’s say the output is fed into a 3×3 convolutional layer with 128 filters, and let’s count the operations needed to compute these convolutions. To compute the output of the 1×1 layer, we need a value for each of the 32x32x64 output pixels, and each value takes 1x1x256 operations, the size of the filter. Doing the same for the following 3×3 convolutional layer, the two layers together sum up to roughly 92 million operations. Now, if we remove the 1×1 layer and count again, we end up with over 300 million operations. It may sound a little counterintuitive at first, but adding 1×1 convolutions to a network can greatly improve its computational efficiency.
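Here is the arithmetic behind those numbers, counting one multiplication per filter weight per output pixel and ignoring biases:

```python
# With the 1x1 bottleneck:
# 32x32x256 -> 64 1x1 filters -> 32x32x64 -> 128 3x3 filters -> 32x32x128
ops_1x1 = 32 * 32 * 64 * (1 * 1 * 256)    # ~16.8 million
ops_3x3 = 32 * 32 * 128 * (3 * 3 * 64)    # ~75.5 million
print(ops_1x1 + ops_3x3)                  # 92,274,688 -> roughly 92 million

# Without the bottleneck:
# 32x32x256 -> 128 3x3 filters -> 32x32x128
print(32 * 32 * 128 * (3 * 3 * 256))      # 301,989,888 -> over 300 million
```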
Another use of pointwise convolutions is to implement a depthwise separable convolution, which reduces the number of parameters. The idea is simple: perform a spatial convolution on each channel of the input volume separately, then use a pointwise convolution to learn cross-channel features. Let’s take the previous example with the traditional convolutional layer. We had 128 units, each with 3x3x256 parameters, where 256 is the number of channels in the input volume, so in total the layer had roughly 300,000 parameters. Alternatively, we could use 256 filters, each applied to a single channel separately. The units in this first layer would then have 3x3x1 parameters instead of 3x3x256, since each unit acts on only one channel. We can then use a pointwise layer to learn cross-channel features and get an output volume of the same size. This leads to about 35,000 trainable parameters, spatial and pointwise layers combined.
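Here is a minimal sketch of that comparison in PyTorch; the `groups` argument is what makes the first convolution depthwise, applying one 3×3 filter per input channel:

```python
import torch.nn as nn

# The traditional layer: 128 filters of size 3x3x256.
standard = nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable: one 3x3 filter per channel, then a pointwise layer.
depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1,
                      groups=256, bias=False)
pointwise = nn.Conv2d(256, 128, kernel_size=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 294,912 -> roughly 300,000
print(count(depthwise) + count(pointwise))  # 2,304 + 32,768 = 35,072 -> ~35,000
```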
This is the main idea behind the recently popularized MobileNet architecture. By stacking depthwise separable convolutional blocks, MobileNet manages to be very small and efficient without sacrificing too much accuracy.

Separable convolution is not a new concept. In image processing, for example, it is common practice to separate a 2-dimensional filter into 1-dimensional row and column filters and apply them separately to reduce the computational cost. So we could take the depthwise separable convolution idea one step further and stack 1×3, 3×1, and 1×1 filters on top of each other to learn row-wise, column-wise, and cross-channel features. Actually, I tried this several years ago, but it turned out that the savings from the spatially separable filters were not worth the accuracy that was sacrificed, since the filters are already small spatially. So depthwise-only filter separation seems to be a good compromise.
Next question: how do you choose the sliding-window step size, also known as the stride? Choose 1 if you want to preserve the spatial resolution of the activations, and choose 2 if you want to downsample without using pooling. If you want to upsample the activations, use a fractional stride such as 1/2, which is similar to a stride of 2 but with its input and output reversed. A convolution with a fractional stride is sometimes called a transposed convolution or a deconvolution, although the term ‘deconvolution’ is a little misleading from a mathematical perspective.
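A quick sketch of the three cases in PyTorch, with arbitrary channel counts; note that the upsampling case uses `ConvTranspose2d`, which is PyTorch’s name for the fractionally strided convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

same = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
up   = nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2,
                          padding=1, output_padding=1)

print(same(x).shape)  # (1, 64, 32, 32): spatial resolution preserved
print(down(x).shape)  # (1, 64, 16, 16): downsampled without pooling
print(up(x).shape)    # (1, 64, 64, 64): upsampled, like a stride of 1/2
```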
How about the pooling parameters? Max pooling with ‘same’ padding and a pooling size of 2×2 usually works fine. If you want your model to handle variable-sized inputs while producing a fixed-size output, you might consider pooling to a fixed size or using global average pooling. For example, if your inputs are images with different dimensions and your output is a single class label, you can take the mean of the activations before the fully connected layers.
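For instance, here is a minimal sketch of a global-average-pooling head in PyTorch; `AdaptiveAvgPool2d(1)` averages each channel down to 1×1, so the classifier sees the same vector length no matter the input resolution (256 channels and 10 classes are assumptions here):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # mean over the spatial dimensions -> 256x1x1
    nn.Flatten(),             # -> a fixed-length vector of 256 features
    nn.Linear(256, 10),       # class scores
)

print(head(torch.randn(1, 256, 13, 17)).shape)  # (1, 10)
print(head(torch.randn(1, 256, 64, 48)).shape)  # (1, 10): same output size
```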
How do you choose the type of activation function? Short answer: choose ReLU, except for the output layer. Long answer: check out my earlier video on artificial neural networks. It’s actually a short video, so I should have said ‘not-so-long answer’.

What type of regularization techniques should you use? Short answer: use L2 weight decay, and dropout between the fully connected layers if there are any. Not-so-long answer: check out my earlier video on regularization.
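In PyTorch terms, both of these are one-liners; the dropout rate and decay coefficient below are common defaults rather than prescriptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(4096, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout between the fully connected layers
    nn.Linear(512, 10),
)

# L2 weight decay is applied through the optimizer.
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
```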
What should the batch size be? A batch size of 32 usually works fine for image recognition tasks. If the gradient is too noisy, you might try a bigger batch size; if the optimization seems to get stuck in local minima, or if you run out of memory, a smaller batch size might work better.
These are the hyperparameters and design patterns that I can think of right now. The next video will be about transfer learning. Feel free to ask questions in the comments section, and subscribe to my channel for more videos if you like. As always, thanks for watching, stay tuned, and see you next time.
