Convolution Fundamental I

Foundations of CNNs

Learning to implement the foundational layers of CNN's (pooling,convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Computer vision

Computer vision is from the applications that are rapidly active thanks to deep learning

One of the applications of computer vision that are using deep learning includes:

Self driving cars

Face recognition

Deep learning also is making new arts to be created to in computer vision as we will see.

Rabid changes to computer vison are making new applications that weren't possible a few years ago.

Computer vison deep leraning techniques are always evolving making a new architectures which can help us in other areas other than computer vision

For example, Andrew Ng took some ideas of computer vision and applied it in speech recognition

Examples of a computer vision problems includes:

Image classification

Object detection

Detect object and localize them

Neural style transfer

Changes the style of an image using another image.

On of the challenges of computer vision problem that images can be so large and we want a fast and accurate algorithm to work with that.

For example,a 1000*1000 image will represent 3 million feature/input to the full connected neural network. If the following hidden layer contains 1000, Then we we will want to learn weight of the shape [1000,3 million] which is 3 billion parameter only in the first layer and that's so computationally expensive!

On of the solutions is to build this using convolution layers instead of the fully connected layers.

Edge detection example

The convolution operation is one of the fundamentals blocks of a CNN. One of the examples about convolution is the image edge detection operation.

Early layers of CNN might detect edges then the middle layers will detect parts of objects and the layers will put the these parts together to produce an output

In an image we can detect vertical edges, horizontal edges,or full edge detector

Vertical edge detection

An example of convolution operation to detect vertical edges:

In the last exmple a 6*6 matrix convolved with 3*3 filter/kernel gives us a 4*4 matrix

If you make the convolution operation in TensorFlow you will fin the function tf.nn.conv2d. In keras you will fin conv2d function.

The vertical edge detection filter will find a 3*3 place in an image where there are a bright region followed by a dark region

If we applied this filter to a white region followed by a dark region,it should find the edges in between the two colors as a positive value. But if we applied the same filter to a dark region followed by a white region it will give us negative values. To solve this we can use the abs function to make it positive.

Horizontal edge detection

Filter would be like this

There are a lot of ways we can put number inside the horizontal of vertical edge detections. For example here are the vertical Sobel filter(The idea is taking care of the middle row)

Also something called Scharr filter(The idea is taking great care of the middle row)

What we learned in the deep learning is that we don't need to hand craft these numbers, we can treat them as weights and then learn them. It can learn horizontal, vertical ,angled, or any edge type automatically ranther than getting them by hand.

Padding

In order to use deep neural networks we really need to use paddings

In the last section we saw that a 6*6 matrix convolved with 3*3 filter/kernel gives us a 4*4 matrix.

To give it a general rule, if a matrix n*n is convolved with f*f filter/kernel give us n-f+1,n-f+1 matrix.

The convolution operation shrinks the matrix if f>1

We want to apply convolution operation multiple times, but if the image shrinks we will lose a lot of data on this process. Also the edges pixels are uses less than other pixels in an image.

So the problems with convolutions are:

shrinks output

throwing away a lot of information that are in the edges.

To solve these problems we can pad the input image before convolution by adding some rows and columns to it. We will call the padding amount p the number of row/columns that we will insert in top, bottom , left and right of the image.

In almost all the cases the padding values are zeros

The general rule now, if a matrix n*n is convolved with f*f filter/kernel and padding p give us n+2p-f+1,n+2p-f+1 matrix

If n=6,f=3, and p=1 The n the output image will have n+2p-f+1=6+2-3+1=6. We maintain the size of the image.

Same convolutions is a convolution with a pad so that output size is the same as the input size. Its given by the equation:

In computer vision f is usually odd. Some of the reasons is that its have a center value.

Strided convolution

Strided convolution is another piece that are used in CNNs

We will call stride s

When we are making the convolution operation we used s to tell us the number of pixels we will jump when we are convolving filter/kernel. The last examples we described s was 1

Now the general rule are:

if a matrix n*n is convolved with f*f filter/kernel and padding p and stride s it give us (n+2p-f)/s+1, (n+2p-f)/s+1 matrix

In case (n+2p-f)/s+1 is fraction we can take floor of this value.

In math textbooks the conv operation is filpping the filter before using it. What we were doing is called cross-correlation operation but the state of art of deep learning is using this as conv operation.

Same convolutions is a convolution with a pad so that output size is the same as the input size. Its given by the equation:

Convolution over volumes

We see how convolution works with 2D images, now lets see if we want to convolve 3D iamges(RGB image)

We will convolve an image of height,width,# of channels with a filter of a height, width,same # of cahnnels. Hint hat the image number channels and the filter number of channels are the same.

We can call this as stacked filters for each channel!

Example

input image: 6*6*3

Filter:3*3*3

Result image: 4*4*1

In the last result p=0,s=1

Hint the output here is only 2D

We can use multiple filters to detect multiple features or edges. Example

Input image: 6*6*3

10 Filters: 3*3*3

Result image: 4*4*10

In the last result p=0,s=1

One Layer of a Convolutional Network

First we convolve some filters to a given input and then add a bias to each filter output and then get RELU of the result. Example:

Input iamge: 6*6*3 # a0

10 Filters: 3*3*3 #w1

Result image: 4*4*10 #w1a0

Add b(bias) with 10*1 will get us: 4*4*10 image #w1a0+b

Apply RELU will get us: 4*4*10 image #A1=RELU(w1a0+b)

In the last result p=0,s=1

Hint number of parameters here are; (3*3*3*10)+10=280

The last example forms a layer in the CNN

Hint that no matter how the size of the input, the number of the parameters for the same filter will still the same. That makes it less prune to overfitting

Here are some notation we will use. If layer l is a conv layer:

A simple convolution network example

Lets build a big example.

Input Image are: a0=39*39*3

n0=39 and nc0=3

First layer(Conv layer):

f1=3,s1=1,and p1=0

number of filters=10

Then output are a1=37*37*10

n1=37 and nc1-10

second layer(Conv layer):

f2=5,s2=2,p2=0

number of filters=20

The output are a2=17*17*20

n2=17,nc2=20

Hint shrinking goes much faster because the stride is 2

Third layer(Conv layer):

f3=5,s3=2,p2=0

number of filters=40

The output are a3=7*7*40

n3=7,nc3=40

Forth layer(Fully connected softmax)

a3=7*7*40=1960 as a vector.

In the last example you seen that the image are getting smaller after each layer and that's the tread now.

Typesof layer in a convolutional network:

Convolution. #Conv

Pooling #Pool

Fully connected #FC

Pooling layers

Other than the conv layers,CNNs often uses pooling layers to reduce the size of the inputs, speed up computation, and to make some of the features it detects more robust.

Max pooling example:

This example has f=2,s=2 and p=0 hyperparameters

The max pooling is saying, if the feature is detected anywhere in this filter then keep a high number. But the main reason why people are using pooling because its works well in practice and reduce computations.

Max pooling has no parameters to leran

Example of Max pooling on 3D input:

Input: 4*4*10

Max pooling size=2 and stride=2

output 2*2*10

Average pooling is taking the averages of the values instead of taking the max values

Max pooling is used more often than average pooling in practice.

If stride of pooling equals the size, it will then apply the effect of shrinking.

Hyperparameters summary

f: filter size

s: stride

Padding are rarely uses here

Max or average pooling

Convolutional neural network example

Now we will deal with a full CNN example. This example is something like the LeNet-5 that was invented by Yann Lecun

Input image are: a2=32*32*3

n0=32 and nc0=3

First layer(Conv layer): #Conv1

f1=5,s1=1,and p1=0

number of filters=6

Then output are a1=28*28*6

n1=28,and nc1=6

Then apply(Max pooling): #Pool1

f1p=2 and s1p=2

The output are a1=14*14*16

Second layer(Conv layer):#Conv2

f2=5,s2=1,p2=0

number of filters=16

The output are a2=10*10*16

n2=10,nc2=16

Then apply(Max pooling):#pool2

f1p=2,and s1p=2

The output are a2=5*5*16

Third layer(Fully connected) #FC3

Number of neurous are 120

The output a3=120*1, 400 came from 5*5*16

Forth layer(Full connected) #FC4

Number of neurons are 84

The output a4=84*1

Fifth layer(Softmax)

Number of neurons is 10 if we need to identify for example the 10 digits

Hint a Conv1 and Pool1 is treated as one layer

Some statistics about the last example:

Hyperparameters are a lot. For choosing the value of each you should follow the guideline that we will discuss later or check the literature and takes some ideas and numbers from it.

Usually the input size decrease over layers while the number of filters incerease

A CNN usually consists of one or more convolution(Not just one as the shown examples) folowed by a pooling.

Fully connected layers has the most parameters in the network

To consider using these bolocks together you should look at other working examples firsts to get some intuitions

Why convollutions?

Two main advantages of Convs are:

Parameter sharing.

A feature detector(such as a vertical edge detector) that's useful in one part of the

image is probably useful in another part of the image

sparsity of connection.

In each layer, each output value depends only on a small number of inputs which

makes it translation

Putting it all together

Deep convolutional models: case studies

Learn about the practical tricks and methods used in deep CNNs straight from the research paper.

Why look at case studies?

We learned about Conv layer, pooling layer, and fully connected layers. It turns out that computer vision researchers spent the past few years on how to put these layers together.

To get some intuitions you have to see the examples that has been made.

Some neural networks architecture that works well in some tasks can also work well in other tasks.

Here are some classical CNN networks:

LeNet-5

AlexNet

VGG

The best CNN architecture that won the last ImageNet competition is called ResNet and it has 152 layers!

There are also an architecture called Inception that was made by Google that are very useful and apply to your tasks.

Reading and trying the mentioned models can boost you and give you a lot of ideas to solve your task.

Classic networks

In this section we will talk about classic networks which are LeNet-5,AlexNet, and VGG

LeNet-5

The goal for this model was to identify handwritten digits in a 32*32*1 gray image. Here are the drawing of it:

This model was published in 1998. The last layer wasn't using softmax back then

It has 60K parameters.

The dimensions of the image decreases as the number of channel s increases.

ConvèPoolèConvèPoolèFCèFCèsoftmax this type of arrangement is quite common.

The activation function used in the paper was Sigmoid and Tanh. Modern implementation uses RELU in most of the cases.

[LeCun et al., 1998. Gradient-based learning applied to document recognition]

AlexNet

Named after Alex Krizhevsky who was the first author of this paper. The other authors includes Jeoffery Hinton.

The goal for the model was the ImageNet challenge which classifies images into 1000 classes. Here are the drawing of the model:

Summary:

ConvèMax-poolèConvèMax-poolèConvèConvèConvèMax-poolèFlattenèFCèFCèSoftmax

Similar to LeNet-5 but bigger.

Has 60 Million parameter compared to 60K parameter of LeNet-5

It used the RELU activation function.

The original paper contains Multiple GPUs and Local Response normalization(RN)

Multiple GPUs was used because the GPUs was so fast back then.

Researchers proved that Local Response normalization doesn't help much so far now

don't bother yoursef for understanding or implementing it.

This paper convinced the computer vision researchers that deep learning is so important.

VGG-16

A modification for AlexNet.

Instead of having a lot of hyperparameters lets have some simpler network.

Focus on having only these blocks:

CONV=3*3 filter, s=1, same

MAX-POOL=2*2,s=2

Here are the architecture:

This network is large even by modern standards. It has around 138 million parameters.

Most of the paramters are in the fully connected layers.

It has a total memory of 96MB per image for only forward propagation!

Most memory are in the earlier layers.

Number of filters increases from 64 to 128 to 256 to 512, 512 was made twich.

Pooling was the only one who is responsible for shrinking the dimensions.

There are another version called VGG-19 which is bigger version. But most people used the VGG-16 instead of the VGG-19 because it does the same.

VGG paper is attractive it tries to make some rules regarding using CNNs

Special Netwroks

Residual Networks(ResNets)

Very, very deep NNs are difficult to train because of vanishing and exploding gradients problems.

In this section we will learn about skip connection which makes you take the activation from one layer and suddenly feed it to another layer even much deeper in NN which allows you to train large NNs even with layers greater than 100.

Residual block

ResNets are built out of some Residual blocks.

They add a shorcut/skip connection before the second activation.

The authors of this block find that you can train a deeper NNs using stacking this block.

Residual Network

Are a NN that consists of some Residual blocks.

These networks can go deeper without hurting the performance. In the normal NN –Plain networks- the theory tell us that if we go deeperwe will get a better solution to our problem. but because of the vanishing and exploding gradients problems the performance of the network suffers as it goes deeper. Thanks to Residual Network we can go deeper as we want now.

On the left is the normal NN and on the right are the ResNet. As you can see the performance of RestNet increases as the nwtwork goes deeper.

In some cases going deeper won't effect the performance and that depends on the problem on your hand.

Some people are trying to train 1000 layer now which isn't used in practice

Why ResNets work

Lets see some example that illustrates why resNet work.

We have a big NN as the following:

XàBig NNàa[l]

Lets add two layers to this network as a residual block:

XàBig NNàa[l]àLayer1àLayer2àa[l+2]

And a[l] has a direct connection to a[a+2]

Suppose we are using RELU activations:

Then:

a[l+2]=g(z[l+2]+a[l])=g(w[l+2]a[l+1]+b[l+2]+a[l])

Then if we are using L2 regularization for example, w[l+2] will be zero. Lets say that b[l+2] will be zero too.

Then a[l+2]=g(a[l])=a[l] with no negative values.

This show that identity function is easy for a residual block to learn. And that why if can train deeper NNs.

Also that the two layers we added doesn't hurt the performance of big NN we made.

Hint: dimensions of z[l+2] and a[l] have to be the same in resNets. In case they have different dimension what we put a matrix parameters(Which can be learned or fixed)

a[l+2]=g(z[l+2]+ws*a[l]) #The added Ws should make the dimentions equal

ws also can be a zero padding

Using a skip-connection helps the gradient to backpropagate and thus heps you train deeper networks

Lets take a look at ResNet on images.

Here are the architecture of ResNet-34:

All the 3*3 Conv are same Convs

Keep it simple in design of the network

spatial size/2è #filters*2

No FC layers,No dropout is used

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same of different. You are going to implement both of them.

The dotted lines is the case when the dimensions are different. To solve then they down sample the input by 2 and then pad zeros to match the two dimensions. There's another trick which is called bottleneck which we will explore later.

Useful concept(Spectrum of Depth)

Residual blocks types:

Identity block:

Hint the conv is followed by a batch norm BN before RELU. Dimensions here are the same.

This skip is over 2 layers. The skip connection can jump n connection where n>2

The convolutional block:

The conv can be bottleneck 1*1 conv

Network in Network and 1*1 convolutions

A 1*1 convolution- we also call it Network in Network- is so useful in many CNN models.

What does a 1*1 convolution do? Isn't it just multiplying by a number?

Let's first consider an example:

Input: 6*6*1

Conv:1*1*1 one filter. # The 1*1 Conv

Output: 6*6*1

Another example:

Input:6*6*32

Conv:1*1*32 5 filters. # The 1*1 Conv

Output: 6*6*5

It has been used in a lot of modern CNN implementations likes ResNet and Inception models.

A 1*1 convolution is sueful when:

we want to shrink the number of channels. We also call this feature transformation.

In the second discussed example above we have shrieked the input from 32 to 5

We will later see that by shrinking it can save a lot of computations

If we have specified the number of 1*1 Conv filters to be the same as the input number

of channels then the output will contain the same number of channels. Then 1*1 Conv will act like a non linearity and will learn non linearity operator.

Replace fully connected layers with 1*1 convolutions as Yann LeCun believes they are the same.

In Convolutional Nets, there is no such thing as "fully-connected layers", There are only

convolution layers with 1*1 convolution kernel and a full connection table.

Inception network motivation

When you design a CNN you have to decide all the layers yourself. Will you pick a 3*3 Conv or 5*5 Conv or maybe a max pooling layer. You have so many choices.

What inception tells us is, Why not use all of them at once?

Inception module, naïve version:

Hint that max-pool are same here.

Input to the inception module are 28*28*192 and the output are 28*28*256

We have done all the Convs and pools we might want and will let the NN learn and decide which it want to use most.

The problem of computational cost in Inception model:

If we have just focused on a 5*5 Conv that we have done in the last example.

There are 32 same filters of 5*5, and the input are 28*28*192

output should be 28*28*32

The total number of multiples needed here are:

Number of output*Filter size*Filter size*Input dimensions

Which equals: 28*28*32*5*5*192=120Mil

120Mil multiply operation still a problem in the modern day computers.

Using a 1*1 convolution we can reduce 120mil to just 12 mil. Lets see how.

Using 1*1 convolution to reduce computational cost:

The new architecture are:

X0 shape is (28,28,192)

We then apply 16(1*1 Convolution)

That produces X1 of shape (28,28,16)

Hint, we have reduced the dimensions here.

Then apply 32(5*5 Convolution)

That produces X2 of shape(28,28,32)

Now lets calculate the number of multiplications:

For the first Conv: 28*28*16*1*1*192=2.5Mil

For the second Conv: 28*28*32*5*5*16=10Mil

So the total number are 12.5Mil approx. Which is so good compared to 120Mil

A 1*1 Conv here is called Bottlenect BN

It turns out that the 1*1 Conv won't hurt the performance.

Inception module,dimensions reduction version:

Example of inception model in Keras:

Inception network(GoodNet)

The inception network consist of concatenated blocks of the Inception module.

The name inception was taken from a name image which was taken from Inception movie

There are the full model:

Some times a Max-pool block is used before the inception module to reduce the dimensions of the inputs.

There are a 3 Sofmax branches at different positions to push the network toward its goal. and helps to ensure that the intermediate features are good enough to the network to learn and it turns out that softmax0 and softmax1 gives regularization effect.

Since the development of the inception module, the authors and the others have built another versions of this network. Like inception v2,v3 and v4. Also there are a network that has used the inception module and the ResNet together.