Computers Are Watching.

Timucin Erbas
7 min read · Feb 9, 2021

Humans love automating things. We absolutely admire efficiency. Throughout all of history, we have wanted to make tasks easier and simpler… The truth is that we never settle for what we have; we always want it better. Before society existed at all, we wanted our tools 🪓 to be better and better. Then came the Industrial Revolution, which made manufacturing ⚙️ much faster, but no, we were not satisfied yet. We decided that adding numbers was boring and slow, so we made computers 💻. Today, we are trying to automate thinking like a human using AI.

As you might have noticed, AI is a very broad field. So I am here to tell you about a branch of AI that really fascinates me: computer vision, and Convolutional Neural Networks (CNNs) in particular.

What You Can Expect From This Article

After this article you will have the background knowledge to design convolutional neural networks… and maybe even invent something new 😉

Of course, your brain won’t suddenly double in size. Sadly, you won’t instantly become as good as Yann LeCun at CNNs, but understanding these concepts is an essential first step if you want to dive into computer vision.

What Exactly Is A CNN?

A CNN is extremely similar to any other Neural Network, except it uses different types of layers. As you know, Neural Networks are made of layers, but there is more than one type of layer (e.g. the Dense Layer). Convolutional Neural Networks use layer types that help with image processing.

Cute Kitty :)

Now, the black boxes in the image represent layers in the neural network. A convolutional neural network is like any other, just the types of the black boxes are different.

Enough with me blabbering on about different layer types, let’s actually get into what these layers are and what they do.

Images Are Matrices of Numbers

Just one more thing 😅 I promise! It might be confusing how we input an image to a neural network, but it’s actually quite simple. An image is a 3-dimensional array: it is made up of 3 grids (3 matrices) that record how much red, green, and blue there is in each pixel of the image.

Even though an image is technically 3-dimensional, I will be drawing it as a 2-D grid filled with numbers, since that is much easier to explain and understand.
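To make this concrete, here is a tiny NumPy sketch (with made-up values, just for illustration) of an image as a 3-dimensional array of color intensities:

```python
import numpy as np

# A hypothetical 4x4 RGB image: height x width x 3 color channels.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)  # (4, 4, 3)
print(image[0, 0])  # the 3 color values (R, G, B) of the top-left pixel
```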

Alright, now time to get into what you’re here for!

Layer #1: Convolving Layer

This layer passes a filter over the grid of the image, shrinking the image and, at the same time, potentially detecting edges.

The filter is a square grid with numbers inside of it. Since this is a grid just like the image itself, we can overlap it anywhere on the image, as long as it stays in-bounds. This is exactly what we do: we start at the top-left corner, move 1 step to the right, then another, and so on. Once we reach the end of the row, we go down to the beginning of the row beneath it.

The filter is like the yellow area in the image below.

The green grid would be the image itself, the yellow box going over it being the filter that we were talking about before.

You might be wondering what the convolved feature is (the red box on the right). When we place the filter on the grid (the image), it overlaps a 3-by-3 area (or whatever the dimensions of the filter are). For each box that is covered on the image, we multiply that number with the number in the corresponding position of the filter. We then sum up the resulting numbers to get a single number.

How we turn a grid of 9 numbers into 1 number
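The sliding-and-summing described above can be sketched in a few lines of NumPy. This is a naive, illustrative implementation (real deep learning libraries compute convolutions far more efficiently), with toy values I made up:

```python
import numpy as np

def convolve(image, kernel):
    """Slide `kernel` over `image`; each position yields one number."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Multiply the overlapping region element-wise, then sum:
            # a grid of 9 numbers becomes 1 number.
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)   # a toy 5x5 "image"
kernel = np.ones((3, 3))              # a toy 3x3 filter
print(convolve(image, kernel).shape)  # (3, 3) -- the image shrank
```

Notice the output is smaller than the input: a 3×3 filter over a 5×5 image leaves only 3×3 valid positions.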

Now, these filters have 2 benefits (as mentioned earlier).

One is the fact that we shrink the image. Think of it this way: if we had a 100 × 100 image and wanted to process it with a dense layer leading to a series of 100 neurons, that would result in 1 million parameters (10,000 pixels × 100 neurons). I wish your computer good luck handling that one.
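A quick back-of-the-envelope check of that parameter count (counting weights only, ignoring biases):

```python
# A 100x100 image flattened and fed into a dense layer of 100 neurons:
inputs = 100 * 100           # 10,000 pixel values
neurons = 100
weights = inputs * neurons   # one weight per (input, neuron) pair
print(weights)               # 1000000 -- a million parameters
```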

The second benefit is that we can detect edges in the image, provided we place the correct numbers in the filter that we pass through. One example is vertical edge detection:

As we can see, there are large numbers (30 in this case) near the spot corresponding to the edge after convolving. This is vertical edge detection. Even better, we can alter the contents of the filter to make it detect horizontal edges instead.
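Here is a toy sketch of that effect, with values I chose to echo the classic course example (bright 10s next to dark 0s): the convolution outputs 30 exactly where the vertical edge sits.

```python
import numpy as np

# Toy image: bright (10) on the left, dark (0) on the right,
# so there is a vertical edge down the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 4)

# A classic vertical edge detection filter.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Naive convolution: slide the 3x3 filter over the 4x6 image.
out = np.zeros((2, 4))
for y in range(2):
    for x in range(4):
        out[y, x] = np.sum(image[y:y+3, x:x+3] * kernel)

print(out)  # 30s appear exactly in the columns where the edge is
```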

Layer #2: Padding

This is a layer that enhances the capabilities of the previously mentioned Convolving Layer. Here’s the problem with only using the convolving layer:

From one perspective, the edges and especially the corners of the image do not get enough attention. When we pass a filter over the image, a pixel in the corners/edges of the image is considered far fewer times than one, say, in the middle.

Let’s take the red pixel in the above example. The convolution filter only “embraces” it 4 times in total as it passes over the image. Therefore, this pixel influences only 4 different filter positions.

However, when we look at a pixel near the center (like the blue one above), many more filter positions are influenced by it. In this case, 10 different positions are influenced by the blue pixel, far more than the red pixel’s 4.

But why does this matter? Because of this, the pixels near the middle have more of a “say” in what the final output gets to be. This can lead to the underrepresentation of important information that might show up at the corners of the image.

So how do we fix this? To be able to solve a problem we have to understand the root cause of it.

The reason there are fewer filter positions near the edges is that the filter cannot embrace the pixel from all positions without going out of bounds of the image.

As in the image above, the filter positions shown literally do not exist, since the filter would be out of bounds of the image. Since none of these positions exist, the red pixel does not influence them. The blue pixel does not have this problem, since it is near the middle of the image, far from the boundaries.

Padding fixes this problem by… well, creating a padding around the image made of zero-valued pixels that have no influence on the final result. It extends the borders, thereby making the previously “out of bound” positions valid. This increases the consideration given to the pixels on the original image’s borders, conveniently giving them attention much closer to that of the ones in the middle.

The blue pixels are the added padding. As you can see, the red box is now valid, even though it was previously out of bounds.
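In code, padding is just a border of zeros around the array. A minimal sketch using NumPy’s `np.pad`:

```python
import numpy as np

image = np.arange(16).reshape(4, 4)  # a toy 4x4 image

# Add a 1-pixel border of zeros around the image.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (4, 4)
print(padded.shape)  # (6, 6)
```

With 1 pixel of padding, a 3 × 3 filter over the 6 × 6 padded image produces a 4 × 4 output, the same size as the original, and the border pixels now appear in many more filter positions than before.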

Layer #3: Strided Convolutions

This layer is extremely similar to the first layer I described.

Remember when I mentioned convolutions going over every possible position in the image? This is the same thing, except the distance that the filter moves each time can be larger. In the first layer, the convolution went through each row pixel by pixel. With this layer, the filter still goes through every row, but it skips pixels. Maybe it skips 1, maybe 2, maybe 4… it can be anything.

The image on the left is what you are used to: the convolving filter moves 1 pixel at a time. The image on the right, however, skips pixels. To be exact, that is a 4-strided convolution, since the filter increases its x/y position by 4 each time it moves.
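A minimal sketch of a strided convolution, reusing the naive looping approach from earlier (illustrative only; `stride` is the number of pixels the filter jumps each step):

```python
import numpy as np

def convolve_strided(image, kernel, stride):
    """Slide `kernel` over `image`, moving `stride` pixels each step."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            iy, ix = y * stride, x * stride
            out[y, x] = np.sum(image[iy:iy+kh, ix:ix+kw] * kernel)
    return out

image = np.ones((8, 8))
kernel = np.ones((3, 3))
print(convolve_strided(image, kernel, stride=1).shape)  # (6, 6)
print(convolve_strided(image, kernel, stride=2).shape)  # (3, 3)
```

Notice how a larger stride shrinks the output faster, since the filter visits fewer positions.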

Conclusion

By using these three different layers, you can process any image. These layers are a masterpiece; they literally give the computer a sense of sight.

The End

I hope you enjoyed the article! If yes, then don’t forget to clap… and maybe even follow 👀 I put out content consistently, so if you’ve enjoyed this article I can promise you that there will be a new one coming up soon!

I would also like to mention that this article is a summary of the first week of Andrew Ng’s CNN course on Coursera; it really inspired me.

Hope to see you again, reader :) 👋

P.S. Don’t be afraid to say hi on other platforms!

LinkedIn | Twitter | Newsletter | YouTube
