Processing Images In Neural Networks

Ayoola Olafenwa
9 min read · Oct 28, 2019


Humans view images in a plain and simple way: we can easily describe the content of what we see in an image. Our eyes process images very efficiently, so it is easy for us to recognize an object we already know. The way machines process images differs greatly; it is far more complicated for a machine to view an image and recognize its content. The principles behind processing images have created an important area in machine learning known as Computer Vision.

Computer vision is the ability of computers to recognize and describe what they see. The way a machine recognizes the patterns in an image to interpret its content differs from a human's perspective. Before we go into the intuition of how a machine recognizes these patterns, we must first ask ourselves this question: what is an image to a machine? An image is simply a tensor (an N-dimensional array) of pixel values, and each pixel value is an intensity of light. Based on their channels, we classify images into grayscale (black and white) images and RGB (coloured) images. Pixel values range from 0 to 255.

Grayscale images:
They are 2-dimensional images with different shades of gray between 0 (black) and 255 (white). They have a single channel.

RGB (Red, Green and Blue) Images:
RGB images are 3-dimensional images. They are made up of pixel values arranged in rows and columns, with 3 different channels of rows and columns stacked upon each other (a Red, a Green and a Blue channel). Every RGB image has the dimensions width * height * 3 channels. The channels are named after their colour spectrum, and each pixel corresponds to a specific light intensity: the red channel represents the red light spectrum, the green channel the green light spectrum, and the blue channel the blue light spectrum. Every coloured image is an RGB image. In this article we shall make use of RGB images.

Note: Red, green and blue are the primary colors from which other colors are produced.

Computers make use of the pixel values of an image to classify it and interpret its content. Our main aim is to show the techniques involved in designing an artificially intelligent model that guides a computer in processing an image, making use of its pixel values to successfully classify the image.

  • An overview of the classes of images in the cifar10 dataset.

Cifar10 dataset: This is the dataset we are going to use to explain the techniques involved in processing images in neural networks. The cifar10 dataset consists of 60,000 32 x 32 RGB images split into two sets: a train set and a test set. The train set consists of 50,000 images for training our model and the test set consists of 10,000 images for testing our model. There are ten classes of images in the cifar10 dataset:

  • airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

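A minimal sketch of loading the dataset with Keras, assuming a tf.keras setup and illustrative variable names (train_x, train_y, test_x, test_y):

```python
from tensorflow.keras.datasets import cifar10

# Downloads the cifar10 dataset (about 170 MB) on first call and returns the train/test split
(train_x, train_y), (test_x, test_y) = cifar10.load_data()

print(train_x.shape)  # (50000, 32, 32, 3)
print(test_x.shape)   # (10000, 32, 32, 3)
```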
The code written above will automatically download the cifar10 dataset. The cifar10 dataset is about 170 MB in size.

The next step after loading the images is to normalize the images in the dataset by dividing each pixel value by the maximum pixel value, 255. The purpose of this is to scale the pixel values of each image to the range 0 to 1. This is because neural networks do not handle images with large pixel values well; they are better at handling images with normalized pixel values.
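A sketch of the normalization step, assuming the train_x and test_x arrays from the loading sketch above:

```python
# Convert to float and scale pixel values from the range 0-255 to the range 0-1
train_x = train_x.astype("float32") / 255.0
test_x = test_x.astype("float32") / 255.0
```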

Neural networks cannot handle the categorical labels of the images in the cifar10 dataset directly. We have to convert the labels to vectors; keras' one hot encoding makes this possible.
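A sketch of the one hot encoding step using keras' to_categorical utility:

```python
from tensorflow.keras.utils import to_categorical

# Convert the integer class labels (0-9) into one-hot vectors of length 10
train_y = to_categorical(train_y, 10)
test_y = to_categorical(test_y, 10)
```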

ConvNets: Convolutional Neural Networks are the type of neural networks we are going to use to define the model trained on the cifar10 dataset. They are effective networks built from complex stacks of layers that are well suited to processing images. They work on the principle of extracting features from an image with the use of filters. Filters are feature detectors and extractors: they detect features in an image and extract them, and the greater the number of filters, the more features the network can extract. The features extracted from the images are used by the model to classify the images.

Keras Functional API: This is the API we are going to use to construct our model. In the functional API we shall define a module that contains a specific number of layers, and then define another function where we stack up as many of these modules as we wish. To know more about the functional API, visit keras functional API.

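A sketch of such a module with the functional API, following the layer ordering described below (BatchNormalization, then ReLU activation, then a convolution, repeated twice):

```python
from tensorflow.keras.layers import BatchNormalization, Activation, Conv2D

def module(x, filters):
    # First layer: normalize the feature maps, activate, then convolve
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(x)

    # Second layer: same pattern of BatchNormalization, ReLU and convolution
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(x)

    return x
```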
In the code above we defined a function named “module” and passed in the following parameters:
-x represents the input images
-filters represents the number of filters defined in each convolutional layer.

We have two layers in the module function, and the final output of the last layer is returned. Before we proceed further we should have a better understanding of BatchNormalization and the Activation function.

BatchNormalization: It is used to reduce the vanishing gradient problem in a network. The gradients of the loss function with respect to the parameters of the layers are used to update and adjust the internal weights of the layers in a network during training. The vanishing gradient problem arises as a network becomes deeper: as more layers are added, the gradients become smaller and smaller and the internal weights can no longer be adjusted effectively. The vanishing gradient problem reduces the effectiveness of the network. BatchNormalization is able to solve this problem by normalizing each batch of the feature maps of the input data to have a mean of zero and a standard deviation of 1, with respect to the statistics obtained per batch during training.

We pass the input (x), the images, into a BatchNormalization layer to normalize the feature maps of the input to have unit variance and zero mean with respect to the statistics obtained per batch during training. This prevents the gradients from becoming extremely small as more layers are added.

Activation Function: The output from the BatchNormalization layer is passed into the activation function. ReLU is the activation used: it passes positive values through unchanged and sets negative values to zero before the result goes into the convolution layer. ReLU is the most commonly used activation in the field of computer vision. There are other activations such as sigmoid and tanh.

Convolution layer: The convolution layer takes in many parameters; we shall look into them one by one:

-filters: I mentioned earlier that filters are feature detectors and extractors. The number of filters in a convolution layer is the number of features that the layer will detect and extract from an image: a convolution with 5 filters will detect and extract 5 features from an image.

  • kernel_size: A filter can also be called a kernel. The kernel size is the size of the filter, which can be 2 x 2, 3 x 3, 4 x 4 and so on, but 3 x 3 is the most common filter size used in convolutions.

Note: The kernel_size can be represented by a single integer such as kernel_size = 3 or a list of two integers such as kernel_size = [3,3]. In this article we make use of the list representation. Both representations are correct; it depends on which one you prefer.

  • strides: This is the number of steps that the filters in the convolution layer move over the pixels of an image. “strides = [1,1]” means that the filters move over the image pixels one at a time.

Note: The strides can be represented by a single integer such as strides = 1 or a list of two integers such as strides = [1,1]. In this article we make use of the list representation. Both representations are correct; it depends on which one you prefer.

-padding: It is set to “same” to ensure that the image maintains its original dimensions after convolutions are applied to it.

The return statement returns the final output of our module; this output serves as the input to the first layer of the next module.

Let us define our model by stacking up the module defined above.

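A sketch of this model, using the module function defined above. The 32-filter start, the pooling positions, the 0.2 dropout, the 8 x 8 average pooling and the 10-neuron softmax output follow the description below; the 64 and 128 filter counts for the later blocks are assumptions based on the note that the number of filters increases after each pooling layer:

```python
from tensorflow.keras.layers import Input, MaxPooling2D, AveragePooling2D, Dropout, Flatten, Dense
from tensorflow.keras.models import Model

def model_structure(input_shape):
    images = Input(shape=input_shape)

    # First block: three modules with 32 filters, then pooling and dropout
    x = module(images, 32)
    x = module(x, 32)
    x = module(x, 32)
    x = MaxPooling2D()(x)   # 32 x 32 -> 16 x 16
    x = Dropout(0.2)(x)

    # Second block: three modules with 64 filters, then pooling and dropout
    x = module(x, 64)
    x = module(x, 64)
    x = module(x, 64)
    x = MaxPooling2D()(x)   # 16 x 16 -> 8 x 8
    x = Dropout(0.2)(x)

    # Third block: three modules with 128 filters
    x = module(x, 128)
    x = module(x, 128)
    x = module(x, 128)

    # Average pooling over the remaining 8 x 8 feature maps reduces them to 1 x 1
    x = AveragePooling2D(pool_size=[8, 8])(x)

    # Flatten and classify into the 10 cifar10 classes
    x = Flatten()(x)
    outputs = Dense(10, activation="softmax")(x)

    model = Model(inputs=images, outputs=outputs)
    return model
```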
Our function model_structure takes in the parameter “input_shape”, which represents the input shape of the images.

We define the first layer in the first module with 32 filters; this means that this layer will be able to detect and extract 32 features from an image.
We proceed to stack up nine modules, with each module having two layers, so there are 18 layers in our model. After stacking three modules we apply MaxPooling2D.

MaxPooling2D: Pooling is a technique used to reduce the dimensions of an image. MaxPooling2D takes the maximum pixel value in each image region and discards the rest, which helps to pick out the most important features in an image. Because pooling reduces the dimensions of an image, we increase the number of filters in the subsequent layers.

Dropout: It is a technique used to reduce the problem of overfitting (a situation whereby our model achieves high accuracy on the training dataset but produces low accuracy on the test dataset). We randomly deactivate a fraction (0.2) of the activations to prevent co-adaptation of features that can result in overfitting of the model.

AveragePooling2D: We added two MaxPooling2D layers, and pooling reduces the dimensions of an image. The dimension of each image in the cifar10 dataset is 32 x 32. The first pooling layer reduces the image dimensions by half to 16 x 16, and the second pooling layer reduces them further by half to 8 x 8. We then add another pooling layer, AveragePooling2D, with a pool size of 8 x 8, which reduces each feature map to a unit size of (1, 1). After flattening and passing through the output Dense layer, the final output has 10 values; the number 10 corresponds to the number of neurons in the output Dense layer.

We must flatten our feature maps before we pass them into the fully connected layer, which is the Dense layer, because it does not support tensors with more than one dimension. The Dense layer has 10 neurons; the number of neurons corresponds to the number of classes in the cifar10 dataset. We make use of softmax as the activation in the Dense layer, and softmax outputs a score for each of the classes in the dataset. Finally we return our model for compiling and training.

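A sketch of building the model for the 32 x 32 RGB cifar10 images and inspecting it:

```python
# Build the model for 32 x 32 RGB images and print its layer summary
model = model_structure(input_shape=(32, 32, 3))
model.summary()
```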
model.summary() prints a summary of our model, listing each layer with its output shape and parameter count.

We define a learning rate function that specifies how our learning rate changes over specific ranges of epochs.
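A sketch of such a schedule using keras' LearningRateScheduler callback; the exact breakpoints and learning rate values here are illustrative:

```python
from tensorflow.keras.callbacks import LearningRateScheduler

def lr_schedule(epoch):
    # Start from a base learning rate and reduce it at later epochs
    lr = 0.001
    if epoch > 15:
        lr = 0.0005
    if epoch > 25:
        lr = 0.0001
    return lr

lr_scheduler = LearningRateScheduler(lr_schedule)
```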

We shall compile our model, making use of the Adam optimizer to optimize our model and reduce the loss. Our loss here is categorical_crossentropy because we are dealing with a classification problem that outputs categorical values. The metric will be “accuracy”, because it is a classification problem. To know more about the different kinds of optimizers, visit keras optimizers.
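A sketch of the compile step:

```python
from tensorflow.keras.optimizers import Adam

# Compile with the Adam optimizer, categorical crossentropy loss and the accuracy metric
model.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["accuracy"])
```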

Data Augmentation: This helps the model we have defined to be more effective in recognizing an image. It enables the model to recognize different representations of an image, so it can recognize an image whether it is flipped, rotated or otherwise transformed. Data augmentation is achieved with the ImageDataGenerator class. As part of this preprocessing we subtract the mean image from each image in the dataset and divide by the standard deviation.

The code below sets up the data augmentation:
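A sketch of the augmentation setup with ImageDataGenerator; the mean/std standardization follows the description above, while the flip and rotation settings are illustrative:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Centre each image with the dataset mean/std and apply random flips and rotations
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True,
                             horizontal_flip=True,
                             rotation_range=10)

# Compute the dataset mean and standard deviation used for the standardization
datagen.fit(train_x)
```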

The batch size is set to 32, and for the steps per epoch we divide the size of the train set by the batch size. The value obtained is the number of image batches that will be trained on in each epoch. The number of epochs is set to 30, i.e. we are going to iterate through the entire dataset 30 times.

We pass the train and test datasets through the datagen generator, which applies the data augmentation to the images.
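A sketch of the training step, feeding both the train and test images through the datagen generator and applying the learning rate schedule defined earlier:

```python
batch_size = 32
steps_per_epoch = len(train_x) // batch_size   # 50000 / 32 batches per epoch

# Train on the augmented images, validating on the test set
model.fit(datagen.flow(train_x, train_y, batch_size=batch_size),
          steps_per_epoch=steps_per_epoch,
          epochs=30,
          validation_data=datagen.flow(test_x, test_y, batch_size=batch_size),
          validation_steps=len(test_x) // batch_size,
          callbacks=[lr_scheduler])
```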

The model is evaluated on the test dataset and achieves an accuracy of 84% after 30 epochs. Higher accuracies of over 90% can be achieved by increasing the number of epochs to about 200 and adjusting the learning rate schedule accordingly.
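A sketch of the evaluation step; standardizing the test images with the same datagen statistics before evaluating is an assumption made here for consistency with the augmented training data:

```python
# Apply the same mean/std standardization to the test images before evaluating
test_images = datagen.standardize(test_x.copy())

loss, accuracy = model.evaluate(test_images, test_y)
print("Test accuracy:", accuracy)
```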

The full code for training combines the steps above: loading and normalizing the cifar10 dataset, defining the model, compiling it, applying data augmentation, and training with the learning rate schedule.
