Welcome back to the deep learning example to build an OCR application. The idea of this simple application is to identify numbers in an image of written text. In the last part, we used three different models and got the following accuracy for identification of the test images:
- Model 1 – Logistic regression: 92% accuracy
- Model 2 – Random forest: 96.8% accuracy
- Model 3 – Deep learning network (fully connected): 97.9% accuracy
In this part, we will raise the accuracy further to 99.1% by using convolutional neural networks (CNNs). But before that, let’s visit a beach and sunbathe a bit.
Convolution Neural Networks and Polarized Glasses
Beaches are great places to vacation, but don’t forget your polarized glasses when visiting one. Polarized glasses cut the bright sunlight roughly in half while still allowing you to see the beautiful views properly. Essentially, you are seeing the same thing, possibly better, with half the information. The science behind polarized glasses is simple: the sun emits unpolarized light, i.e. electromagnetic waves oscillating in all directions. The polarized lens, as in this diagram, blocks or filters out all the components of the sunlight except the vertical component.
In theory, a polarized lens should block exactly 50% of the visible light. In reality, however, lenses usually block anywhere between 42% and 47% of visible light. Moreover, physical limitations allow just some specific shades or filters for polarized glasses.
Convolutional neural networks work similarly to polarized lenses: they reduce the information in an image without losing its meaning. Moreover, convolutional neural networks don’t have the physical limitations of polarized lenses, hence you can create virtually infinite types of filters.
This polarized-lens kind of operation is performed in two steps in convolutional neural networks. These steps are called convolution and max-pooling. These two steps are applied multiple times to reduce the information load of an image without losing its meaning. Let’s see how these two steps work.
Convolution filters and Layers
In the previous article on deep learning for OCR, we learned that an image is nothing but a matrix of numbers. For instance, this logo of YOU CANalytics has three matrices with 120 rows and 607 columns each. The three matrices are for the red, green, and blue channels (RGB). Essentially, this image is nothing but an arrangement of 218,520 numbers (120 x 607 x 3).
Now, the polarized lens of a convolutional neural network, i.e. a convolutional filter, is also a matrix. Here, these 3 x 3 matrices correspond to the image filters shown below. These 3 x 3 filters, when they traverse the original image, produce effects such as edge detection, sharpening, or blurring. Here, you can see an example of a 3 x 3 filter traversing a 5 x 5 image matrix to produce a filtered image. Note that, unlike these predefined filter matrices, a convolutional neural network learns each element of its filter matrices on its own.
A convolved feature usually has smaller dimensions than the original image, i.e. a 5 x 5 image became 3 x 3 here. In practice, images are often padded to produce convolved features with the same dimensions as the original image, i.e. a 5 x 5 image will produce a 5 x 5 convolved feature. The padding essentially adds a border of zeros around the 5 x 5 image to make it a 7 x 7 image, and the convolved image will then have 5 x 5 dimensions.
Now, let’s run the edge detection filter across the YOU CANalytics logo and see what happens. You can find the Python code used for this section here: Convolution and MaxPooling using OpenCV (Jupyter Notebook).
The edges are enhanced in the output while the rest of the image is muted, i.e. it became black. The output matrix is still the same size as the input (120 x 607 x 3).
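The shrinking and padding behaviour described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code from the linked notebook: the `convolve2d` helper, the toy 5 x 5 image, and the averaging (blur) filter are all assumptions made for the example. As in most CNN libraries, the filter is slid over the image without being flipped.

```python
import numpy as np

def convolve2d(image, kernel, pad=0):
    """Slide `kernel` over `image` (stride 1), optionally zero-padding the border."""
    if pad:
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter and the patch under it, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5 x 5 "image" (made up)
kernel = np.ones((3, 3)) / 9.0                    # simple averaging (blur) filter

valid = convolve2d(image, kernel)        # no padding: 5 x 5 -> 3 x 3
same = convolve2d(image, kernel, pad=1)  # one ring of zeros: 5 x 5 -> 7 x 7 -> 5 x 5

print(valid.shape)  # (3, 3)
print(same.shape)   # (5, 5)
```

The centre 3 x 3 block of the padded result equals the unpadded result; only the border values are affected by the added zeros.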
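To see why flat regions turn black while edges light up, here is a small NumPy sketch on a synthetic image. The 3 x 3 kernel below is a commonly used edge-detection filter and is an assumption here; the exact filter and the OpenCV calls used in the linked notebook may differ.

```python
import numpy as np

# Assumed edge-detection filter: compares each pixel with its eight neighbours.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

# Synthetic single-channel image: a bright 4 x 4 square on a dark 8 x 8 background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 255.0

padded = np.pad(image, 1, mode="constant")  # zero padding keeps the output 8 x 8
edges = np.zeros_like(image)
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        edges[i, j] = np.sum(padded[i:i + 3, j:j + 3] * edge_kernel)

# Clip to the displayable 0-255 range, as an image viewer would.
edges = np.clip(edges, 0, 255)

print(edges.shape)      # same size as the input: (8, 8)
print(edges[3:5, 3:5])  # interior of the square: flat region, response is 0 (black)
print(edges[2, 2:6])    # top border of the square: strong response (white)
```

Inside a flat region the centre weight of 8 cancels exactly against the eight neighbours, which is why everything except the edges goes to zero.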
Remember, our goal for this entire operation is to reduce the information load without losing the meaning in the image. After applying the convolutional layer with padding, we got a matrix with the same dimension as the original image hence we have not reduced the information. The max-pooling layer in the next segment will do the job of reducing the information.
This video gives a glimpse into the intuition behind the max-pooling operation. Please note, this is not exactly how max-pooling works.
Here, a smaller dimension image preserves the information within the larger dimension image. Max-pooling is a very simple matrix operation where you take the maximum value of the smaller blocks (say 2 x 2) of a large convolved feature matrix.
Max-pool, as you must have noticed with the 2 x 2 block, halves both the height and the width of its input, so the output has one-fourth as many values as the original. Now, let’s notice what happens to the YOU CANalytics logo when you apply the max-pool operator.
The original 120 x 607 x 3 image is reduced to one-fourth its size, i.e. 60 x 303 x 3. In a convolutional neural network, however, the max-pool operator is almost always applied after a convolutional layer. Hence, the process of reduction of the original image for a CNN-like operation will look something like this. Notably, the text and graph within the image are highlighted after these two operations. Now with this knowledge, you are ready to develop your convolutional neural network to improve the accuracy of the OCR engine to over 99%.
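The max-pool operation itself is short enough to sketch in NumPy. The helper `max_pool_2x2` and the sample numbers are made up for illustration, and the sketch assumes non-overlapping 2 x 2 blocks (stride 2) with any odd trailing row or column dropped, which is what turns 607 columns into 303.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max-pooling with stride 2 on an (H, W, C) array.

    An odd trailing row or column is dropped.
    """
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    # Group pixels into non-overlapping 2 x 2 blocks, then take each block's maximum.
    blocks = x[:h2 * 2, :w2 * 2, :].reshape(h2, 2, w2, 2, c)
    return blocks.max(axis=(1, 3))

# A made-up 4 x 4 single-channel convolved feature.
feature = np.array([[1, 3, 2, 0],
                    [4, 2, 1, 5],
                    [0, 1, 7, 2],
                    [6, 2, 3, 4]], dtype=float)[:, :, np.newaxis]
pooled = max_pool_2x2(feature)
print(pooled[:, :, 0])
# [[4. 5.]
#  [6. 7.]]

# Random stand-in with the same shape as the logo image.
logo_like = np.random.rand(120, 607, 3)
print(max_pool_2x2(logo_like).shape)  # (60, 303, 3)
```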
Convolutional Neural Network (CNN) for OCR
The CNN architecture we will use to improve the accuracy of our OCR application has two max-pooling layers sandwiched between three convolutional layers. The output of the last convolutional layer is flattened, as in the last part of this series, to feed into the fully connected network. This is the Python code used to train the CNN: Convolution Neural Network – Python Code (Jupyter Notebook). This is the network summary with the number of parameters (weights) learned in each layer.
Layer (type)                     Output Shape          Param #
=================================================================
conv2d_1 (Conv2D)                (None, 26, 26, 64)    640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2     (None, 13, 13, 64)    0
_________________________________________________________________
conv2d_2 (Conv2D)                (None, 11, 11, 64)    36928
_________________________________________________________________
max_pooling2d_2 (MaxPooling2     (None, 5, 5, 64)      0
_________________________________________________________________
conv2d_3 (Conv2D)                (None, 3, 3, 32)      18464
_________________________________________________________________
flatten_1 (Flatten)              (None, 288)           0
_________________________________________________________________
dense_1 (Dense)                  (None, 512)           147968
_________________________________________________________________
dropout_1 (Dropout)              (None, 512)           0
_________________________________________________________________
dense_2 (Dense)                  (None, 64)            32832
_________________________________________________________________
dense_3 (Dense)                  (None, 10)            650
=================================================================
Total params: 237,482
Trainable params: 237,482
Non-trainable params: 0
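The output shapes and parameter counts in this summary can be reproduced by hand. The plain-Python sketch below does so under assumptions inferred from the summary itself: 3 x 3 filters without padding, stride 1, and 2 x 2 max-pooling. The helper names are made up for this example.

```python
def conv_output(size, kernel=3):
    """Side length after an unpadded, stride-1 convolution."""
    return size - kernel + 1

def conv_params(in_channels, out_channels, kernel=3):
    """One kernel x kernel x in_channels filter plus one bias per output channel."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output unit."""
    return (n_in + 1) * n_out

side = 28      # input images are 28 x 28 x 1
params = []

side = conv_output(side)             # conv2d_1: 28 -> 26
params.append(conv_params(1, 64))
side //= 2                           # max_pooling2d_1: 26 -> 13
side = conv_output(side)             # conv2d_2: 13 -> 11
params.append(conv_params(64, 64))
side //= 2                           # max_pooling2d_2: 11 -> 5
side = conv_output(side)             # conv2d_3: 5 -> 3
params.append(conv_params(64, 32))

flat = side * side * 32              # flatten_1: 3 x 3 x 32 = 288
params.append(dense_params(flat, 512))  # dense_1
params.append(dense_params(512, 64))    # dense_2
params.append(dense_params(64, 10))     # dense_3

print(flat)         # 288
print(params)       # [640, 36928, 18464, 147968, 32832, 650]
print(sum(params))  # 237482
```

Max-pooling, flatten, and dropout layers add nothing to the total because they have no weights to learn.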
A single input image of ‘0’, when passed through this trained network, created the following impressions across the layers. The final output from the softmax layer, towards the extreme right, is the probability for each of the digits 0-9. Here, the trained network has identified the image as ‘0’ with a probability of almost 100%.
Let’s see how the original image gets transformed into the output probability. In the first convolutional layer, the original 28 x 28 x 1 image is converted to 64 different images with 26 x 26 x 1 dimensions. Here, 64 convolution filters of size 3 x 3 are used. The first convolutional layer has 640 parameters to calculate since it will estimate the nine values of a 3 x 3 filter matrix and one bias term for each of the 64 different outputs in the next layer, i.e. (3 x 3 x 1 + 1) x 64 = 640.
These 64 filter matrices, learned by the CNN during training, when convolved with the input image of zero (28 x 28 x 1), produce these 64 outputs. All the 64 outputs are shown in one single image below. Notice how the original zero in the input image is preserved in the outputs of the first convolutional layer. When I say preserved, it means that human eyes can still see the zeros in these outputs. Soon, you will notice that when we apply more convolutional filters our eyes will not be able to identify the input zero; however, the network will still preserve the digit in the original image to make the right estimation.
The max-pool layers have no parameters to learn since they are simple max operations across the input matrices. The next convolution layer, layer-2, will produce 64 images from the 64 input images of the first max-pool layer. Each of its 64 filters spans all 64 input images, so the total learnable parameters between these layers are calculated as: (3 x 3 x 64 + 1) x 64 = 36,928.
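For readers unfamiliar with softmax, the sketch below shows how ten raw output scores become ten probabilities that sum to one. The scores are made-up numbers chosen so that digit 0 dominates; they are not the actual outputs of the trained network.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for digits 0-9 (illustrative values only).
scores = np.array([12.0, 0.5, 1.0, -2.0, 0.0, 1.5, 2.0, -1.0, 0.3, 0.8])
probs = softmax(scores)

print(probs.argmax())        # index of the most likely digit
print(round(probs.sum(), 6)) # probabilities sum to 1
```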
After the second convolutional operation, the zero in the image has started fading but there is still a glimpse of the zero in the original image.
The third and final convolutional operation estimates 18,464 parameters, i.e. (3 x 3 x 64 + 1) x 32. Remember, the last convolutional layer produces 32 output images of a much smaller size, 3 x 3 each, for a combined output of 3 x 3 x 32.
The output for the third convolutional layer is displayed here. Our human eyes can’t see the embedded digit in these 32 output images from the third convolutional layer. However, these images have compressed the information from the original image and preserved the embedded zero in the original image. Or more precisely, this pattern generated for the digit 0 is significantly different from the patterns formed by digits 1 to 9 on a 3 x 3 x 32 canvas.
This output becomes the input for the fully connected network, similar to the one we used in the previous part. Recall that last time, when we used the original images as input without the convolutional and max-pooling operations, we got a test set accuracy of 97.9%. This time, with these additional operations, the accuracy has shot up to 99.1%. Here, the CNN has reduced the error rate from 2.1% to 0.9%, a reduction of close to 60% relative to the best model we had earlier.
If you have not noticed, the drawing at the top of this article captures Audrey Hepburn in a famous scene from the movie ‘Breakfast at Tiffany’s’. In this scene, before Hepburn instructs a cab driver, ‘Grand Central Station. And step on it, darling.’, she briefly slides her glasses down close to her lips and puts them back on. In that brief moment, she must have experienced the power of convolutional neural networks on an unconscious level while seeing New York City with and without lenses.
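The "close to 60%" figure is a relative reduction in the error rate, not in accuracy. As a quick check using only the two accuracies quoted above:

```python
fc_accuracy = 0.979   # fully connected network from the previous part
cnn_accuracy = 0.991  # convolutional neural network from this part

fc_error = 1 - fc_accuracy
cnn_error = 1 - cnn_accuracy

# Share of the earlier model's errors that the CNN eliminates.
relative_reduction = (fc_error - cnn_error) / fc_error
print(f"{relative_reduction:.1%}")  # 57.1%
```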
Many thanks for detailed explanation about CNN. I have been following your blog since last 2 months and the way you explained the things is very intuitive. I will look forward for next content on your blog. Thanks.
Thanks, Dilip, for the nice words.
could you please create blog regarding LSTM. I am having hard time to grasp it.
Amazing work Roopam.
Thanks a lot for this series 🙂