
Computer Vision and AI: My Experiences

In 2019 and 2020 I participated in two computer vision challenges and learned a lot: the Kaggle Plant Seedlings Classification Challenge and the Data Science Bowl.

Kaggle is probably the best-known platform for data science and artificial intelligence challenges. Competitions are regularly held there in cooperation with companies, which usually provide very well-prepared data; the participants' task is to produce the best possible predictions on this data. The top participants can win cash prizes, but most are there to compete with the best data scientists and machine learning engineers in the world. In the Plant Seedlings Challenge, the task was to identify which of twelve species of plant seedlings was shown in an image.

The data set was small with 4000 images for training, but our model was still able to predict the correct class with an accuracy of 98.3 percent. Training the model took only a few hours.

In the Data Science Bowl, the task was to identify each nucleus in microscopic images of cellular material. An additional challenge was to recognize the nuclei despite different resolutions of the images and types of staining of the cell material.

In both challenges, we secured a place in the top 10 percent of participants. Challenges like these are also a good way to illustrate the progress of computer vision.

The first, comparatively simple task is to determine what is in an image. That is what this blog post is about.

Computers recognize objects better than humans

On benchmarks such as ImageNet, computers now detect objects more accurately than humans.

Since 2012, there have been remarkable advances in computer vision every year, driven by deep learning and by improvements in deep learning models and methods. As a benchmark, researchers most often compete in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The ImageNet dataset was created by researchers and consists of over one million photos labeled by humans. The models must recognize which of 1000 possible objects is in an image. There are 1.2 million images available for training the models, and how well they recognize the objects is tested on 100,000 further images. The differences between the classes are sometimes very subtle: more than 120 different dog breeds are represented in the dataset, so the models have to recognize not only that there is a dog in the image, but exactly which breed it is. It is similar for birds, snakes and monkeys.

The ImageNet dataset can be browsed online.

How Deep Learning Models See


Neural networks are built in layers. Data flows into the first layer and is processed there; the results are passed to the next layer. At the last layer, the network outputs its prediction. In the case of the Plant Seedlings Classification Challenge, the prediction is to which of the 12 given classes an image belongs.

In 2014, the VGG network achieved top results in the ImageNet Large Scale Visual Recognition Challenge (it won the localization task and came second in classification). The following picture shows, in simplified form, how it works:

On the left side, an image with a height and width of 224 pixels and 3 dimensions for the red, green and blue channels is entered. On the right side, 1000 individual values are output: the network's estimate, for each of the 1000 objects, of how likely it is to be visible in the image. In between, the input is processed so that the height and width become smaller while more and more dimensions are created. In the first layer, there are still 224 pixels of height and width, but 64 dimensions. In the next layers, the size is halved to 112 pixels, while the number of dimensions is doubled to 128.
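This shrinking-and-deepening pattern can be traced with a few lines of code. The sketch below only follows the shapes through VGG-16's five convolutional stages (with the channel counts from the original paper, simplified to one entry per stage); it is not the network itself:

```python
# Trace how spatial size and channel depth evolve through VGG-16's
# five convolutional stages (channel counts from the original paper).
h = w = 224   # input height/width in pixels
channels = 3  # RGB input

stage_channels = [64, 128, 256, 512, 512]
shapes = [(h, w, channels)]
for c in stage_channels:
    shapes.append((h, w, c))  # conv layers keep the spatial size
    h //= 2                   # 2x2 max pooling halves height and width
    w //= 2
    shapes.append((h, w, c))

print(shapes[-1])  # (7, 7, 512) before the fully connected layers
```

The trace passes through (14, 14, 512), the compressed feature maps mentioned later in this post, before the final pooling step.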

Convolutional Neural Networks

Each of these additional dimensions describes where a particular feature is located in the image. A feature can be as simple as an edge, or as complex as an eye or a car rim. The feature is found by sliding a filter over the image that searches for just that feature.

You could think of it this way: someone searches the image with a magnifying glass, starting at the top left. They are looking for a specific feature, in our example the vertical edge shown in blue. When they find one, they make a cross in the top-left corner of an extra sheet of paper. Then they move the magnifying glass 1 cm further and make another cross on the same sheet if they find another edge. They continue like this until they have searched the whole image. Then they start again, but look for a different feature, for example a horizontal edge, shown here in green. They again note their finds as crosses, but on a new sheet of paper. They do this for many different features, and each of these sheets of crosses corresponds to one of the dimensions described above.

In the next step, these sheets are stacked on top of each other, so to speak, and form a new image in which features are searched for again. For example, one could now search for corners, which consist of the previously found edges. This continues until the images are only a very compressed representation of the features: the feature maps at the end of the VGG network are only 14 pixels wide and tall, but represent 512 different features.

Sliding a filter over the image is called convolution – hence the name of Convolutional Neural Networks.

The following image shows a convolution. The input is blue and the filter is gray. From these, a so-called feature map is calculated, shown in green. The feature map indicates where in the input image the feature was found. In the magnifying-glass example, the feature map would be one of the sheets with the crosses.


In Convolutional Neural Networks, the feature map is computed by multiplying the filter element-wise with the corresponding subregion of the input and summing the results (the scalar product of an image patch and the filter). In the following example, the input image is again shown in blue. It contains a vertical edge: light pixels have the value 1 and dark pixels the value 0. The filter, shown in gray, detects vertical edges. When it is multiplied with an image area containing a vertical edge, the result is a higher value than for an area without one. As a result, the feature map has high values only where the edge is visible in the corresponding input.
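The edge example above can be reproduced in a few lines of NumPy. The image and filter values here are illustrative, chosen to match the description: light pixels are 1, dark pixels are 0, and a vertical edge runs down the middle. (Strictly speaking, deep learning frameworks compute a cross-correlation, sliding the filter without flipping it; the sketch does the same.)

```python
import numpy as np

# Toy input: light pixels (1) on the left, dark pixels (0) on the right,
# so there is a vertical edge between columns 2 and 3.
image = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
], dtype=float)

# A 3x3 filter that responds to vertical light-to-dark edges.
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

def convolve2d(img, k):
    """Slide the filter over the image, multiply element-wise, and sum."""
    kh, kw = k.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

fmap = convolve2d(image, kernel)
print(fmap)  # high values (3.0) only in the columns covering the edge
```

The resulting feature map is zero everywhere except in the positions where the filter window covers the edge, just as in the description above.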


What do filters look like?

In the next layers of the neural network, filters are again multiplied with the feature maps of the previous layers. Thus, features of features are created, and this concept is what makes Convolutional Neural Networks so powerful. In the lower layers, simple features such as edges and color gradients are detected. In the following layers, these are combined into more complex features such as corners, curves and simple patterns like honeycombs or text. In even deeper layers, whole objects are recognized, such as car rims or eyes. Finally, the network can recognize the object in the image by assembling the individual features: if, for example, two hairy pointed ears, a snub nose and whiskers are visible, it is – probably – a cat.

In their paper, Zeiler and Fergus show images of the filters of a Convolutional Neural Network trained on the ImageNet dataset (PDF). Various filters can be seen, mostly in gray. Next to them are sections of images for which these filters output particularly high values.

The following picture shows a simple Convolutional Neural Network with 10 filters. Under CONV you can see the resulting feature maps. At RELU you can see the same feature maps, but with negative values set to 0. Under POOL the size of the feature maps is reduced. Under FC you can see the network's confidence for each class the image might belong to. Note that in the deeper layers the feature maps no longer make much sense to us humans; only single bright pixels can be seen. Nevertheless, the models can make very good predictions with them. A live version of this model is available online.
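The RELU and POOL steps from that picture are simple enough to demonstrate directly. The feature map values below are made up for illustration:

```python
import numpy as np

# A small feature map with both positive and negative activations.
fmap = np.array([
    [ 2.0, -1.0,  0.5, -3.0],
    [-0.5,  4.0, -2.0,  1.0],
    [ 1.0, -1.0,  3.0, -0.5],
    [-2.0,  0.5, -1.0,  2.0],
])

# RELU: negative activations are set to 0.
relu = np.maximum(fmap, 0)

def max_pool_2x2(x):
    """2x2 max pooling: keep only the largest value in each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# POOL: the feature map shrinks from 4x4 to 2x2.
pooled = max_pool_2x2(relu)
print(pooled)
```

Pooling discards the exact position of each activation and keeps only its strongest local response, which is one reason the deep feature maps look like scattered bright pixels to us.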

How the filters are created

Initially, the filter matrices are filled with random values. During training, the model learns what these filters must look like so that it can identify the correct object in an image as reliably as possible. To do this, the model is shown an image to classify, and the error of its prediction is measured at the output. From this error it is calculated, backwards layer by layer, how the filters have to change to make the error smaller. This adjustment is repeated with many different images, in many small steps, until the model stops improving.
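The core loop can be sketched with a deliberately tiny stand-in: instead of a full network, a single weight vector plays the role of the filter, and plain gradient descent plays the role of the training procedure. (All numbers here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)

true_w = np.array([1.5, -2.0, 0.5])   # the "filter" we want the model to learn
X = rng.normal(size=(100, 3))         # 100 training examples
y = X @ true_w                        # their correct outputs

w = rng.normal(size=3)                # start from random filter values
lr = 0.1
for step in range(200):
    pred = X @ w                      # show the model the examples
    error = pred - y                  # measure the error at the output
    grad = X.T @ error / len(X)       # how the weights must change
    w -= lr * grad                    # take a small step that reduces the error

print(np.round(w, 3))                 # close to [1.5, -2.0, 0.5]
```

After a few hundred small steps, the random initial weights have converged to the values that minimize the error, which is exactly what happens, on a vastly larger scale, to the filters of a convolutional network.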

Training a computer vision model

Training a model from scratch takes time. For example, training modern models on the ImageNet dataset with a single high-performance GPU takes around two weeks. It also takes a lot of training data. ImageNet has 1.2 million sample images available for training.

In the Plant Seedlings Challenge, there were only 4,000 images for training; yet we were able to train a very accurate model within a few hours. The method that makes this possible is transfer learning.

Transfer learning

Transfer learning takes advantage of the fact that in their lower layers, already trained networks look for very simple features such as straight lines, corners or circles. One therefore takes a model already trained on, for example, the ImageNet dataset and replaces the top layers with new, untrained layers adapted to the new task. The bottom layers are kept and initially frozen so that they cannot be changed during training. The model therefore has to adapt the new upper layers so that they give good results with the feature maps produced by the lower layers from the previous task. After a few training steps, the model is also allowed to change the filters of the lower layers. For example, the model might modify a filter that looked for car rims in the ImageNet dataset so that it now looks for a specific leaf shape.
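The mechanics of freezing can be shown with a toy NumPy example: a fixed "pretrained" layer turns inputs into features, and only a newly added top layer is updated. Every shape and weight here is made up for illustration; a real setup would load, say, ImageNet weights into a deep network rather than using random matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained lower layer: its weights are never updated.
W_frozen = rng.normal(size=(5, 8)) / np.sqrt(5)
# New, randomly initialized top layer for the new task.
w_top = rng.normal(size=8)

X = rng.normal(size=(200, 5))              # data for the new task
features = np.maximum(X @ W_frozen, 0)     # frozen feature extractor (ReLU)
y = features @ rng.normal(size=8)          # targets learnable from the features

init_mse = np.mean((features @ w_top - y) ** 2)

lr = 0.05
for step in range(2000):
    error = features @ w_top - y
    grad = features.T @ error / len(X)
    w_top -= lr * grad                     # only the top layer changes

final_mse = np.mean((features @ w_top - y) ** 2)
print(init_mse, final_mse)
```

Because the frozen layer's features are already useful, the new head needs far fewer examples and steps than training everything from scratch; unfreezing the lower layers afterwards corresponds to letting `W_frozen` receive gradient updates too.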

The model has learned from the ImageNet dataset which features make up objects in photographs and can now use these features to learn how to recognize completely different objects. This works even when the new images bear little resemblance to the ImageNet dataset, such as satellite images or microscopic images of cellular material.

Reduce training time

Transfer learning has become the standard method for computer vision problems. It reduces both the training time required and the amount of data needed. Still, the models produce very good results – such as our 98.3 percent accuracy in the Plant Seedlings Challenge.

In this blog post, I showed how a model figures out what is in an image. The next step is to determine where in the image each object is. I will describe that part in my next blog post.
