Imagine that you are the director of a new high-tech start-up, looking to sell a brand new product (a new IoT device let’s say) to the market. You have taken part in many conferences, exhibitions and demos to promote your product ahead of its release and raise awareness. You expect that, as a direct result, sales of the product will be high enough to meet your objectives once it is released.

Here is one problem: you have no way to anticipate or estimate the demand for your product.

One simple fact you could check is how much people are talking about your product on the internet. You could take a look at the main social networks (Facebook, Twitter, Instagram…) and quantify how many posts mention your product. It would give you an idea of the magnitude of sales you could expect.

You have two options for recognizing a product in a social post. You could either analyse its text by performing Text Mining or you could perform Image Recognition on the post’s image. But there is one problem with analyzing the text: when a product is released, there is little chance that everyone knows its name and mentions it right after the release date. Image Recognition is therefore the preferred solution. In order to recognize a product in images (especially if you have several products to recognize), you will need to perform Object Detection, and that’s where Tensorflow comes in.

What is Tensorflow?

Tensorflow is an open-source math library, providing stable Python and C APIs, used for a range of data manipulation tasks. It is widely known and used for Machine Learning and Deep Learning applications such as Neural Networks. The use cases for which Tensorflow is best known are image recognition, natural language processing and speech-to-text analysis.

Our specific use case is Object Detection. You can use Tensorflow at different levels:

  • Level 1: Use a pre-trained model off the shelf and apply it directly to your data to recognize the products. This method is the least complex and fastest as you don’t need to build or train a model.
  • Level 2: Train a model on your own data, so that the model can learn and know the range of your products accurately. This level requires training a model on data you labeled beforehand. It takes more time but brings more relevant results as the model will recognize the exact product references.
  • Level 3: Create your own model from scratch. At this level, you not only train a model, you build it from the very beginning. This can range from tuning the parameters of an existing model to developing a new neural network architecture. It is better suited to research purposes, as it is a time-consuming approach.

Hereafter we describe each level of usage for Tensorflow in more detail.

Level 1 – Use a pre-trained model

The first basic usage is to use a model that has already been trained on a labeled dataset. This option should be chosen whenever the data to which you want to apply the model is very similar to the training data, for example when you want to recognize generic objects like shoes, smartphones, bags and so on. This approach will therefore not be suitable if you expect to recognize a brand and a precise model from your product range.

Usually, such models are trained on standard datasets, like the COCO dataset (github link here) or the ImageNet dataset (link here). These datasets are huge collections of images that are already labeled and ready to use for model training. They are widely used to train image classification and object detection models, which can then serve as pre-trained models for generic product recognition.

Now that we know more about the data, let’s take a quick look at the pre-trained models you could use. One of the most common sources of pre-trained models is Keras. It has a module offering various image classification models (VGG16, ResNet50, InceptionV3…) that have been trained on standard datasets, ImageNet in these cases. The workflow is then simple: you load the model and apply it directly to your images to extract features and predict labels. For example, you can directly classify which of your images contain watches, earphones, headsets… Another example, more specific to the object detection use case, was developed by Google’s Tensorflow team. They built a series of pre-trained models (one of them is trained on the COCO dataset) to be used directly on new images for object detection (github link here). These models can also serve as a starting point for re-training on your own data, which we’ll deal with at level 2.
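
As a rough illustration of the load-and-apply workflow described above, the sketch below loads a ResNet50 classifier from tf.keras.applications and runs it on a batch of images. The helper names load_classifier and classify are ours, and the 224×224 RGB input size is simply the default for this particular model. Treat it as a minimal sketch rather than a production pipeline.

```python
import numpy as np
import tensorflow as tf


def load_classifier(weights="imagenet"):
    """Return a ResNet50 classifier.

    weights="imagenet" downloads the pre-trained ImageNet weights on first use;
    weights=None builds the same architecture with random weights.
    """
    return tf.keras.applications.ResNet50(weights=weights)


def classify(model, images):
    """Return class probabilities for a batch of RGB images.

    `images` is assumed to have shape (N, 224, 224, 3) with pixel
    values in [0, 255].
    """
    # np.array(...) makes a float32 copy, so the caller's data is left untouched.
    batch = np.array(images, dtype="float32")
    batch = tf.keras.applications.resnet50.preprocess_input(batch)
    return model.predict(batch, verbose=0)
```

With the ImageNet weights loaded, tf.keras.applications.resnet50.decode_predictions(probabilities, top=3) should turn each row of probabilities into its three most likely class names.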

Level 2 – Train a model on your own data

When you work with a dataset rather different from the original dataset used for training the model, simply applying it will not work. You will need to build your own training dataset and re-train the model on it.

Most image recognition problems require the use of Convolutional Neural Networks (CNN). The first layers (convolutional layers) extract patterns and features from the image. These features are then passed to the last layers (fully-connected layers) for prediction. There are two approaches for re-training a model on your own data: Transfer Learning and Fine Tuning.

Transfer Learning consists of keeping the pre-trained convolutional layers from the original model and training only the fully connected layers on the features extracted from our set of images. The idea here is to reuse the knowledge gained while training the model on the original dataset for our own problem. This technique is efficient when the two datasets are similar: for example, a model trained to recognize cars could be re-trained to recognize trucks. The reason is that car images are similar to truck images, so the patterns extracted by the convolutional layers from these images are likely to be similar. The figure below illustrates the process:

If your dataset is very different from the original dataset, you will need to perform Fine Tuning. Unlike Transfer Learning, here the whole network (or at least most of the convolutional layers) is re-trained. The convolutional layers are thus trained on your images and can extract features fully adapted to them. The advantage of Fine Tuning is that the model is completely trained on your dataset and is likely to perform better, but it is also more time consuming and computationally expensive. The figure below illustrates the process:
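
In Keras terms, the difference between the two approaches comes down to whether the pre-trained convolutional layers are frozen. The sketch below is one possible way to express this for an image classifier. The function name build_model, the choice of ResNet50 as the base network, the 224×224 input size and the learning rates are all assumptions made for illustration, and inputs are assumed to be preprocessed already.

```python
import tensorflow as tf


def build_model(num_classes, fine_tune=False, weights="imagenet"):
    """Build a classifier on top of a pre-trained convolutional base.

    fine_tune=False: Transfer Learning, the base is frozen and only the
                     new fully connected layer is trained.
    fine_tune=True:  Fine Tuning, the convolutional layers are trained too.
    """
    base = tf.keras.applications.ResNet50(
        include_top=False,          # drop the original fully connected layers
        weights=weights,
        input_shape=(224, 224, 3),
        pooling="avg",              # one feature vector per image
    )
    base.trainable = fine_tune

    inputs = tf.keras.Input(shape=(224, 224, 3))
    # training=False keeps batch-normalisation statistics fixed in both modes.
    features = base(inputs, training=False)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)

    # A smaller learning rate is a common precaution when updating pre-trained weights.
    learning_rate = 1e-5 if fine_tune else 1e-3
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_model(num_classes=4) gives the Transfer Learning variant; passing fine_tune=True makes the whole network trainable, at a higher computational cost.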

To build a dataset for object detection you need to collect a lot of training images, but “How many exactly?” one might ask. Some say hundreds, others say thousands or even hundreds of thousands; there is no definitive answer. However, it is recommended to have more than a hundred images for each class you want your model to recognize.

Once you have collected your images, you need to label them. The labelling process consists of creating JSON files containing the coordinates and the class of each object the model should detect and recognize in each image. This part is usually done manually, but it can be semi-automated with code. Several open-source tools have been developed to speed up manual labelling, and one of them is the BBox-Label-Tool on Github. The tool provides a user interface that loads all the images and lets the user label an image with 2 clicks:

The user only has to click on the top left corner and the bottom right corner of the object; the tool then draws a rectangle around the object and saves the coordinates in a file. This brings the labelling time (excluding the object’s class) down to 3 seconds. The final JSON file, after adding the object’s class, should look like this: {"ymax": 702.0, "xmax": 543.0, "xmin": 289.0, "ymin": 387.0, "classtxt": "pink bag", "class": 4}
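
Before going further, it can be worth checking that each label file is well formed. The snippet below is a small sketch that parses a record with the fields shown above and rejects boxes whose corners are inverted. The BoxLabel class and the parse_label function are hypothetical helpers, not part of the BBox-Label-Tool.

```python
import json
from dataclasses import dataclass


@dataclass
class BoxLabel:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    class_id: int
    class_text: str


def parse_label(raw):
    """Parse one JSON label record and check that the box is well formed."""
    record = json.loads(raw)
    label = BoxLabel(
        xmin=float(record["xmin"]),
        ymin=float(record["ymin"]),
        xmax=float(record["xmax"]),
        ymax=float(record["ymax"]),
        class_id=int(record["class"]),
        class_text=record["classtxt"],
    )
    # The top left corner must lie above and to the left of the bottom right one.
    if not (label.xmin < label.xmax and label.ymin < label.ymax):
        raise ValueError(f"degenerate bounding box: {record}")
    return label


example = (
    '{"ymax": 702.0, "xmax": 543.0, "xmin": 289.0, "ymin": 387.0, '
    '"classtxt": "pink bag", "class": 4}'
)
label = parse_label(example)
# Class name followed by the box width and height in pixels.
print(label.class_text, label.xmax - label.xmin, label.ymax - label.ymin)
# prints: pink bag 254.0 315.0
```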

The final step is to convert the files into a single file, specially formatted for Tensorflow. A simple tutorial is given here.
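
To give an idea of what this conversion involves, here is a hedged sketch that packs one image and its labels into a tf.train.Example and writes a list of examples to a single record file. The feature keys and the normalisation of coordinates to the [0, 1] range follow the convention we understand Tensorflow’s object detection models to expect; the linked tutorial remains the reference. The function names make_example and write_records are ours.

```python
import tensorflow as tf


def _floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


def _bytes(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))


def make_example(image_bytes, width, height, labels):
    """Build one tf.train.Example from encoded JPEG bytes and a list of label dicts.

    Each label dict is assumed to have the fields of the JSON file shown earlier:
    xmin, ymin, xmax, ymax (in pixels), classtxt and class.
    """
    feature = {
        "image/encoded": _bytes([image_bytes]),
        "image/format": _bytes([b"jpeg"]),
        "image/width": _ints([width]),
        "image/height": _ints([height]),
        # Box coordinates are stored as fractions of the image size.
        "image/object/bbox/xmin": _floats([l["xmin"] / width for l in labels]),
        "image/object/bbox/xmax": _floats([l["xmax"] / width for l in labels]),
        "image/object/bbox/ymin": _floats([l["ymin"] / height for l in labels]),
        "image/object/bbox/ymax": _floats([l["ymax"] / height for l in labels]),
        "image/object/class/text": _bytes([l["classtxt"].encode("utf8") for l in labels]),
        "image/object/class/label": _ints([l["class"] for l in labels]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_records(path, examples):
    """Serialise all examples into a single TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for example in examples:
            writer.write(example.SerializeToString())
```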

Once you have created your dataset, you can start training your model, but first you need to find a pre-trained model! Recent research in deep learning has led to the development of several neural-network-based models for object detection. One famous architecture is Faster R-CNN, which is based on Region Proposal Networks (RPN). Faster R-CNN has two networks: an RPN that generates region proposals and a network that uses these proposals to detect objects. The Tensorflow team at Google developed a series of deep learning models designed to be easy for data scientists to use, and one of them is an object detection model based on Faster R-CNN. The model is provided on Tensorflow’s Github page with a tutorial. The tutorial shows the steps to re-train the model (with Transfer Learning) and deploy it on Google ML Engine.

Level 3 – Create your own model from scratch

If you have the soul of a researcher or you have a very specific computer vision problem, you can create your own model from scratch. In this case, you have three options:

  • Build an image classification model with CNNs

The simplest solution for product recognition is image classification with a Convolutional Neural Network architecture. These models take an image as input and output a label indicating the object. They should thus only be used when each image contains a single object you want to detect. Even then, using an image classification model for object detection is likely to perform poorly, simply because of the noise around the object in the image. Image classification models are designed to extract features from the whole image and classify the image as a whole. The problem is that two very different images, for example an image of a room and an image of a street, both containing the desired object, should receive the same label. This makes the classification task difficult. Here are two tutorials on how to build CNNs from scratch.
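
For reference, a very small CNN of this kind could look like the sketch below: two convolutional blocks that extract features, followed by fully connected layers that predict the label. The layer sizes, the 128×128 input and the build_cnn name are arbitrary choices for illustration, not a recommended architecture.

```python
import tensorflow as tf


def build_cnn(num_classes, input_shape=(128, 128, 3)):
    """Build a small image classifier: convolutional layers, then dense layers."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Rescaling(1.0 / 255),              # pixels from [0, 255] to [0, 1]
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # feature extraction
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),     # prediction
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The model outputs one probability per class for the whole image, which is why it cannot tell you where the object is.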

  • Build an object detection model

As said before, object detection models have emerged recently from research in Deep Learning. The Faster R-CNN architecture was introduced in late 2015 as an iteration of the older Fast R-CNN architecture. It outputs:

a) a list of bounding boxes

b) a label assigned to each bounding box

c) a probability for each label and bounding box
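
To make these three outputs concrete, the sketch below represents each detection as a box, a label and a probability, and keeps only the confident ones. The Detection class, the keep_confident helper, the 0.5 threshold and the sample values are all made up for illustration; they are not the output format of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (xmin, ymin, xmax, ymax) in pixels
    label: str
    score: float  # probability in [0, 1]


def keep_confident(detections, min_score=0.5):
    """Return the detections with score >= min_score, highest score first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


detections = [
    Detection((289, 387, 543, 702), "pink bag", 0.91),
    Detection((10, 20, 60, 90), "watch", 0.32),
    Detection((300, 50, 420, 160), "headset", 0.67),
]
for d in keep_confident(detections):
    print(d.label, d.score)
# prints:
# pink bag 0.91
# headset 0.67
```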

First, a pre-trained CNN is applied to the image, creating a feature map. These features are then passed to an RPN to find candidate regions (bounding boxes) that contain relevant objects. The last step uses the features computed by the CNN and the bounding boxes to classify the content of each bounding box and adjust its coordinates (so that it better fits the object). This is done via the R-CNN module. The figure below illustrates the architecture:

More details are given in this tutorial, and open-source work on the topic is available here.

  • Build a hybrid model using object detection architectures and CNNs

If you want to boost your performance, you may consider the hybrid approach. An object detection approach can sometimes classify badly (the object is well detected but the wrong label is predicted). In this case, you can first train an object detection model on your dataset with only one label (“object” for example) and then crop your images based on the coordinates of the bounding boxes. This will create a set of images containing only the relevant objects detected in the original images. You can then use this set as a training set for CNN image classification.
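
The cropping step is simple array slicing. Here is a minimal sketch, assuming images are NumPy arrays of shape (height, width, channels) and boxes are given in pixels as (xmin, ymin, xmax, ymax). The crop_detections name and the blank placeholder image are ours.

```python
import numpy as np


def crop_detections(image, boxes):
    """Return one cropped array per (xmin, ymin, xmax, ymax) box, in pixels."""
    height, width = image.shape[:2]
    crops = []
    for xmin, ymin, xmax, ymax in boxes:
        # Clamp the box to the image bounds and convert to integer pixel indices.
        x0, y0 = max(0, int(xmin)), max(0, int(ymin))
        x1, y1 = min(width, int(xmax)), min(height, int(ymax))
        # Skip boxes that fall entirely outside the image.
        if x1 > x0 and y1 > y0:
            crops.append(image[y0:y1, x0:x1].copy())
    return crops


image = np.zeros((800, 600, 3), dtype=np.uint8)  # placeholder for a real photo
crops = crop_detections(image, [(289.0, 387.0, 543.0, 702.0)])
print(crops[0].shape)  # (315, 254, 3)
```

Each crop can then be resized to the input size of the classification model and saved with the product’s real label.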

Extension – Automated Machine Learning

Nowadays, new initiatives are emerging that aim to make the process of creating and customizing your own model available to anyone, even people with no technical background. The aim is that, instead of using ready-to-apply models (as in level 1), you will be able to train your own model without needing the underlying knowledge.

How does it work?

The solution automates the creation of the image classification or object detection model. You just need to provide a labeled dataset of your images and send it to the solution. This data is ingested to train a computer vision model (as in level 2 or level 3). The output is a trained model, sometimes even an API, that you can use to predict labels and locate objects in new images. The model optimizes itself, so you don’t need to know how to implement it or how to set its parameters.

What are the solutions?

Two main players are emerging in this area and have released tools, which are still in beta at the moment.

  • Firstly, there is Google, with the Cloud AutoML component of its Cloud platform suite (link here). Cloud AutoML is not only designed for computer vision problems; it can also be used for other Machine Learning applications, such as Natural Language Processing and translation.
  • Secondly, there is the open source initiative AutoKeras (link here). It is based on Keras and relies on an automatic search for the architecture and hyperparameters of deep learning models.

Co-written article by Matthieu Montaigu and Kasra Mansouri, Data Scientists at Artefact.
