Semantic Segmentation Technology and Training Methods through Deep Learning

In the field of computer vision, semantic segmentation is the process of partitioning a digital image into multiple sets of pixels; segmentation simplifies and transforms the representation of the image into something easier to interpret. Along with object detection, semantic segmentation is one of the most widely used processes in computer vision.

[Figure 1] Example of semantic segmentation (Source: COCO 2020)

Whereas object detection produces classification results for specific regions of an image, semantic segmentation produces a classification result for every pixel. Therefore, as shown in the example above, it is possible to distinguish the meaning of each part of the image (people, roads, walls, trees, rivers, grass, sky, etc.).
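
To make the difference concrete, here is a minimal sketch in Python; the array shapes, box format, and class IDs are illustrative assumptions, not taken from any particular model:

    import numpy as np

    # Object detection: a handful of regions, each with one class label.
    # Assumed box format: (x1, y1, x2, y2, class_id).
    detections = np.array([
        [ 40,  80, 200, 300, 1],   # e.g. class 1 = "person"
        [220, 150, 400, 280, 2],   # e.g. class 2 = "vehicle"
    ])

    # Semantic segmentation: one class label for EVERY pixel.
    height, width = 480, 640
    segmentation_map = np.zeros((height, width), dtype=np.uint8)  # 0 = background
    segmentation_map[80:300, 40:200] = 1    # pixels belonging to the person
    segmentation_map[150:280, 220:400] = 2  # pixels belonging to the vehicle

    print(detections.shape)        # (2, 5)     -> a few labeled regions
    print(segmentation_map.shape)  # (480, 640) -> a label for every pixel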

[Figure 2] Example of semantic segmentation and object detection (Source: Stanford University School of Engineering)

One of the areas that makes heavy use of semantic segmentation is autonomous driving. Tesla, a leader in AI-based autonomous driving, analyzes images in real time through semantic segmentation, object detection, and monocular depth estimation.

In particular, semantic segmentation plays a very important role in Tesla’s autonomous driving: it distinguishes vehicles from people and the road from everything that is not the road, and it identifies center lines and crosswalks in the image.

[Figure 3] Example of Tesla’s use of artificial intelligence (Source: Tesla)

We have discussed this as if it were a given, but it has only been a few years since semantic segmentation became viable for mission-critical [1] areas such as autonomous driving. Until then, it was nearly impossible to program semantic segmentation accurate enough to handle the variety of real-world situations. That semantic segmentation has only recently reached human-level performance makes it a historic development in the field of computer vision.

What made all of this possible is deep learning, the technology at the heart of today’s artificial intelligence. As deep learning was combined with computer vision, image interpretation techniques such as convolutional neural networks (CNNs) developed rapidly, and analysis with high accuracy became possible.

As we know, AI can become very capable through deep learning, but to do so it has to be trained on vast amounts of data.

So, let’s look at what kind of data an artificial intelligence is trained on to make semantic segmentation possible.

[Figure 4] Left: Original image, right: Processed image (Source: ADE Challenge 2016)

Training data can take various forms depending on how it is defined, but the process of creating training data for semantic segmentation is as follows.

  1. Define the classes (semantic types) for segmentation
  2. Categorize the pixels of the original images according to the predefined classes
  3. Create processed images in which each pixel’s RGB value is changed according to its class
  4. Create mapping information between the classes and their RGB values

To train an artificial intelligence for semantic segmentation, one must first decide which classes (semantic types) to segment the image into. The classes can be set to whatever the AI is meant to recognize. In the image above, for example, background, people, trees, flower pots, chairs, and bags could all be classes.
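
As a sketch, such a class definition can be as simple as a table of IDs and names; the IDs and names below are hypothetical, chosen to match the figure:

    # Hypothetical class definition; ID 0 is conventionally the background.
    CLASSES = {
        0: "background",
        1: "person",
        2: "tree",
        3: "flower pot",
        4: "chair",
        5: "bag",
    }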

Next, the original images must be labeled according to the predefined classes. This is the process of giving the AI the correct answers. Since a person has to mark those answers on each image, a tool is needed to perform this task.

[Figure 5] Classifying images at the semantic level with blackolive, Testworks’ data-processing tool

Once images have been classified at the semantic level with the tool, an image can be created in which the RGB value of each pixel is changed according to that classification. The tool uses many distinct colors for the annotator’s convenience, but for AI training this can be reduced to a simpler form. The processed image in [Figure 4] is one where the RGB value of each pixel has been changed to (1,1,1), (2,2,2), (3,3,3), and so on for each class, with the background set to (0,0,0).
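
A minimal sketch of this simplification step, assuming the tool exports one distinct display color per class (the display colors below are made up for illustration):

    import numpy as np

    # Hypothetical display colors used by the annotation tool (RGB),
    # mapped to the simplified per-class ID k, stored as (k, k, k).
    TOOL_COLOR_TO_CLASS_ID = {
        (255, 0, 0): 1,   # person
        (0, 255, 0): 2,   # tree
        (0, 0, 255): 3,   # flower pot
    }

    def simplify_label_image(tool_image: np.ndarray) -> np.ndarray:
        """Convert a colorful tool export (H, W, 3) into the simplified form
        where class k is stored as RGB (k, k, k) and background as (0, 0, 0)."""
        simplified = np.zeros_like(tool_image)
        for color, class_id in TOOL_COLOR_TO_CLASS_ID.items():
            mask = np.all(tool_image == color, axis=-1)
            simplified[mask] = class_id  # broadcasts to (k, k, k)
        return simplified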

Finally, the class/RGB mapping information defines which class each RGB value in the processed image belongs to. For a simplified processed image, a numeric ID can be assigned to each class and the RGB values set according to that ID: for example, class 1 “person” has the RGB value (1, 1, 1) and class 2 “vehicle” has the RGB value (2, 2, 2).
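
In practice, this mapping is often stored alongside the dataset, and at training time the processed image is converted back into a two-dimensional array of class IDs. A sketch under those assumptions (the file name is hypothetical):

    import json
    import numpy as np

    # Class/RGB mapping as described above: class k <-> RGB (k, k, k).
    mapping = {
        "1": {"name": "person",  "rgb": [1, 1, 1]},
        "2": {"name": "vehicle", "rgb": [2, 2, 2]},
    }
    with open("class_mapping.json", "w") as f:
        json.dump(mapping, f, indent=2)

    def processed_image_to_ids(processed: np.ndarray) -> np.ndarray:
        """(H, W, 3) processed image -> (H, W) class-ID map. Works because
        every pixel is (k, k, k), so any single channel holds the ID."""
        return processed[..., 0].astype(np.int64)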

Once the training data is ready, the AI can be trained by taking the original image as input, performing semantic segmentation at the pixel level, and comparing the result with the correct answer in the processed image.
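
A minimal training-step sketch in PyTorch, assuming this style of data; the model here is a toy stand-in, where a real setup would use a segmentation CNN such as U-Net or DeepLab:

    import torch
    import torch.nn as nn

    # Toy stand-in for a segmentation network; a real model would output
    # (N, num_classes, H, W) per-pixel class scores just like this one.
    model = nn.Conv2d(3, 6, kernel_size=1)  # 6 classes, incl. background
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()       # compared pixel by pixel

    # One synthetic batch: original images + class-ID "correct answers".
    images = torch.randn(4, 3, 128, 128)           # original images
    targets = torch.randint(0, 6, (4, 128, 128))   # per-pixel class IDs

    optimizer.zero_grad()
    logits = model(images)              # (4, 6, 128, 128) per-pixel scores
    loss = criterion(logits, targets)   # compare prediction vs. answer
    loss.backward()
    optimizer.step()
    print(f"loss: {loss.item():.4f}")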

When training has finished properly, the AI can perform semantic segmentation on images captured by a camera and be used in fields such as autonomous driving. The figure below is an example of semantic segmentation performed by an AI on camera images, made more vivid by coloring each class.

[Figure 6] Left: Camera image, right: Results of semantic segmentation
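
A sketch of how such a colored image can be produced from a model’s per-pixel scores (the palette here is arbitrary):

    import numpy as np
    import torch

    # Arbitrary display palette: one RGB color per class ID.
    PALETTE = np.array([
        [  0,   0,   0],  # 0: background
        [220,  20,  60],  # 1: person
        [  0,   0, 142],  # 2: vehicle
        [128,  64, 128],  # 3: road
    ], dtype=np.uint8)

    def colorize(logits: torch.Tensor) -> np.ndarray:
        """(num_classes, H, W) scores -> (H, W, 3) color image for display."""
        class_ids = logits.argmax(dim=0).cpu().numpy()  # per-pixel prediction
        return PALETTE[class_ids]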

Do you think it can distinguish vehicles, people, trucks, road signs, roads, sidewalks, plants, trees, buildings, and sky well?

Beyond autonomous driving, semantic segmentation is being used in a variety of fields such as unmanned robots, security, anomaly detection, factory automation, and medical care.

Artificial intelligence, which continues to advance rapidly with deep learning, is expected to accomplish even more in the future.


[1] An area where a trivial mistake can lead to fatal consequences


Beomjin Kim

Assistant Research Engineer, AI Research & Development Team

BA, Business Informatics, Konkuk University

He is growing as a software engineer by accumulating diverse experience at Testworks, with the aim of making life more convenient for many people through technology. He is currently in charge of deep learning model and AI service development on the AI Research & Development Team.