Building AI Training Datasets Through Automation Technology

Data is one of the three key elements of the AI industry, alongside algorithms and computing power. Testworks is a company specializing in processing artificial intelligence data: that is, researching and developing AI technology to build AI datasets more effectively. We aim to increase productivity by reducing the time and cost of building datasets through automation based on artificial intelligence models.

The process of building AI training datasets

The process of building a dataset used for training an artificial intelligence model [1] is as follows.

Figure 1 [2]

First, we collect the raw data needed for training AI models, such as images, videos, audio, and text. The collected data are then cleaned and refined through the pre-processing steps outlined below, turning them into proper source data for labeling.

  • Eliminate errors or noise that occur during data acquisition
  • Standardize the original data to a common format
  • Remove unnecessary data
  • Remove duplicate data
  • Perform de-identification operations to protect privacy
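As a small illustration of one of the steps above, duplicate removal can be done by comparing content hashes of the files. This is only a minimal sketch, not Testworks' actual pipeline; the directory layout and function name are assumptions.

```python
import hashlib
from pathlib import Path

def remove_duplicates(image_dir: str) -> list[str]:
    """Keep the first copy of each unique file, comparing content hashes."""
    seen: set[str] = set()
    kept: list[str] = []
    for path in sorted(Path(image_dir).glob("*")):
        # two files with identical bytes produce identical SHA-256 digests
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path.name)  # exact duplicates are simply skipped
    return kept
```

Hashing the full file catches exact duplicates only; near-duplicates (e.g. re-encoded images) would need perceptual hashing instead.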

After that, as in the figure below, the data are annotated by providing ground-truth labels for the raw data so that the AI model can make the right decisions (or analyses).

Figure 2

Finally, the labeled data are reviewed and inspected for incomplete or missing labels. After additional re-labeling and inspection as needed, the dataset is built and delivered to the client.
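Part of this inspection step can itself be automated with simple consistency checks. The sketch below, a hypothetical example with an assumed annotation format, flags records with missing labels or bounding boxes that fall outside the image.

```python
def find_label_issues(annotations, image_sizes):
    """Flag records with missing labels or out-of-bounds boxes.

    annotations: list of dicts like
        {"image": "img1.jpg", "label": "car", "box": [x1, y1, x2, y2]}
    image_sizes: dict mapping image name -> (width, height)
    """
    issues = []
    for i, ann in enumerate(annotations):
        if not ann.get("label"):
            issues.append((i, "missing label"))
            continue
        w, h = image_sizes[ann["image"]]
        x1, y1, x2, y2 = ann["box"]
        # a valid box has positive area and lies inside the image
        if not (0 <= x1 < x2 <= w and 0 <= y1 < y2 <= h):
            issues.append((i, "box outside image"))
    return issues
```

Checks like these catch mechanical errors cheaply; semantic mistakes (a car labeled as a truck) still require human review.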

Time and cost of building data for AI training

On average, the most time-consuming part of AI development is preparing the dataset: the time spent collecting, cleaning, and labeling data accounts for 75% of an AI project. In other words, building AI training data incurs a great deal of time and cost.

Figure 3 [3]

Introduction of technology to reduce data processing cost and improve productivity

Testworks has introduced technology to minimize these costs and time.

Preliminary cleaning of data using Python

In the case of data taken with smartphones, rotation information is also embedded in the images. For instance, an image might be stored rotated 90 degrees from how the picture was taken. Such rotated images are unsuitable for labeling and training. To solve this problem, we remove the rotation information with a Python script so that annotators can work with clean, standardized images.
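A minimal sketch of this idea, assuming the Pillow library is available: the rotation information smartphones embed is the EXIF Orientation tag, and `ImageOps.exif_transpose` applies it to the pixel data and drops the tag, so the saved file is upright without any orientation metadata. The function name here is illustrative, not Testworks' actual script.

```python
from PIL import Image, ImageOps  # assumes Pillow is installed

def normalize_orientation(src: str, dst: str) -> None:
    """Apply the EXIF Orientation tag to the pixels, then save without it."""
    with Image.open(src) as img:
        # rotates/flips the pixel data per the Orientation tag and
        # removes the tag from the resulting image
        upright = ImageOps.exif_transpose(img)
        upright.save(dst)
```

Running this over a collection directory gives annotators images whose displayed orientation matches the stored pixels.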

De-identification using AI models

Protecting privacy is important, so to remove personally identifiable information from data, annotators must first check each data item one by one and perform de-identification operations. As the amount of de-identification work grows, worker fatigue increases, which can lead to missed objects. To solve these problems and increase the efficiency of de-identification, we utilize artificial intelligence models trained to identify and anonymize target objects.

First, the AI model performs the de-identification operations, and then reviewers inspect and correct the results, reducing the workload of the annotators and increasing overall productivity.
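The anonymization step can be sketched as follows. This toy example assumes a detector (not shown) has already produced bounding boxes for faces or license plates; each region is then pixelated by replacing every block with its mean value. The image is a plain 2-D list of grayscale values to keep the sketch dependency-free.

```python
def pixelate_regions(image, boxes, block=8):
    """Pixelate each (x1, y1, x2, y2) region of a 2-D grayscale image
    (list of rows) so the detected objects become unreadable."""
    for x1, y1, x2, y2 in boxes:
        for by in range(y1, y2, block):
            for bx in range(x1, x2, block):
                ys = range(by, min(by + block, y2))
                xs = range(bx, min(bx + block, x2))
                vals = [image[y][x] for y in ys for x in xs]
                avg = sum(vals) // len(vals)  # mean value of the block
                for y in ys:
                    for x in xs:
                        image[y][x] = avg     # overwrite with the mean
    return image
```

In production one would use an image library and a trained detector; the point is only that de-identification reduces to "detect, then destroy detail inside each box".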

Human-in-the-loop data labeling

Even regular data labeling is not 100% manual.

First, we manually label 1-5% of the data that needs to be labeled. We then use this labeled subset to train an AI model, and the trained model is used to auto-label the remaining data.

If the auto-labeling process labels data incorrectly, an annotator inspects and corrects the mistakes. When annotation and inspection are complete, the AI model is retrained on the labeled and reviewed dataset.

In this semi-automated labeling process, the AI model is repeatedly trained on data that humans have reviewed, reducing the burden on the workers and increasing the accuracy of data labeling.
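The loop described above can be sketched end to end. Everything here is a stand-in: `CentroidModel` is a toy classifier on one-dimensional features (not Testworks' model), and `review` represents the human correction step as a callback.

```python
class CentroidModel:
    """Toy stand-in for the real model: predicts the class whose
    mean feature value is closest to the input."""
    def fit(self, feats, labels):
        self.centroids = {}
        for c in set(labels):
            vals = [f for f, l in zip(feats, labels) if l == c]
            self.centroids[c] = sum(vals) / len(vals)
        return self

    def predict(self, feats):
        return [min(self.centroids, key=lambda c: abs(f - self.centroids[c]))
                for f in feats]

def human_in_the_loop(seed_feats, seed_labels, unlabeled_feats, review):
    """One iteration: train on the human-labeled seed, auto-label the
    rest, let a reviewer correct, then retrain on everything."""
    model = CentroidModel().fit(seed_feats, seed_labels)
    auto = model.predict(unlabeled_feats)            # auto-labeling pass
    reviewed = [review(f, l) for f, l in zip(unlabeled_feats, auto)]
    model.fit(seed_feats + unlabeled_feats, seed_labels + reviewed)
    return model, reviewed
```

Each pass through the loop grows the reviewed dataset, so the model the next pass trains on is strictly better informed than the last.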

Figure 4

Conclusion

To build a high-quality dataset for artificial intelligence training, automation technology is required to reduce cost. So far, we have only used auto-labeling for datasets with sufficient training data; for certain domains, it is hard to develop an auto-labeling model when the training data itself is small. To solve this problem, we are researching ways to augment small training datasets with few-shot learning, GANs, and active learning. I hope that large-scale, high-quality datasets will be built through automation technology, thereby revitalizing the artificial intelligence ecosystem.

References


[1] NIA – Guide to Building AI Training Datasets (PDF)

[2] Ministry of Science and ICT – Press release on the draft AI data quality standard

[3] NIA – Policy Directions to Improve the Effectiveness of the AI Training Data Business (PDF)


Seungwon Lee

Senior Researcher, AI Model Development Team

Sogang University, Bachelor of Computer Science and Engineering

After completing a training class at Testworks, he became interested in social values and joined Testworks. He is very interested in creating social value through technology.