The Importance of GANs in Creating AI Training Datasets

At the International Astronautical Congress (IAC)[i] held in Guadalajara, Mexico on September 27, 2016, SpaceX and Tesla CEO Elon Musk said, "We will send our first rover to Mars by 2018 and build our first space colony in 2024." In the Iron Man movies, the AI assistant Jarvis handles complex tasks, from designing, manufacturing, assembling, and operating the robot suit to everyday chores, all in response to Robert Downey Jr.'s gestures and voice. With the efforts and imagination of scientists around the world, it seems we will be able to enjoy the fruits of such technology in the not-too-distant future.

In June 2015, shortly after the launch of Google Photos, news broke that a photo of a Black man in the United States and his friend had been classified as 'gorillas', sparking a controversy over racism. Google promised to fix the issue immediately, but in 2018 the American technology magazine Wired revealed that Google had resolved the problem simply by deleting the relevant keywords and classifications from the system. In April 2021 (as I write this), I had a short conversation with an AI speaker that could not handle much beyond "Tell me today's weather," boarded a human-driven 'non-autonomous' vehicle, and went to work at the Testworks headquarters in Jamsil. There is still a wide gap between people's expectations of artificial intelligence and the actual reality [ii]. As someone whose job is to develop AI models, I find this a little embarrassing.

[Figure 1] Gartner Hype Cycle shows the progression of technology from expectations to reality

Then why, despite the innovative announcements made every year by world-class scholars and CEOs of global IT companies, hasn't our daily life changed all that much? Data, the foundation of AI training, is critical to the success of the AI-driven 'Fourth Industrial Revolution'. In industries that have not yet benefited from AI innovation, developers tend to say, 'Give us AI too,' and machine learning engineers respond, 'Give us data first.' In this post, I would like to show that Generative Adversarial Network (GAN) [iii] technology is one way to solve this data shortage problem.

Ian Goodfellow, who first proposed the GAN, likened it to a game between the police and a counterfeiter. The generator plays the counterfeiter, repeatedly learning to produce counterfeit bills that can deceive the police, while the discriminator plays the police, learning to tell real from fake; training continues until the synthesized data reaches a level the discriminator can no longer distinguish from the real thing. This technology plays two important roles for AI training data.
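
The adversarial game can be sketched in a few lines of code. The toy example below (PyTorch; the one-dimensional 'real' data, network sizes, and hyperparameters are all illustrative assumptions, not taken from the original paper) shows the two players: a generator acting as the counterfeiter and a discriminator acting as the police.

```python
# Minimal GAN sketch of the counterfeiter/police game.
# All sizes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim = 8

generator = nn.Sequential(        # counterfeiter: noise -> fake sample
    nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(    # police: sample -> probability of "real"
    nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # "Real" data: samples from N(3, 1) stand in for genuine bills.
    real = torch.randn(64, 1) + 3.0
    noise = torch.randn(64, latent_dim)
    fake = generator(noise)

    # 1) Train the discriminator (police) to separate real from fake.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # 2) Train the generator (counterfeiter) to fool the discriminator.
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```

As training alternates between the two steps, the generator's output distribution drifts toward the real one, which is exactly the property that makes GANs useful for synthesizing additional training data.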

First, it solves the problem of scarce data through quantitative and qualitative improvement of the data.

Building a dataset for AI training requires 1) data collection and 2) data annotation. If even one of these steps is incomplete, it is difficult to train the expected AI model.

1) Examples of collection difficulties

  • When data on special situations such as snow, rain, yellow dust, or accidents is needed to improve autonomous driving AI.
  • When video data from surveillance cameras is needed to develop a monitoring system for a secure military area.
  • When images of intimate body areas are needed to develop an AI that diagnoses sexually transmitted diseases.
  • It is not possible to create cancer patients in order to acquire the cancer cell data needed to train a cancer-diagnosis AI.
  • It is difficult to destroy vehicles to obtain the damaged-vehicle data required for estimating accident repair costs.

2) Examples of labeling difficulties

  • Medical and legal data is difficult to label without professional knowledge.
  • Native speakers must be hired to evaluate foreign-language pronunciation.
  • Labeling sensor data values that cannot be verified by the human eye.
  • When judgments may differ depending on the subjectivity of the evaluator (e.g., characterizing expression or style as 'dandy' or 'chic').
  • When workers may feel disgust during processing (e.g., pornography, violent scenes).

By using a GAN, a training dataset can be built by augmenting a small collected and labeled dataset. In a study [iv] conducted in Tel Aviv, Israel, researchers spent six years collecting liver lesion data (cysts, metastases, hemangiomas) in close cooperation with hospitals and medical institutions to train an AI model for diagnosing liver lesions, and ended up with a total of 182 images. Just 182 images!

If an AI is trained with only this modest amount of data, not only will it suffer from bias and overfitting, but it will also struggle to perform robustly on new, varied input data. The researchers therefore used a GAN to synthesize tens of thousands of new images with the characteristics of each lesion type (cyst, metastasis, hemangioma) and trained on them, obtaining improved sensitivity (no augmentation: 57%, simple augmentation: 78.6%, GAN augmentation: 85.7%).

Here, simple augmentation refers to translating, rotating, flipping, and scaling the image data, as in the sketch below.
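
As a rough illustration, such a simple augmentation baseline could look like the following torchvision pipeline (the library choice and parameter ranges are assumptions for illustration, not the settings used in the study).

```python
# A minimal sketch of the "simple augmentation" baseline:
# translation, rotation, flip, and scale applied to images.
from torchvision import transforms

simple_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),          # flip
    transforms.RandomRotation(degrees=15),           # rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),    # translation
                            scale=(0.9, 1.1)),       # scale
    transforms.ToTensor(),
])

# augmented = simple_augmentation(pil_image)  # apply to a PIL image
```

These transforms only reshuffle the pixels of existing images, whereas the GAN synthesizes genuinely new lesion images, which is why the GAN-augmented model performed best in the study.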

[Figure 2] AI research results for synthetic-data-based diagnosis from the Tel Aviv study

Second, it solves data usability problems through data privacy protection.

GAN can also be used to obtain training data that is easy to collect and label but difficult to utilize. On July 14, 2020, the Korean government announced the 'Korean New Deal Comprehensive Plan' [v]. One of the ten major tasks of this Korean version of the New Deal is the 'data dam': a wide range of data is stored up in a 'dam' and made available wherever it is needed. Currently, 'de-identification' plays the role of the dam's 'waterway'. Even at this moment, an incalculable amount of data is being collected from CCTVs and vehicle cameras installed everywhere. To use such large-scale data containing personal information for industry and academic research without restriction, both the unstructured identifiers (face, voice, etc.) and the structured identifiers (name, address, resident registration number, etc.) need to be anonymized.

However, data processed with traditional methods that focus on removing personal information (blurring, pixelation, etc.) cannot be used for AI research (person detection/identification, abnormal behavior/situation recognition, emotion recognition, etc.), as illustrated below. It is therefore necessary to anonymize data in an irreversible form, so that individuals cannot be recognized by the human eye, while the data can still be used for training and testing AI with minimal performance degradation.
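
For concreteness, a traditional pixelation-based approach might look like the sketch below (OpenCV with a Haar cascade face detector; the detector choice and block size are illustrative assumptions). It removes the identity, but it also destroys exactly the facial information that detection, emotion, and behavior research would need.

```python
# A minimal sketch of traditional de-identification by pixelating detected faces.
# The Haar cascade detector and block size are illustrative assumptions.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def pixelate_faces(image, blocks=8):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        face = image[y:y + h, x:x + w]
        # Downscale then upscale with nearest-neighbor to pixelate the face.
        small = cv2.resize(face, (blocks, blocks), interpolation=cv2.INTER_LINEAR)
        image[y:y + h, x:x + w] = cv2.resize(
            small, (w, h), interpolation=cv2.INTER_NEAREST)
    return image
```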

Recently, de-identification technology based on face synthesis has been emerging. GAN-based image synthesis technology [vi] such as DeepFake de-identifies personal information by superimposing a virtual face on an original photo or video. Unlike existing methods (blurring, pixelation, etc.), key information that is meaningful for R&D, such as the expression and pose of the original person, is preserved, while the person in the image is replaced with a newly synthesized one. Researchers at the Technical University of Munich demonstrated CIAGAN (Conditional Identity Anonymization GAN) [vii] at CVPR 2020. In addition to creating and compositing a person with a new identity, it improves data usability through a de-identification process designed to selectively adjust key personal attributes (age, gender, body characteristics, etc.).

[Figure 3] CIAGAN Model of Technical University of Munich (CVPR2020)

In the Testworks R&D lab, I am using synthetic data generation to help professors, researchers, and CEOs from private companies, industry, academia, and public institutions who struggle to secure datasets for AI training. Listening to the 'voice of the customer' in the industrial and research fields, we provide:

  1. the minimum quantity of real data required for synthetic data generation
  2. the time required for data synthesis
  3. the synthetic data itself
  4. quantitative analysis of the synthesized data (FID, PSNR, SSIM, etc.; see the sketch below)
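
As a rough sketch of item 4, the per-image metrics PSNR and SSIM can be computed as follows (scikit-image is an assumed library choice; FID compares whole image distributions through an Inception network and requires a separate pipeline, so it is omitted here).

```python
# A minimal sketch of per-image quality metrics for synthetic data.
# Library choice (scikit-image) is an assumption for illustration.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(reference: np.ndarray, synthetic: np.ndarray) -> dict:
    """Compare a synthetic image against a reference image (H x W x 3, uint8)."""
    return {
        "psnr": peak_signal_noise_ratio(reference, synthetic),
        "ssim": structural_similarity(reference, synthetic, channel_axis=-1),
    }
```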

Recently, I have been examining, discussing, and researching the following technologies with my team members.

  • Domain Adaptation: data synthesis technology for special situations (snow/rain/night/frost/dust/moisture, etc.)
  • Super Resolution: technology to improve data quality by increasing the resolution of the data
  • Semantic Synthesis: a method for additionally synthesizing a specific desired object into an image
  • Image Inpainting: a technology that removes a specific area within an image and fills it in so that the result looks seamless (a classical sketch of this task follows the figures below)
[Figure 4] Result of Domain Adaptation experiment under study at Testworks
[Figure 5] Super Resolution test result under study at Testworks
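
To make the inpainting task concrete, the sketch below uses classical OpenCV inpainting (Telea's method) as a stand-in; the file names and mask are hypothetical, and the GAN-based approach we are studying replaces this classical fill with a learned one.

```python
# A minimal sketch of the inpainting task using classical OpenCV inpainting
# (Telea's method) as a stand-in for the GAN-based approach; file names and
# mask are hypothetical.
import cv2

image = cv2.imread("street.png")                              # hypothetical input image
mask = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)    # white = region to remove

# Remove the masked object and fill the hole so it blends with its surroundings.
result = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("inpainted.png", result)
```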

Some people are concerned about the various side effects GAN has triggered by blurring the boundary between the real and the fake. Others cite the saying 'there is more than one way to skin a cat' and embrace the advantages of synthetic data. I expect GAN to be used appropriately, not only to revitalize the data ecosystem but also to serve as a catalyst for building AI training datasets.


[i] IAC (International Astronautical Congress): https://www.iafastro.org/events/iac/

[ii] Gartner Top 10 Strategic Technology Trends for 2020: https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2020/

[iii] Goodfellow, Ian J., et al. “Generative adversarial networks.” arXiv preprint arXiv:1406.2661 (2014).

[iv] Frid-Adar, Maayan, et al. “GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification.” Neurocomputing 321 (2018): 321-331.

[v] http://www.knewdeal.go.kr/

[vi] Rossler, Andreas, et al. “Faceforensics++: Learning to detect manipulated facial images.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[vii] Maximov, Maxim, Ismail Elezi, and Laura Leal-Taixé. “Ciagan: Conditional identity anonymization generative adversarial networks.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.


Hyeongbok Kim

Senior Researcher, AI Model Development Team

Harbin Institute of Technology, Computer Science and Technology, PhD Course

He returned to Korea due to COVID-19 during his AI research and currently works on the Testworks AI development team. He is interested in contributing to society through technology.