Keynotes 

Prof. Dr. Ho-Jin Choi
Korea Advanced Institute of Science & Technology (KAIST),
South Korea

Biography: Prof. Dr. Ho-Jin Choi is a professor in the School of Computing at Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. He received a BS in computer engineering from Seoul National University (SNU), Korea, an MSc in computing software and systems design from Newcastle University, UK, and a PhD in artificial intelligence from Imperial College London, UK. During the 1980s he worked for DACOM Corp., Korea; in the late 1990s he joined Korea Aerospace University; and in 2009 he moved to KAIST. In the early 2000s, he visited Carnegie Mellon University (CMU), USA, and for 10 years served as adjunct faculty for the Master of Software Engineering (MSE) program operated jointly by CMU and KAIST. In the 2010s, he conducted research at the Systems Biomedical Informatics Research Center at the College of Medicine, SNU, worked with Samsung Electronics on big data intelligence solutions, and collaborated with UAE's Khalifa University on intelligent multi-sensor healthcare surveillance. He also participated in a Korean national project called Exobrain for natural language question answering. Since 2018, he has been the director of the Smart Energy Artificial Intelligence Research Center, and since 2020 the director of the Center for Artificial Intelligence Research, both at KAIST. His current research interests include natural language processing, machine learning, explainable AI, and smart energy.

Speech Title: DialogCC for Creating High-Quality Multi-Modal Dialogue Datasets
As sharing images has become common in instant messaging, there has been active research on image-text multi-modal dialogue models. Training a well-generalized multi-modal dialogue model remains challenging, however, due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this research, we propose an automated pipeline for constructing a multi-modal dialogue dataset that ensures both dialogue quality and image diversity without requiring any human effort. To guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments, i.e., the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency between each utterance and the multiple images aligned to it. Using this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in quality and diversity under human evaluation. Our experiments show that multi-modal dialogue models trained on our dataset achieve stronger generalization performance on unseen dialogue datasets.
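
The abstract describes a two-stage pipeline: an LLM proposes image-sharing moments in a dialogue, and CLIP similarity filters the candidate images attached to each proposed moment. Below is a minimal Python sketch of that idea, assuming the OpenAI chat API and the Hugging Face CLIP implementation; the prompt wording, threshold value, and function names are illustrative placeholders, not the authors' actual DialogCC code.

# Sketch of the two-stage idea described above (not the authors' code).
# Stage 1: prompt an LLM to propose image-sharing moments in a dialogue.
# Stage 2: score candidate images against the generated image description
#          with CLIP and keep only sufficiently similar ones.

from openai import OpenAI                      # assumed: OpenAI Python SDK
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

client = OpenAI()
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def infer_image_sharing_moments(dialogue: str) -> str:
    """Ask the LLM which utterances could be accompanied by an image,
    together with the speaker, a rationale, and an image description."""
    prompt = (
        "Given the dialogue below, list moments where a speaker might share "
        "an image. For each moment, give the utterance, speaker, rationale, "
        "and a short image description.\n\n" + dialogue
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def filter_images_by_clip(description: str, image_paths: list[str],
                          threshold: float = 0.25) -> list[str]:
    """Keep candidate images whose CLIP text-image cosine similarity to the
    generated description exceeds a (hypothetical) threshold."""
    images = [Image.open(p) for p in image_paths]
    inputs = clip_processor(text=[description], images=images,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = clip_model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = clip_model.get_image_features(
            pixel_values=inputs["pixel_values"])
    # Normalize embeddings so the dot product is a cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(-1)
    return [p for p, s in zip(image_paths, sims.tolist()) if s >= threshold]

In this sketch, the LLM output would still need to be parsed into structured fields (utterance, speaker, rationale, description) before the CLIP filtering step is applied per moment; the threshold of 0.25 is an arbitrary illustrative value.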