Prof. Dr. Ho-Jin Choi
Korea Advanced Institute of Science & Technology (KAIST),
South Korea
Biography:
Prof. Dr. Ho-Jin Choi is a professor in the
School of Computing at Korea Advanced
Institute of Science and Technology (KAIST),
Daejeon, Korea. He received a BS in computer
engineering from Seoul National University
(SNU), Korea, an MSc in computing software
and systems design from Newcastle
University, UK, and a PhD in artificial
intelligence from Imperial College London,
UK. During 1980’s he worked for DACOM Corp.,
Korea, in later 1990’s he joined with Korea
Aerospace University, before he moved to
KAIST in 2009. In early 2000’s, he visited
Carnegie Mellon University (CMU), USA, and
served as an adjunct faculty for the Master
of Software Engineering (MSE) program
operated jointly by CMU and KAIST for 10
years. In 2010’s he participated research in
Systems Biomedical Informatics Research
Center at the College of Medicine, SNU,
worked with Samsung Electronics on big data
intelligence solutions, and with UAE’s
Khalifa University on intelligent
multi-sensor healthcare surveillance. He
also participated in a Korean national
project called Exobrain for natural language
question/answering. Since 2018, he has been
the director of Smart Energy Artificial
Intelligence Research Center, and since 2020
the director of Center for Artificial
Intelligence Research, both at KAIST. His
current research interests include natural
language processing, machine learning,
explainable AI, and smart energy.
Speech Title: DialogCC for Creating
High-Quality Multi-Modal Dialogue Datasets
There has been active research on learning image-text multi-modal dialogue models for sharing images in instant messaging. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this research, we propose an automated pipeline to construct a multi-modal dialogue dataset, ensuring both dialogue quality and image diversity without requiring any human effort. To guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments, i.e., the utterance, speaker, rationale, and image description.
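As a rough illustration of this step, the sketch below (Python, not the authors' actual code) prompts a GPT-4 chat model to return candidate image-sharing moments as JSON; the prompt wording, JSON schema, and model name are illustrative assumptions.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical prompt; the actual DialogCC prompt is not reproduced here.
    PROMPT = (
        "Given the dialogue below, list every turn where sharing an image would be natural. "
        "For each, return a JSON object with keys: utterance, speaker, rationale, "
        "image_description. Answer with a JSON array only.\n\nDialogue:\n{dialogue}"
    )

    def infer_image_sharing_moments(dialogue: str) -> list[dict]:
        """Ask GPT-4 for potential image-sharing moments in a dialogue."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT.format(dialogue=dialogue)}],
            temperature=0,
        )
        return json.loads(response.choices[0].message.content)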
Furthermore, we leverage CLIP similarity to maintain consistency between the multiple images aligned to each utterance.
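A minimal sketch of such a CLIP-based consistency filter, assuming the Hugging Face transformers implementation of CLIP; the model checkpoint and similarity threshold are assumptions for illustration, not the values used for DialogCC.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def filter_images(utterance: str, image_paths: list[str], threshold: float = 0.25) -> list[str]:
        """Keep only images whose CLIP similarity to the utterance passes the threshold."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[utterance], images=images,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        # Cosine similarity between each candidate image and the utterance.
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        sims = (image_emb @ text_emb.T).squeeze(-1)
        return [p for p, s in zip(image_paths, sims.tolist()) if s >= threshold]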
Using this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation. Our experiments highlight that multi-modal dialogue models trained with our dataset achieve enhanced generalization performance on unseen dialogue datasets.