TY - JOUR
T1 - TTV
T2 - Towards advancing text-to-video generation with generative AI models and a comprehensive study of model fidelity, performance, and human perception
AU - Onisha, Tasnim Akter
AU - Wimmer, Hayden
AU - Rebman, Carl M.
N1 - Publisher Copyright: © 2025 International Association for Computer Information Systems. All rights reserved.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Text-to-video generation has rapidly evolved as a groundbreaking application of generative AI, with the potential to revolutionize both creative and industrial sectors. Despite these advancements, the fidelity, performance, and real-world applicability of current models remain inadequately explored. This research aims to address this gap by evaluating the performance of three cutting-edge text-to-video models: Runway Gen-2, CogVideoX-2B, and CogVideoX-5B. The primary objectives of this study are to (1) conduct a comprehensive evaluation of these models using rigorous mathematical assessments such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIPScore to measure video quality, realism, and alignment with text input; (2) gather human perceptual data to assess perceived realism, quality, and accuracy; and (3) compare the models to identify strengths, weaknesses, and areas for improvement. To uncover how AI-generated videos measure up to human expectations, this study asked 60 participants to rate outputs from three leading text-to-video models using a 7-point Likert scale, 10 diverse prompts, and 10 real-world benchmarks. While CogVideoX-2B impressed with its precision and alignment, CogVideoX-5B stood out for its striking realism in the eyes of human viewers. These findings reveal a compelling trade-off between technical accuracy and perceptual appeal, which highlights the need for evaluation methods that balance both.
AB - Text-to-video generation has rapidly evolved as a groundbreaking application of generative AI, with the potential to revolutionize both creative and industrial sectors. Despite these advancements, the fidelity, performance, and real-world applicability of current models remain inadequately explored. This research aims to address this gap by evaluating the performance of three cutting-edge text-to-video models: Runway Gen-2, CogVideoX-2B, and CogVideoX-5B. The primary objectives of this study are to (1) conduct a comprehensive evaluation of these models using rigorous mathematical assessments such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIPScore to measure video quality, realism, and alignment with text input; (2) gather human perceptual data to assess perceived realism, quality, and accuracy; and (3) compare the models to identify strengths, weaknesses, and areas for improvement. To uncover how AI-generated videos measure up to human expectations, this study asked 60 participants to rate outputs from three leading text-to-video models using a 7-point Likert scale, 10 diverse prompts, and 10 real-world benchmarks. While CogVideoX-2B impressed with its precision and alignment, CogVideoX-5B stood out for its striking realism in the eyes of human viewers. These findings reveal a compelling trade-off between technical accuracy and perceptual appeal, which highlights the need for evaluation methods that balance both.
KW - CogVideoX
KW - CogVideoX-2B
KW - CogVideoX-5B
KW - Generative AI
KW - Runway Gen-2
KW - TTV
KW - text-to-video generation
KW - text-to-video generative models
KW - transformer models
UR - https://www.scopus.com/pages/publications/105017993322
U2 - 10.48009/1_iis_128
DO - 10.48009/1_iis_128
M3 - Article
AN - SCOPUS:105017993322
SN - 1529-7314
VL - 26
SP - 377
EP - 393
JO - Issues in Information Systems
JF - Issues in Information Systems
IS - 1
ER -