FineVQ: Fine-Grained User Generated Content Video Quality Assessment

Huiyu Duan1, Qiang Hu1,*, Jiarui Wang1, Liu Yang1, Zitong Xu1, Lu Liu1, Xiongkuo Min1, Chunlei Cai2, Tianxiao Ye2, Xiaoyun Zhang1, Guangtao Zhai1
1Shanghai Jiao Tong University, 2Bilibili Inc.

We present the fine-grained video quality assessment database and model, termed FineVD and FineVQ, respectively.

The rapid growth of user-generated content (UGC) videos has created an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally provide only an overall rating for a UGC video and lack the fine-grained labels needed by video processing and recommendation applications.

To address these challenges and promote the development of UGC video applications, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model that learns the fine-grained quality of UGC videos and supports quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that FineVQ produces fine-grained video quality results and achieves state-of-the-art performance on FineVD and other commonly used UGC VQA datasets.
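To make the three capabilities concrete, the sketch below lists hypothetical prompts for each task; these are illustrative assumptions, not the prompt templates actually used by FineVQ.

# Hypothetical prompts illustrating FineVQ's three capabilities; the exact
# templates are assumptions, not the ones shipped with the model.
prompts = {
    "rating": "How would you rate the overall quality of this video? "
              "(bad / poor / fair / good / excellent)",
    "scoring": "Please rate the sharpness of this video on a scale from 1 to 5.",
    "attribution": "Which distortions degrade this video, and how severe are they?",
}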




FineVD

An overview of the content and construction of FineVD. (a) Example videos from our database, which contains both common UGC videos and short-form UGC videos. (b) Illustration of the subjective annotation process, including both quality scoring and quality attribute labeling. (c) Quality-related question-answering pairs generated by GPT-4 and revised by human annotators.
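To make the label format concrete, below is a minimal sketch of what a single FineVD annotation record could look like; the field names and the set of quality dimensions shown are assumptions for illustration, not the released schema.

# A hypothetical FineVD annotation record; field names and the set of
# quality dimensions are assumptions, not the released schema.
record = {
    "video_id": "ugc_000123",
    "mos": {                      # mean opinion score per quality dimension
        "color": 3.8,
        "noise": 2.4,
        "blur": 3.1,
        "overall": 3.2,
    },
    "qa_pairs": [                 # GPT-4-generated, human-revised
        {
            "question": "Is the video sharp or blurry?",
            "answer": "Slightly blurry, most visibly in fast-moving regions.",
        },
    ],
}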






FineVQ

An overview of our proposed FineVQ model. The model consists of three feature encoders: an image encoder that extracts spatial features from sparsely sampled video frames, a motion encoder that extracts motion features from the entire video, and a text encoder that extracts aligned text features from prompts. The extracted features are aligned through projectors and fed into a pre-trained LLM to generate the output. LoRA weights are introduced into the pre-trained image encoder and the large language model to adapt them to the quality assessment task.
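The following PyTorch sketch shows one way to wire up this three-branch layout. The concrete encoders, hidden sizes, module names, and the HuggingFace-style inputs_embeds interface are all assumptions for illustration, not the released FineVQ implementation.

# A minimal PyTorch sketch of the three-branch layout described above;
# encoders, dimensions, and names are placeholder assumptions.
import torch
import torch.nn as nn

class FineVQSketch(nn.Module):
    def __init__(self, image_encoder, motion_encoder, text_encoder, llm,
                 img_dim=1024, mot_dim=768, txt_dim=768, llm_dim=4096):
        super().__init__()
        self.image_encoder = image_encoder    # spatial features from sparse frames (LoRA-adapted)
        self.motion_encoder = motion_encoder  # motion features from the whole video
        self.text_encoder = text_encoder      # text features from the prompt
        # Projectors align each modality to the LLM embedding space.
        self.img_proj = nn.Linear(img_dim, llm_dim)
        self.mot_proj = nn.Linear(mot_dim, llm_dim)
        self.txt_proj = nn.Linear(txt_dim, llm_dim)
        self.llm = llm                        # pre-trained LLM (LoRA-adapted)

    def forward(self, sparse_frames, video, prompt_ids):
        # Each encoder is assumed to return token features of shape (B, N, D).
        img_tok = self.img_proj(self.image_encoder(sparse_frames))
        mot_tok = self.mot_proj(self.motion_encoder(video))
        txt_tok = self.txt_proj(self.text_encoder(prompt_ids))
        # Concatenate modality tokens and let the LLM generate the answer.
        tokens = torch.cat([img_tok, mot_tok, txt_tok], dim=1)
        return self.llm(inputs_embeds=tokens)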






Performance on FineVD

Performance of state-of-the-art models and the proposed FineVQ on our established FineVD database on the quality scoring task.
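Quality scoring in UGC-VQA is conventionally evaluated by correlating predicted scores with mean opinion scores; a minimal sketch, assuming SRCC and PLCC as the reported metrics:

# Minimal SRCC/PLCC computation for quality scoring, assuming predictions
# are compared against mean opinion scores (MOS).
import numpy as np
from scipy.stats import spearmanr, pearsonr

def vqa_metrics(predicted, mos):
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    srcc = spearmanr(predicted, mos).correlation  # rank (monotonicity)
    plcc = pearsonr(predicted, mos)[0]            # linear correlation
    # Note: many VQA protocols fit a logistic mapping to the predictions
    # before computing PLCC; omitted here for brevity.
    return srcc, plcc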






Performance on Other VQA Databases

Performance comparison between state-of-the-art VQA methods and the proposed FineVQ on six UGC VQA databases.



BibTeX

@inproceedings{duan2025finevq,
      title={FineVQ: Fine-Grained User Generated Content Video Quality Assessment},
      author={Duan, Huiyu and Hu, Qiang and Wang, Jiarui and Yang, Liu and Xu, Zitong and Liu, Lu and Min, Xiongkuo and Cai, Chunlei and Ye, Tianxiao and Zhang, Xiaoyun and Zhai, Guangtao},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2025}
}