Paper 9

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames (NeurIPS 2025) Arnab, Anurag, et al. "Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames." arXiv preprint arXiv:2507.02001 (2025). Abstract: Despite recent advances in VLMs (Vision-Language Models), long-video understanding remains a challenging problem. State-of-the-art long-context VLMs can process about 1,000 input frames, but such sequence lengths ..

Paper 2026.01.28

Agentic Keyframe Search for Video Question Answering

Agentic Keyframe Search for Video Question Answering (arXiv 2025) Fan, Sunqi, Meng-Hao Guo, and Shuojin Yang. "Agentic Keyframe Search for Video Question Answering." arXiv preprint arXiv:2503.16032 (2025). Abstract: VideoQA (Video Question Answering) makes it possible to extract and understand key information from video through natural-language interaction, which is an important step toward achieving intelligence. However, the demand for thorough video understanding and the high computational cost limit the application of VideoQA. To address this ..

Paper 2026.01.28

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning (CVPR 2025) Liu, Huabin, Filip Ilievski, and Cees GM Snoek. "Commonsense video question answering through video-grounded entailment tree reasoning." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. Abstract: For commonsense video question answering (VQA), this paper ..

Paper 2026.01.06

On the Faithfulness of Vision Transformer Explanations

On the Faithfulness of Vision Transformer Explanations (CVPR 2024) Wu, Junyi, et al. "On the faithfulness of vision transformer explanations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. Abstract: To interpret Vision Transformers, post-hoc explanations assign importance scores (salience scores) to input pixels, producing heatmaps that humans can understand. However, whether these explanations actually .. the model's output ..

Paper 2025.10.14

Question Aware Vision Transformer for Multimodal Reasoning

Question Aware Vision Transformer for Multimodal Reasoning (CVPR 2024) Ganz, Roy, et al. "Question aware vision transformer for multimodal reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. Abstract: Vision-Language models have enabled notable progress in multimodal reasoning. These architectures typically comprise a vision encoder, an LLM, and .. visual features .. the LLM's representation spac..

Paper 2025.09.24

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering (TPAMI 2025) Song, Enxin, et al. "Moviechat+: Question-aware sparse memory for long video question answering." IEEE Transactions on Pattern Analysis and Machine Intelligence (2025). Abstract: Recently, building video understanding systems by integrating video foundation models with large language models has made it possible to overcome the limitations of specific vision tasks. However, existing methods ..

Paper 2025.09.24

MEERKAT: Audio-Visual Large Language Model for Grounding in Space and Time

MEERKAT: Audio-Visual Large Language Model for Grounding in Space and Time (ECCV 2024) Chowdhury, Sanjoy, et al. "Meerkat: Audio-visual large language model for grounding in space and time." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024. Abstract: Leveraging the strong capabilities of LLMs (Large Language Models), recent MLLM (Multimodal Large Language Model) research .. them to other modalit.. such as visual and audio ..

Paper 2025.09.23

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training (CVPR 2025) Qiu, Haiyi, et al. "STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. Abstract: Video-LLMs have recently .. basic video understanding (captioning, coarse-grained question answe..

Paper 2025.09.17

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision (ICCV 2025) Gupta, Ayush, et al. "TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision." arXiv preprint arXiv:2506.09445 (2025). Before we begin: being familiar with the concept of 'instruction tuning' will help you follow this post! 2025.09.17 - [Concept] - Instruction tuning. Abstract: This paper .. the problem of video question answering (VideoQA) with temporal grounding .. temp..

Paper 2025.09.16