ynnnxxi's brutally tough day begins ❤︎

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

ynnnxxi — Wed, 28 Jan 2026 19:14:54 +0900

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames (NeurIPS 2025)

Arnab, Anurag, et al. "Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames." arXiv preprint arXiv:2507.02001 (2025).

Abstract

Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a difficult problem. State-of-the-art long-context VLMs can process around 1,000 input frames, but they still struggle to use this sequence length effectively, and their performance degrades because of irrelevant distractors within the context window. The paper proposes Temporal Chain of Thought, an inference strategy for video question-answering that curates the model's input context. It uses the VLM itself to iteratively identify and extract the most relevant frames from the video, and these selected frames are then used to generate the answer.

The authors show that spending more computation at inference time to select the most relevant context leads to higher accuracy, in line with recent work on inference-time scaling of LLMs. The method achieves SOTA on 4 diverse video question-answering datasets and shows consistent improvements across 3 different VLMs. It performs particularly well on longer videos that would otherwise not fit within the model's context window.

Introduction

Figure 1: Temporal Chain of Thought. Motivated by the fact that long input contexts can have distractors which confuse the model, we use the VLM itself to first extract relevant context (blue box) before processing it. Our approach improves accuracy, and by iteratively processing parts of the video at a time, can also reduce the model's required context window.

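Despite recent advances in VLMs, understanding long videos remains a difficult problem. The difficulty arises because the task requires VLMs to process long input token sequences while also having a range of closely intertwined capabilities: action and scene understanding, long-term memory, and tracking state changes and interactions, among others. The long-context ability of VLMs, which lets them process hundreds or even a thousand input frames, is an important step forward in this respect. However, a number of studies show that with longer contexts the model can be overwhelmed by irrelevant or misleading content, so that accuracy saturates or even degrades.

Building on the observation that an overly large input context can itself act as a distractor, the paper proposes an inference strategy called Temporal Chain of Thought. The strategy first aggregates relevant context from the input video and then uses it to answer the question. Prior work built on the principle of removing distracting context from video typically used an ensemble of several models: one model captions individual frames, another finds the relevant frames, and finally an LLM answers the question. In contrast, this work uses a single VLM both to select the relevant context and to answer the question, and shows that this inference strategy yields substantial performance gains.

The proposed approach is inspired by recent LLM research showing that scaling computation at inference time can be more effective than scaling model parameters. Similarly, the paper shows that spending more computation on aggregating relevant information from the video leads to higher accuracy. Because the approach iteratively extracts relevant context from the video, it can also handle videos that would otherwise not fit within the model's context limit. The approach is also connected to Chain-of-Thought prompting in the language domain, where the model is prompted to first output text-based thoughts that help it before producing the final answer. In the process of aggregating relevant frames from the video, those frames can be seen as visual thoughts. As a by-product, the model's justifications for selecting the relevant frames can be used to interpret the model.

The authors confirm that the general principle of aggregating relevant context from the video benefits video question-answering, and show that the method consistently improves performance across 4 datasets and 3 different VLMs. For short videos of a few hundred frames, even though the entire video fits in the model's context window, the inference strategy still improves performance by removing distractors from the input and thereby improving the model's reasoning.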

Contribution

A novel VLM inference strategy for Video QA
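Thorough experimental analysis confirming that the context-aggregation principle is effective, showing that the approach is applicable across diverse types and generalizes to multiple VLMs
Achieves SOTA on 4 video understanding benchmarks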

Proposed Approach

Figure 2: Temporal Chain of Thought. We use Single-Step TCoT(left, Sec. 3.2.) to construct our final approach (right). Namely, we use the VLM itself to extract relevant frames from an input video clip, conditioned on the input question. To scalably process longer videos, we perform this approach within l segments which span the video to extract the most relevant context. Finally, we use only the extracted context for answering

<Single-Step Temporal Chain of Thought> (left)

Input: N frames & question (text)
Frame Selection prompt: do not state the answer first; output the list of required frame IDs first
Output:
- a list of the frame numbers judged relevant to the question + Justification (the reason those frames are relevant)
- "Before generating the answer, the VLM is made to pick the relevant frame indices (visual evidence) itself"

<Dynamic-Segment Temporal Chain of Thought> (right)

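Decompose video into segments
- the whole video is split into l segments
- each segment is a contiguous time span of the video
Process each segment independently with Single-Step TCoT
- sample a subset of frames from each segment, then feed these frames and the question to the VLM to select only the important frames in that segment
Frame aggregation
- gather the important frames chosen per segment into the final key context (set of frames) relevant to the question
Answer stage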
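The steps above can be sketched roughly as follows. This is only my own minimal sketch, not the authors' code: `split_into_segments`, `sample_uniform`, `dynamic_segment_tcot` and the toy selector are hypothetical names, and the real selection call is a prompted VLM rather than a Python function.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the VLM selection call:
# (candidate frame indices, question) -> relevant frame indices.
Selector = Callable[[Sequence[int], str], List[int]]


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into contiguous, non-overlapping segments."""
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def sample_uniform(segment: range, budget: int) -> List[int]:
    """Uniformly subsample at most `budget` frame indices from one segment."""
    if len(segment) <= budget:
        return list(segment)
    step = len(segment) / budget
    return [segment[int(i * step)] for i in range(budget)]


def dynamic_segment_tcot(num_frames: int, num_segments: int, budget: int,
                         question: str, select: Selector) -> List[int]:
    """Select frames per segment, then aggregate them into one context."""
    selected: List[int] = []
    for segment in split_into_segments(num_frames, num_segments):
        candidates = sample_uniform(segment, budget)
        selected.extend(select(candidates, question))
    return sorted(set(selected))


def toy_select(frames: Sequence[int], question: str) -> List[int]:
    """Toy selector (assumption): pretend every 50th frame is relevant."""
    return [f for f in frames if f % 50 == 0]


context = dynamic_segment_tcot(400, 4, 100, "What happens after the door opens?", toy_select)
print(context)
```

The aggregated `context` is what would be passed, together with the question, to the final answering call.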

Figure 3: Prompt for our VLM selection call, S, (Eq. 4).

Standard VLM inference

The standard way for a VLM to answer a question about an input video is simply to pass both to the model. Visual inputs are typically projected or tokenized into the same space as the language tokens. Because the model's total sequence length is limited by compute, the video frames usually have to be subsampled to fit within the model's context limit. Current models with the longest context windows can typically process videos of up to 1 hour at 1 fps.

Temporal Chain of Thought

The method starts from the fact that, although VLMs can handle ever larger input context lengths, they still fail to use them effectively and get confused by irrelevant distractors within a large context. To answer a question about an input video, the two inputs are not passed directly to the model as they are. Instead, video question-answering is decomposed into 2 stages: first extracting relevant context from the video, then using it to answer the question. Both stages are carried out by the same instruction-tuned VLM that produces the answer. This is inspired by Chain-of-Thought and related LLM reasoning strategies.

Single-Step TCoT

This simple approach (Figure 2, left) is the basis of the final method: it directly queries the VLM for which frames are needed to answer the question.

Dynamic-Segment TCoT

Dynamic-Segment TCoT is designed to decouple video length from the model's context limit and to overcome the limited frame recall that results from uniformly sampling input frames to fit that limit. As shown in Figure 2, the video is partitioned into l non-overlapping segments of equal length, which are processed independently before the selected frames are aggregated.

Discussion

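By splitting the video into l segments, long videos can be processed at a fixed computational cost regardless of video length. With standard VLM inference, the computational cost grows with video length and is capped by the maximum supported context limit. With this approach, by contrast, the required context length is always fixed, and the total number of frames processed is the per-call context length x the number of segments.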
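To make the arithmetic concrete, here is a tiny sketch. The numbers (120 frames per call, 8 segments) are made up for illustration and are not from the paper, and `tcot_frame_budget` is a hypothetical helper name.

```python
def tcot_frame_budget(frames_per_call: int, num_segments: int) -> dict:
    """Per-call context stays fixed; total work scales with the number of segments."""
    return {
        "max_frames_in_one_context": frames_per_call,
        "total_frames_processed": frames_per_call * num_segments,
    }


# Illustrative numbers only.
budget = tcot_frame_budget(frames_per_call=120, num_segments=8)
print(budget)
```

Doubling the number of segments doubles the total frames processed while the largest single context stays at 120 frames.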

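By adjusting the number of segments, both inference-time computation and accuracy can be increased smoothly. This trend is consistent with recent language research that uses additional inference-time computation to solve hard problems with LLMs.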

Experiments

Table 3: State-of-the-art comparison. We report our own Gemini 1.5 Flash baseline for Egoschema and LVBench as it outperforms [54, 57]. For LVBench and OpenEQA, we report the tokens used in a single context-window, and the total number of tokens processed. OpenEQA uses the "LLM-as-judge" protocol of using GPT-4 to evaluate the answer. †: Our reproduction.

"It extracts relevant context from the input video, but is fundamentally different in that it operates directly on video frames rather than using captions as an intermediate representation. Because it does not rely on initial per-frame captions, the approach is not limited by a captioner missing details relevant to the question (especially since captioning in those works is not conditioned on the input question)."

This is what the paper says, and it seems worth keeping in mind. It would be good to have an experiment that compares whether the Chain-of-Thought approach itself is what is effective, or whether it is simply the VLM that is strong.

Agentic Keyframe Search for Video Question Answering

ynnnxxi — Wed, 28 Jan 2026 14:32:22 +0900

Agentic Keyframe Search for Video Question Answering (arXiv 2025)

Fan, Sunqi, Meng-Hao Guo, and Shuojin Yang. "Agentic Keyframe Search for Video Question Answering." arXiv preprint arXiv:2503.16032 (2025).

Abstract

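Video Question Answering (VideoQA) enables extracting and understanding key information from videos through natural-language interaction, and is an important step toward achieving intelligence. However, the demand for thorough video understanding and the high computational cost limit the application of VideoQA. To address this, the paper proposes Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. AKeyS uses modern language agents to direct classical search algorithms, so that it can effectively distinguish key information from redundant and irrelevant content.

First, the video is segmented and organized into a tree structure. AKeyS then uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent decides, based on termination conditions, whether enough keyframes have been collected and provides the answer.

Experiments on the EgoSchema & NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe search efficiency. This means it can accurately identify key information and perform effective visual reasoning with minimal computational overhead.

On the EgoSchema subset, AKeyS achieves 1.8% higher accuracy than VideoTree while processing only 43.5% of the frames.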

Introduction

Figure 1. Demonstration of AKEYS's high frame efficiency. When processing the same number of video frames with the same (M)LLM, AKEYS achieves higher QA accuracy. At the same accuracy level (66%), AKEYS uses only about 1/4 of the frames required by VideoTree. Moreover, VideoTree clusters features of all frames during preprocessing, whereas AKEYS only has access to visible frames and does not utilize information from the rest. This experiment is conducted on EgoSchema subset.

The rapid progress of Multimodal Large Language Models (MLLMs) has greatly simplified image understanding tasks in everyday life. Users can easily upload an image to GPT-4V or Gemini, ask questions about it, and receive responses through natural-language interaction. Video understanding, however, is a bigger challenge: Video Large Language Models (Video-LLMs) often struggle to capture fine details in videos and lack a holistic understanding of video content. The computational cost of Video-LLMs is also much higher than that of LLMs or image-based MLLMs, which hinders commercial deployment. To address everyday video understanding tasks more effectively, the paper focuses on efficient extraction of keyframes, which are then analyzed with image-based MLLMs to perform video understanding.

Figure 2. Comparison of three methods for analyzing a travel vlog: (1) Video-LLM can generate correct answers but is highly token-intensive; (2) The method of uniform frame sampling may introduce irrelevant content, leading MLLM to incorrect predictions; (3) The method of keyframe sampling for MLLM achieves both accuracy and efficiency. The keyframes relevant to the given question are highlighted in the figure.

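One of the main advantages of keyframe extraction is that it greatly reduces computational overhead while retaining the essential information. In Figure 2, only the keyframe-sampling method achieves both accuracy and efficiency among the three, which underlines the importance of keyframes in the VideoQA task. The key challenge, however, is how to effectively identify the keyframes that contain the information needed to answer a particular question. This problem is even more pronounced in long-form video understanding, because the key content has to be temporally localized accurately, based on the question, within a vast amount of irrelevant information. Solving keyframe localization in a way that is both efficient and accurate is crucial for video understanding tasks.

The paper proposes AKeyS, an efficient algorithm for video understanding and analysis problems, represented here by the VideoQA task. Inspired by traditional search algorithms and modern language agents, the approach uses the cognitive abilities of language agents, such as reasoning, planning, summarization, and reflection, to guide traditional search algorithms and provide feedback to them. This methodology effectively extracts key content from redundant information.

Given a video, AKeyS splits it into several segments and extracts text information from the representative frames of each segment using a Vision Language Model (VLM) such as an image captioner. It then uses a language agent to perform temporal comparison and identifies key content in an iterative, progressive process until a termination condition is reached. The process takes the form of a tree-structured search over the video that continues until enough key information has been found to answer the question.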

Method

Background: Basic Searching Algorithms

The AKeyS algorithm is built on the basic search algorithm given in Algorithm 1. Following this basic procedure, search algorithms differ in how they determine the priority used to select nodes.

DFS (Depth-First Search): prioritizes nodes at greater depth, exploring as far as possible before backtracking
BFS (Breadth-First Search): explores all neighboring nodes at the current level before moving to the next level
GBFS (Greedy Best First Search): uses the heuristic evaluation function h(n) as the cost function (f(n) = h(n)). h(n) is the estimated cost from the current node to the goal. It can steer the search toward the goal but does not guarantee an optimal path
Dijkstra's Algorithm: uses the movement cost function g(n) as the cost function (f(n) = g(n)). g(n) is the cost of moving from the start to the current node. It takes edge weights into account and finds the shortest paths from the start node to all other nodes
A*: Dijkstra's Algorithm + GBFS. Cost function f(n) = g(n) + h(n). It balances efficiency and optimality, which makes it very effective for path planning
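The three cost-function variants above differ only in the priority f(n), which a small generic sketch makes visible. This is a textbook-style illustration on a toy graph, not Algorithm 1 from the paper; the graph, the heuristic values and the function name `best_first_search` are all made up for the example (DFS and BFS are left out because they do not use a cost function).

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, edge cost)]


def best_first_search(graph: Graph, start: str, goal: str,
                      h: Callable[[str], float],
                      mode: str) -> Optional[Tuple[List[str], float]]:
    """Generic best-first search; `mode` picks the priority f(n)."""
    def f(g: float, node: str) -> float:
        if mode == "gbfs":       # f(n) = h(n)
            return h(node)
        if mode == "dijkstra":   # f(n) = g(n)
            return g
        if mode == "astar":      # f(n) = g(n) + h(n)
            return g + h(node)
        raise ValueError(f"unknown mode: {mode}")

    frontier = [(f(0.0, start), 0.0, start, [start])]
    best_g: Dict[str, float] = {}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue  # already expanded via a path that is at least as cheap
        best_g[node] = g
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            heapq.heappush(frontier, (f(new_g, neighbor), new_g, neighbor, path + [neighbor]))
    return None


# Toy graph: S->A->G costs 6, S->B->G costs 5, but A "looks" closer to G.
toy_graph: Graph = {"S": [("A", 1.0), ("B", 4.0)], "A": [("G", 5.0)], "B": [("G", 1.0)], "G": []}
toy_h = {"S": 2.0, "A": 0.5, "B": 1.0, "G": 0.0}

results = {m: best_first_search(toy_graph, "S", "G", lambda n: toy_h[n], m)
           for m in ("gbfs", "dijkstra", "astar")}
print(results)
```

On this toy graph GBFS follows the heuristic into the more expensive path, while Dijkstra and A* both return the cheaper one, which matches the "no optimality guarantee" remark for GBFS.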

AKEYS Algorithm

Search Objective

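In AKeyS, a keyframe is defined as a frame that contains key information relevant to the question. The search objective is to identify a set of keyframes whose combined information is sufficient to answer the question.

Even when using MLLMs for the VideoQA task, non-keyframes can be discarded and one of the following two approaches chosen. The two approaches are essentially equivalent, both relying on the information contained in the keyframes and the prior knowledge learned by the model.

Feed the keyframes directly into an image-based MLLM to generate the answer
Apply a VLM such as BLIP to generate captions for the keyframes, then derive the answer from those captions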

Nodes

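In the AKeyS algorithm, the video is divided into several segments, and each video segment represents one node. The initial node is the entire video, which is first split uniformly into M segments (M is a hyperparameter). (In the actual code, 10 frames are assigned to one segment; the number of frames equals the video length in seconds because of 1 FPS sampling.) The next node to expand (the next video segment to process) is chosen according to the defined cost function f(n). (The actual code uses the A* algorithm.) Expansion means subdividing the selected video segment further. In this work, node expansion is a binary split of the segment.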

Answer Prediction

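The first and last frames of all current video segments are defined as the Visible Frames Fv. These frames are connected: the last frame of one video segment is the first frame of the next. The information in the Visible Frames can be fully used, while the information in the remaining frames is temporarily inaccessible. For the Visible Frames, one of the following two approaches can be used.

Feed the frames directly into an MLLM
First generate captions, then reason in the text modality

Either way, an answer is predicted from the information in the Visible Frames. This work chooses the second approach. The predicted answer is a tentative guess at an intermediate stage and may change as the search proceeds and more Visible Frames are revealed. Once the termination condition is met, the search ends and the predicted answer becomes the final answer. The total number of Visible Frames serves as a measure of the QA system's frame efficiency: the fewer the Visible Frames, the fewer images the MLLM has to process and the higher the efficiency. The final Visible Frames are the keyframes obtained through the search.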
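How segments, binary splits and Visible Frames relate can be sketched as below. This is my own simplified reading of the description, not the released code: the function names are hypothetical, a segment is just a (first frame, last frame) pair, and the choice of which segment to expand is hard-coded here instead of coming from a cost function.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (first frame, last frame), both inclusive


def initial_segments(num_frames: int, m: int) -> List[Segment]:
    """Split frames 0..num_frames-1 into m segments that share boundary frames."""
    bounds = [round(i * (num_frames - 1) / m) for i in range(m + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(m)]


def binary_split(segment: Segment) -> List[Segment]:
    """Node expansion: split one segment in half at its middle frame."""
    first, last = segment
    if last - first < 2:
        return [segment]  # nothing left to reveal between the two boundary frames
    mid = (first + last) // 2
    return [(first, mid), (mid, last)]


def visible_frames(segments: List[Segment]) -> List[int]:
    """Visible Frames = first and last frame of every current segment."""
    return sorted({frame for seg in segments for frame in seg})


# A 181-frame video (about 3 minutes at 1 FPS), split into M = 3 segments.
segments = initial_segments(181, 3)
before = visible_frames(segments)

# Expand the middle segment (in AKeyS this choice would come from f(n)).
segments = [segments[0]] + binary_split(segments[1]) + [segments[2]]
after = visible_frames(segments)
print(before, after)
```

Each expansion reveals exactly one new frame, which is why the Visible Frame count works as an efficiency measure.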

Figure 3. Illustration of AKEYS's cost function evaluation and node expansion steps.

Cost Function

AKEYS-GBFS

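The language agent evaluates the information in the current Visible Frames and identifies which visual information is missing for answering the question
This missing information is treated as the distance between the current node and the goal
The node with the smallest h(n) is selected for expansion
The agent identifies between which two specific Visible Frames the missing visual information is most likely located → this determines which video segment to expand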

AKEYS-DIJKSTRA

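The cost function g(n) is the cost of moving from the start to the current node
The language agent evaluates the information in the current Visible Frames → identifies which video segment shows the most pronounced scene change
When segmenting a long video with many scene transitions and extracting keyframes, the ideal is to treat each scene as its own segment → this ensures that the visual elements of the video neither overlap nor go missing within the Visible Frames, minimizing the number of Visible Frames needed and maximizing efficiency
The cost function does not consider the location of the goal, and the question is not visible to the language agent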

AKEYS-A*

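Cost function f(n) = h(n) + g(n)
Considers both the distance from the current node to the goal and the distance from the start to the current node
Which video segment is likely to contain the missing information + which video segment shows the most pronounced scene change → video segments that satisfy both conditions are prioritized for expansion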

AKEYS-BFS

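A simple algorithm that does not rely on a language agent to evaluate the cost function
Performs BFS (without pruning), continually splitting every existing video segment
Suited to cases where no language agent is available, or where the overhead incurred by the LLM is less critical and more weight is placed on guaranteeing that no information is missed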

Termination Condition

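Traditional search algorithms usually have a deterministic termination condition, such as whether the search goal has been reached. In a keyframe search algorithm for VideoQA, however, the termination condition is far more ambiguous and harder to define. It is difficult to judge whether enough information has been collected, or whether key information is missing or over-inference has occurred. Inspired by the reflection, summarization, and self-evaluation abilities of language agents, the base LLM is used to assess confidence in the predicted answer and to decide accordingly whether to terminate the search. In this way, AKEYS terminates when a sufficiently reliable prediction has been made.

The two confidence evaluation methods are combined through a voting mechanism.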

Self-Evaluation and Self-Reflection

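An LLM can be instructed to evaluate its own response and reflect on potential flaws
After an answer is generated, the question, the Visible Frame information, the LLM's previous reasoning chain, and the predicted answer are fed back into the model
The LLM then evaluates the accuracy and reliability of its previous answer and outputs a confidence score (c1)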

Temporal Summarization

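The captions of the sampled frames are discrete. To integrate them along the temporal dimension, the LLM is instructed to summarize the captions into a coherent overview of the video
Based on this summary, the LLM predicts the answer and outputs a confidence score (c2)

A voting mechanism is used to ensemble the two methods. The search terminates when both methods independently judge the confidence to be sufficient (both c1 and c2 are at or above the threshold).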
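The voting rule is simple enough to write down directly. A minimal sketch, assuming c1 and c2 are numeric scores on the same scale; the threshold value 3 below is only a placeholder I picked, not a value taken from the paper or its code.

```python
def should_terminate(c1: float, c2: float, threshold: float) -> bool:
    """Stop searching only when BOTH confidence scores reach the threshold."""
    return c1 >= threshold and c2 >= threshold


# Placeholder threshold for illustration.
THRESHOLD = 3
print(should_terminate(3, 3, THRESHOLD))  # both methods are confident
print(should_terminate(3, 2, THRESHOLD))  # temporal summarization disagrees, keep searching
```

Requiring agreement makes termination conservative: one over-confident evaluator is not enough to stop the search.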

Experiments

Table 1. Comparison between AKEYS and other methods. We highlight the gain of our method over VideoTree in blue.

Table 2. Ablation on basic search algorithms. We highlight the improvement of AKEYS-A* over the naive AKEYS-BFS in the table, emphasizing the role of the cost function evaluation.

Table 3. Ablation on termination condition

Table 4. Ablation on different base LLMs

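This work uses LLMs for both reasoning and inference in the VideoQA task. The reasoning method seems good, but when I checked the actual code, the criterion for the default iteration seemed somewhat ambiguous. If a step produces an answer that does not clear the threshold, it iterates up to 5 times, and if there is still no answer, yet another step is repeated, which felt excessive to me. The pipeline consists of 7 stages, 1_s_r → 2_s_r → 3_s_r → 4_s_r → 5_s_r → final_direct_qa → post_s_r, and there is no telling at which stage it will terminate. I think it would be better if this part were tidied up and the steps divided more systematically.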

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

ynnnxxi — Tue, 6 Jan 2026 19:50:46 +0900

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning (CVPR 2025)

Liu, Huabin, Filip Ilievski, and Cees GM Snoek. "Commonsense video question answering through video-grounded entailment tree reasoning." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

Abstract

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the notable progress of VLMs, there is growing concern that they learn spurious correlations between videos and likely answers, a concern reinforced by the black-box nature of VLMs and remaining benchmarking biases.

The method explicitly grounds the VQA task to video fragments in four steps.

entailment tree construction
video-language entailment verification
tree reasoning
dynamic tree expansion

A key benefit of the method is that it generalizes to current VLMs across reasoning types.

For fair evaluation, the paper devises a de-biasing procedure that rewrites VQA benchmark answer sets with LLMs to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks show the impact of the method's components across benchmarks, VLMs, and reasoning types.

Introduction

Figure 1. Given a video question answering task, our framework performs explicit reasoning over an entailment tree, where answer options are transformed into statements. These statements are then recursively decomposed and verified based on video-grounded evidence relevant to the question.

Input: video + question (multi-choice)
Visual Evidence Grounding
- find 'what to look at' in the video
- for verification, find the frame spans in the video that serve as evidence and align them
- toy movement → blue box span
- boy shake leg → red box span
Answer options → converted into statements (hypotheses)
Decomposition: break each statement into smaller sub-statements (hypotheses)
Statement Verification: verify each sub-statement against the video
Final conclusion: choose (a) via the verified reasoning chain

The paper proposes a video-grounded reasoning approach for commonsense video question answering (VQA). VQA has recently advanced through VLMs. However, there is growing concern that these gains stem not from reasoning but from learning shortcut associations between videos and likely answers. The concern is amplified by the black-box nature of the models, which makes a deeper understanding of their decision process difficult.

The authors say they were inspired by recent work in Natural Language Processing, where entailment trees have emerged as a mechanism for explicitly analyzing answer candidates. LLMs are used to recursively decompose a candidate into hypotheses, which are then evaluated in a natural language inference format. Entailment trees provide an explicit reasoning chain that explains the model's decision process and make each step verifiable, which addresses the concern about shortcut learning.

The proposed method explicitly grounds the VQA task to video fragments in four steps.

entailment tree construction
video-language entailment verification
tree reasoning
dynamic tree expansion

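As in Fig. 1, given a video and a multiple-choice question, a statement that acts as a first-level hypothesis is generated for each answer candidate. Each statement is decomposed iteratively, with the aim of producing sub-statements that can be verified reliably against the video. The video is decomposed into partitions, each a set of frames. Verifying a statement then comes down to grounding that statement to a video partition.

A key benefit of the method is that it generalizes to current video- and image-based VLMs across a range of reasoning types, including temporal and causal reasoning. To demonstrate video reasoning ability, the authors develop, with the support of LLMs, an answer-set de-biasing procedure that ensures a VQA benchmark is suited to in-video reasoning rather than reliant on spurious correlations.

Experiments show that the video-grounded entailment tree method consistently improves both video- and image-based baselines on both the original and de-biased benchmarks.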

Video-grounded entailment tree reasoning & De-biasing commonsense VQA answer sets

Figure 2. Overview of our framework. (a) The generation of the entailment tree, where statements are recursively decomposed until the tree reaches its max depth or meets the stop criterion. (b) The process of video-language entailment verification: the input video is first converted into textual descriptions. Each caption is then parsed into structured semantics. Given the fact statement as a query, we retrieve the anchor frame. Then, based on the temporal or causal navigation indicated by questions, the visual evidence moment can be grounded.

(a) Entailment Tree: turn each answer candidate into a 'provable statement' and keep splitting it

Statement (1st level): Q&A → converted into a statement
- the answer option is not used as-is; it is turned into a Statement (1st level)
Sub-statement (2nd level): temporal relations such as 'after' are decomposed into verifiable pieces
- blue box in the middle (Fact statement) ☞ linked to the Fact statement (F) that appears in (b)
- yellow boxes on either side ☞ statements that keep only the core action of each answer option
Sub-statement (3rd level): split more atomically so that it is easier to judge from the video
- split into simpler event units, like the red dashed boxes
- 0.6 ☞ an example of the belief/probability (score) produced by the Prover in (b)
Sub-statement (4th level): cases that go down to the smallest unit
- simplified as far as possible into verifiable units, down to the "stop criterion / max depth"

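(b) Video-language Entailment Verification: steers statement verification to happen only on the necessary span rather than on the whole video

Captioner (Cap): video frame → caption (text), given together with the question's fact (F)
- video frames go into the Captioner
- Fact statement (F) ☞ the key fact the question points to, provided to the captioner as a condition/hint
- this yields per-frame descriptions as Raw captions, which are not used directly but passed to the next step (semantic parsing)
- the semantic parsing step converts them into a structure like 'subject-relation-object'
Retrieval (Rtv): find the anchor frame
- the query uses the structured semantics derived from the Fact statement
Ground (Gnd): if the question says 'after', look after the anchor; if 'before', look before the anchor; otherwise look around it
- the evidence for deciding the answer is not scattered over the whole video but lies in a specific span determined by the temporal relation the question asks about
Prover (Prv): looks at span M and scores whether the statement is true or false
- takes the evidence span M as input and evaluates whether the statement is true or false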

Entailment tree construction

Initial statement generation

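Given a question and its answer candidates, each question-answer pair is first converted into a declarative sentence that preserves the meaning of the original QA pair. Selecting the best answer is then equivalent to identifying the correct statement for the given video.

Recursive statement decomposition

For each initial statement in the statement set, two sub-statements are generated as evidence supporting it.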

Statement ⇐ Sub-statement₁, Sub-statement₂

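The Statement is true only if both sub-statements are proven true; the sub-statements entail the parent statement. The procedure is recursive: sub-statements can themselves be decomposed into further sub-statements that entail them.

To build the entailment tree, these sub-statements are recursively decomposed as the new statements of the next tree level until the maximum depth is reached or the stop criterion is met.

Initial statement generation and statement decomposition both use LLM prompting. (see implementation details)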

Video-language entailment verification

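Given the entailment tree, the framework verifies the statements using grounded video content as evidence. Each statement in the entailment tree must be either proven or refuted by analyzing the video.

Question-aware video captioning

Given a video, the visual information is converted into detailed text. The video frames are fed to a VLM-based captioner to obtain a caption for each frame. Captioning frames in isolation can miss details important for VQA or include unnecessary information. The anchor fact indicated by the question is therefore extracted and given to the captioner as prior knowledge, steering it toward relevant captions. In addition, for each current frame, the captions of all previous frames are provided as well, so that the captioner captures temporal context from the past.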

Video evidence grounding

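In commonsense VQA, the evidence needed for the answer can be gathered from particular video moments, depending on how the question reasons around the fact statement. For temporal reasoning, the answer has to be inferred from moments before or after the time of the relevant fact. Following this intuition, a two-step evidence-grounding strategy is designed to localize the moments that are key to the answer.

Given frame-level captions, the key frame most relevant to the fact statement is retrieved; it is called the anchor frame. A naive retrieval approach would compare each caption with the fact statement using specific metrics to identify the anchor frame. The paper instead uses a structured semantic retrieval strategy.
The text descriptions of each frame and of the fact statement are converted into structured triplets. The triplets capture the attributes and relations of the objects in each frame through structured semantics.
An LLM is prompted to perform anchor frame retrieval with the fact statement's triplets as the query. The LLM then identifies and returns the most relevant frame ID (timestamp).
To reflect the temporal relations inherent in the question, the final moment to inspect is determined around the anchor frame. Starting from the anchor frame and taking the question into account, one of "look ahead, look behind, look around" is chosen for the search.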
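The second grounding step can be pictured as picking a window of frames relative to the anchor. The sketch below is only my guess at the mechanics: the window size, whether the anchor itself is included, and the mapping of "look ahead" to frames after the anchor (matching the 'after' case in the Figure 2 notes) are all assumptions, and `ground_evidence` is a hypothetical name.

```python
from typing import List


def ground_evidence(anchor: int, num_frames: int, direction: str, window: int = 4) -> List[int]:
    """Return the frame indices to inspect around the anchor frame."""
    if direction == "look_ahead":      # assumed: frames after the anchor
        lo, hi = anchor + 1, anchor + window
    elif direction == "look_behind":   # assumed: frames before the anchor
        lo, hi = anchor - window, anchor - 1
    elif direction == "look_around":   # assumed: frames on both sides, anchor included
        half = window // 2
        lo, hi = anchor - half, anchor + half
    else:
        raise ValueError(f"unknown direction: {direction}")
    lo, hi = max(lo, 0), min(hi, num_frames - 1)  # clip to the video
    return list(range(lo, hi + 1))


# Anchor frame 10 in a 32-frame video.
print(ground_evidence(10, 32, "look_ahead"))
print(ground_evidence(10, 32, "look_behind"))
print(ground_evidence(10, 32, "look_around"))
```

The returned indices would form the evidence moment M that is handed to the prover.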

Visual-text statement prover

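Given the grounded visual evidence M, each statement is evaluated as true or false. A VLM is used as the statement prover. The Prover evaluates each statement in the entailment tree by probing the VLM's internal belief about it. Each statement is converted into a binary QA task with two options, True or False. The Prover is then queried directly with the binary QA prompt, and the model's belief is derived from the next-token prediction probabilities of the option words. The predicted logits of the two options are normalized to obtain the confidence score for the statement.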
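The normalization step can be sketched as a two-way softmax over the "True" and "False" logits. Note that the post only says the two logits are "normalized"; softmax is my assumption for how that is done, and the logit values below are made up.

```python
import math


def statement_confidence(logit_true: float, logit_false: float) -> float:
    """Two-way softmax: probability mass assigned to 'True' (assumed normalization)."""
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


# Made-up logits for the two option tokens.
print(statement_confidence(2.0, 0.0))
print(statement_confidence(1.5, 1.5))
```

Equal logits give a confidence of 0.5, i.e. the prover has no preference between True and False.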

Dynamic entailment tree expansion

Figure 3. Illustration of dynamic tree generation and backtrace. In Step-3, when the proof score of the left statement calculated from its child nodes is less than its direct score (0.63 < 0.8), its decomposition is pruned and stops.

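So far, statement decomposition has been applied recursively to build an entailment tree of predefined depth. However, not every statement needs to be verified recursively, particularly statements that the VLM can easily judge as True or False. Moreover, as depth increases, some statements become atomic and directly verifiable. To make the reasoning process more efficient, a strategy of dynamically expanding the entailment tree is therefore also adopted.

Each statement is associated with two confidence scores provided by the prover:

The direct score: the prover model's belief about the statement.
The proof score: computed by multiplying the scores of the statement's direct sub-statements; it indicates how confidently the model can prove the statement.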

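For a given statement, the goal of decomposition is to build a proof path that is more reliable and convincing than having the VLM judge the statement's truth directly. During dynamic tree expansion, if a decomposition does not improve the statement's score, that decomposition is pruned and the statement node becomes a leaf of the entailment tree. This criterion ensures that only beneficial decompositions are kept, greatly improving the efficiency of tree reasoning.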
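The pruning criterion can be sketched on a small tree. This is a simplification of my own: it applies the rule to an already-built tree instead of deciding during expansion, it assumes a kept decomposition contributes its proof score upward, and `StatementNode` and `resolve` are hypothetical names. The first example reuses the numbers from the Figure 3 caption (0.63 < 0.8).

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatementNode:
    text: str
    direct_score: float                      # prover's belief in the statement itself
    children: List["StatementNode"] = field(default_factory=list)


def resolve(node: StatementNode) -> float:
    """Return the node's final score, pruning decompositions that do not help."""
    if not node.children:
        return node.direct_score
    # Proof score: product of the (resolved) scores of the direct sub-statements.
    proof_score = math.prod(resolve(child) for child in node.children)
    if proof_score <= node.direct_score:
        node.children = []                   # prune: the node becomes a leaf
        return node.direct_score
    return proof_score


# Case from the Figure 3 caption: proof score 0.9 * 0.7 = 0.63 < direct score 0.8.
pruned = StatementNode("left statement", 0.8,
                       [StatementNode("sub 1", 0.9), StatementNode("sub 2", 0.7)])
# Made-up case where the decomposition helps: 0.9 * 0.8 = 0.72 > 0.5.
kept = StatementNode("other statement", 0.5,
                     [StatementNode("sub 1", 0.9), StatementNode("sub 2", 0.8)])

pruned_score, kept_score = resolve(pruned), resolve(kept)
print(pruned_score, len(pruned.children), kept_score, len(kept.children))
```

Running `resolve` on every top-level statement and taking the one with the highest score corresponds to the final answer selection over the tree.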

Reasoning over the entailment tree

Finally, a backtrace along the entailment tree computes the confidence score of each top-level statement. The framework selects the answer corresponding to the top-level statement with the highest-scoring proof.

De-biasing commonsense VQA answer sets

Figure 4. Illustration of commonsense bias in video question answering. The example is selected from the NExT-QA dataset.

Figure 5. Prompt used for rewriting answers on NExT-QA

To demonstrate the reasoning ability of video-grounded entailment trees, evaluation on commonsense VQA benchmarks that force the model to reason is essential. Recent work has shown that VQA datasets contain shortcuts, so that VLMs can solve them through textual association rather than video-based reasoning. Even though VQA benchmarks increasingly focus on commonsense reasoning abilities such as temporal (after, before) or causal (how, why, what if) relations, these reasoning shortcuts affect the validity of the evaluation.

To this end, the authors devise a de-biasing procedure that mitigates reasoning shortcuts in commonsense VQA answer sets. The procedure transforms a multiple-choice VQA benchmark (e.g., NExT-QA) by rewriting only the incorrect answer options while keeping the question and the correct answer unchanged. An LLM (LLaMA-3) is prompted to carry out this rewriting for each original QA set.

Table 1. Impact on image and video-based VLMs on the original NExT-QA, IntentQA, and VideoMME test sets. Our framework increases accuracy of all video- and image-based VLMs by 1-4% on average across all data partitions. Temporal and action partitions benefit most.

Table 2. Results on de-biased QA sets. Video-based VLMs show significant decreases in the rewritten de-biased set. In contrast, our framework demonstrates much greater robustness on the rewritten set.

Table 3. Comparison with state-of-the-art. Results for NExT-QA and IntentQA are reported under the de-biased set (the results on the original sets are similar; we provide them in the Appendix). The 'Reasoner' in these approaches is similar to the "Prover" in our framework. The captioner for all methods is CogAgent. Despite other methods relying on much stronger reasoning models, our approach yields competitive performance (four state-of-the-art results) and high parameter efficiency (257x fewer than GPT-4 reasoners).

Table 10. Comparison results with state-of-the-art. Results for NExT-QA, IntentQA, and Video MME are reported under its original test set. The 'Reasoner' in these approaches is similar to the 'Prover' in our framework. The captioner for all methods is CogAgent. Despite other methods relying on a much stronger reasoning model, our approach yields competitive performance and reaches state-of-the-art results in four out of eight data partitions. Moreover, the reasoner we adopted is 250x smaller than the others.

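After reading this paper, my first thought was: VideoTree performs better, so why use this method at all? But numbers (performance) are not the only thing that matters; I came to think the method was proposed out of concern for whether the model really understands the essence of the task when VLMs/LLMs are used for VQA. It was a somewhat hard paper to interpret, but it seems a good read as a VideoQA paper.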

On the Faithfulness of Vision Transformer Explanations

ynnnxxi — Tue, 14 Oct 2025 10:57:24 +0900

On the Faithfulness of Vision Transformer Explanations (CVPR 2024)

Wu, Junyi, et al. "On the faithfulness of vision transformer explanations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Abstract

To interpret Vision Transformers, post-hoc explanations assign salience scores to input pixels, providing human-understandable heatmaps. However, whether these interpretations actually reflect the true rationales behind the model's output has not been sufficiently explored. To address this gap, the paper studies the faithfulness criterion of explanations.

☞ The assigned salience scores should accurately represent the influence of the corresponding input pixels on the model's prediction

To evaluate faithfulness, the paper proposes a new evaluation metric, the Salience-guided Faithfulness Coefficient (SaCo), which makes use of the essential information in the salience distribution. It performs pair-wise comparisons between different pixel groups and aggregates the differences in salience scores to indicate the explanation's degree of faithfulness.

Existing metrics struggle to distinguish advanced explanation methods from Random Attribution, and consequently fail to capture the faithfulness property.

Introduction

Figure 1. Explanation result and illustration of two perturbation manners: cumulative perturbation and our SaCo perturbation. Previous metrics perturb the pixel subsets cumulatively. In contrast, the SaCo perturbs them individually to directly compare their influences.

Top: existing approach (Cumulative Perturbation)

Bottom: proposed approach (Individual Perturbation, SaCo)

input: original image

explanation of "elephant": heatmap for the "elephant" class

→ red regions of the heatmap: regions the model regarded as important for deciding "elephant"

<Cumulative Perturbation>

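Remove pixels cumulatively, starting from the top-ranked ones by salience score
The most important 10%, 20%, 30%, ... of pixels, gradually widening the removed set

→ remove 0-90%: all of the top 90% of pixels are removed. Virtually every important region is gone and hardly anything remains

→ remove 100%: removing all pixels gives a completely gray image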

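Problems
- The individual influence of each interval cannot be separated
  - Even if one wants to compare the influence of the 0-10% and 90-100% intervals separately, the pixels of the earlier intervals have already been removed, so the effects cannot be measured in isolation
- Because of the cumulative effect, the perturbation intervals interfere with each other
  - The effect of removing the top 90% has the effects of all previously removed intervals (0-80%) mixed in → no exact comparison is possible
- Faithfulness cannot be verified at a fine granularity
  - The relative difference in influence of each pixel group (salience-rank interval) on the model prediction cannot be observed directly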

<Individual Perturbation (SaCo)>

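Perturbation is applied to each salience interval individually (independently)
Remove only 0-10%, only 10-20%, ... → each is run as an independent experiment and the model's response is measured separately

→ remove 0-10%: only the top 10% region is removed. The model's confidence in "elephant" should drop

→ remove 80-90%, remove 90-100%: pixels with low salience scores are removed. In this case the model's prediction probability is likely to barely change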

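Advantages
- Influence can be compared individually per pixel group
  - The difference in model response between 'remove top 10%' and 'remove bottom 10%' can be measured directly
  - The actual difference in each group's influence on the prediction can be quantified
- Better suited to verifying faithfulness
  - One can directly check whether regions with truly high salience have a large effect on the model prediction
  - The correlation between the magnitude of salience scores and the model's response can be evaluated
- Used to compute SaCo
  - SaCo (Salience-guided Faithfulness Coefficient) is computed from the differences in response across intervals
  - It expresses quantitatively how well the model's response agrees with the rank and magnitude of the salience scores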
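The difference between the two perturbation schemes comes down to which pixel indices are removed at each step. A small sketch on a flattened toy "image" of 4 pixels; the function names are my own, and for simplicity it assumes the number of pixels is divisible by the number of groups (any remainder would be dropped).

```python
from typing import List


def rank_groups(salience: List[float], k: int) -> List[List[int]]:
    """Split pixel indices into k equal-sized groups, most salient group first."""
    order = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    size = len(order) // k
    return [order[i * size:(i + 1) * size] for i in range(k)]


def cumulative_masks(groups: List[List[int]]) -> List[List[int]]:
    """Step i removes groups 0..i together (existing metrics)."""
    masks, removed = [], []
    for group in groups:
        removed = removed + group
        masks.append(sorted(removed))
    return masks


def individual_masks(groups: List[List[int]]) -> List[List[int]]:
    """Step i removes only group i (SaCo-style perturbation)."""
    return [sorted(group) for group in groups]


toy_salience = [0.9, 0.1, 0.5, 0.3]   # made-up salience for 4 pixels
groups = rank_groups(toy_salience, k=2)
print(cumulative_masks(groups))
print(individual_masks(groups))
```

In the cumulative case the second step removes everything, so the effect of the low-salience group can no longer be isolated; in the individual case each step touches only its own group.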

The widespread use of Transformers in computer vision underscores the need to interpret their black-box nature. This poses a challenge to traditional post-hoc interpretation methods, which were designed mainly for MLPs and CNNs.

(See post-hoc! ☞ 2025.10.08 - [Concept] - post-hoc)

There is active research on developing new explanation paradigms tailored to Vision Transformers, in which the attention mechanism plays the central role. These explanation methods integrate attention distributions to estimate salience scores for the tokens extracted from input image patches. The scores are then interpolated across pixel space to produce visually convincing heatmaps that agree well with human intuition.

Recent studies argue that it is crucial to evaluate how accurately these interpretations reflect the true reasoning process of the Transformer model, and call this property faithfulness.

To evaluate the quality of post-hoc explanations, recent studies have generally adopted an ablation approach. It involves perturbing the input image pixels that the explanation method under evaluation has identified as most or least important. For example, after perturbing the pixels with the highest salience scores, one observes whether the model's accuracy drops, which indirectly validates the explanation.

Ablation approach

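idea: an experiment that actually removes the parts marked as important

☞ a way to directly check whether the salience scores assigned by the explanation method can be trusted

Example)

Suppose the explanation method assigned a high score to a particular region (= an important region)
Perturb the pixels in that region (mask the region, add noise, etc.)
Give this image to the model again and observe how much the prediction deteriorates

→ the model's accuracy drops sharply: a truly important region was removed ☞ explanation method's faithfulness good

→ the model's accuracy barely changes or stays the same: the region marked as important is not actually important ☞ explanation method's faithfulness bad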

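Although these strategies are widely used, the paper reveals that existing methods all fail to properly assess the degree of faithfulness, and it spells out the core assumption of faithfulness.

☞ The magnitude of a salience score indicates the expected level of influence.

(i) Input pixels with higher scores are expected to influence the model's prediction more than pixels with lower scores

(ii) Two pixel groups with a larger difference in salience scores are expected to differ more in their influence on the model's prediction

To meet these desiderata, a comprehensive evaluation of faithfulness needs the following two things.

(i) explicitly compare the influence of input pixels with different salience magnitudes

(ii) quantify the differences in salience scores so as to reflect the expected difference in influence

Existing metrics, however, fall short on both counts: they rely on cumulative perturbation and ignore the information in the magnitude distribution.

Recognizing that faithfulness is essential for correctly explaining model behavior, the paper proposes a new evaluation framework, the 'Salience-guided Faithfulness Coefficient (SaCo)'.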

SaCo는 설명 기법이 model의 행동과 얼마나 일치하는지를 분석한다. 제안된 지표는 서로 다른 중요도 점수를 가진 pixel subset에 대한 통계적 분석을 수행하고, 이들의 model prediction에 대한 양향을 비교함으로써 작동한다. 중요도 점수 분포는 해당 pixel들의 실제 영향과의 alignment 정도에 따라 평가된다.

→ 높은 중요도 점수를 가진 pixel subset이 낮은 점수를 가진 subset보다 model prediction에 더 큰영향을 미친다면 (기대한대로) 해당 pair는 faithfulness 기준을 만족하는 것으로 간주

결과적으로, 두 subset 간의 중요도 점수 차이 (기대의 정도)는 측정된 결과에 positive accumulation을 한다. 반대로 기대를 충족하지 못한 pair는 violator로 식별되어 결과에 negative contribution을 한다. 따라서 SaCo는 서로 다른 pixel 간 명시적 비교를 포함하고, 이들의 예상 영향 차이를 포착함으로써 core assumption validity를 검증하는 데 적합하다.
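
The pairwise comparison described above can be written down as a short sketch. It follows only what these notes state: pairs that meet the expectation add their salience difference, violators subtract it. How the subsets are built, how the impact of a subset is measured (here just a number per subset, for example the drop in predicted probability after perturbing it), and the normalisation by the total absolute weight are assumptions made for illustration.

```python
import numpy as np

def saco(subset_salience, subset_impact):
    """Salience-weighted agreement between salience ranking and measured impact.

    subset_salience[i]: total salience of pixel subset i (assumed precomputed).
    subset_impact[i]:   measured effect on the prediction when subset i is perturbed.
    Returns a value in [-1, 1]; 1 means every pair matches the expectation.
    """
    s = np.asarray(subset_salience, dtype=float)
    d = np.asarray(subset_impact, dtype=float)
    order = np.argsort(-s)            # sort subsets by salience, highest first
    s, d = s[order], d[order]

    score, total = 0.0, 0.0
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s)):
            weight = s[i] - s[j]      # expected difference in impact (>= 0)
            if d[i] >= d[j]:
                score += weight       # expectation met: positive contribution
            else:
                score -= weight       # violator: negative contribution
            total += abs(weight)
    return score / total if total > 0 else 0.0

faithful = saco([4, 3, 2, 1], [0.4, 0.3, 0.2, 0.1])   # -> 1.0
reversed_ = saco([4, 3, 2, 1], [0.1, 0.2, 0.3, 0.4])  # -> -1.0
mixed = saco([3, 2, 1], [0.5, 0.1, 0.3])              # -> 0.5
print(faithful, reversed_, mixed)
```

In the `mixed` case, the pairs (3, 2) and (3, 1) agree with the expectation and contribute +1 and +2, while the pair (2, 1) is a violator and contributes -1, so the result is 2 / 4 = 0.5.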

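Contribution

Develops SaCo, a new metric that evaluates how well an explanation method conforms to the core assumption of faithfulness
Shows experimentally that SaCo clearly separates meaningful explanation methods from Random Attribution, providing a useful benchmark
Reveals that some design elements of current attention-based explanation methods can change faithfulness, and highlights the importance of gradient information and aggregation rules. This suggests directions for future work on Vision Transformer interpretability.

I only wrote down the concepts briefly. I read this paper to review it in the AI리버스 class, and I found it fascinating that a field like this exists and that people do research on it...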

post-hoc

ynnnxxi — Wed, 8 Oct 2025 22:33:12 +0900

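post-hoc (after-the-fact interpretation): a way to interpret, after the model has already been trained (post-hoc),

"why the model made this prediction"

It does not retrain or modify the model internals; it only interprets an already trained model!

☞ The focus is on explaining the result without changing the model

Example)

Suppose a Vision Transformer predicts 'this image is a cat'.

A post-hoc method leaves everything as it is (without changing the decision the model already made)

and finds "which part of the image (pixel, patch, ...) influenced the 'cat' decision the most?"

☞ To do so, it computes an importance score (salience score) per pixel and visualizes the result as a heatmap

☞ A person looking at this heatmap can then think, 'the model looked at the cat's face and decided it was a cat.'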
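
As a small illustration of the heatmap step, the sketch below turns per-patch importance scores into a pixel-level map. It is a simplification under stated assumptions: the patch scores are assumed to be given already, and nearest-neighbour upsampling via `np.kron` stands in for the smoother interpolation an actual explanation method would use.

```python
import numpy as np

def patch_scores_to_heatmap(patch_scores, patch_size):
    """Min-max normalise patch scores and repeat each one over its pixel block."""
    scores = np.asarray(patch_scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)
    # Each patch score becomes a patch_size x patch_size block of identical values.
    return np.kron(norm, np.ones((patch_size, patch_size)))

# Hypothetical 2x2 grid of patch scores, each patch covering 3x3 pixels.
heatmap = patch_scores_to_heatmap([[0.0, 1.0], [2.0, 4.0]], patch_size=3)
print(heatmap.shape)
```

For a ViT with 16x16 patches on a 224x224 input, the same call with a 14x14 score grid and `patch_size=16` gives a 224x224 map that can be overlaid on the image.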

Question Aware Vision Transformer for Multimodal Reasoning

ynnnxxi — Wed, 24 Sep 2025 22:24:34 +0900

Question Aware Vision Transformer for Multimodal Reasoning (CVPR 2024)

Ganz, Roy, et al. "Question aware vision transformer for multimodal reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024.

Abstract

Vision-Language models have enabled remarkable progress in multimodal reasoning. These architectures typically consist of a vision encoder, an LLM, and a projection module that aligns visual features with the LLM's representation space.

Despite this progress, a critical limitation remains. ☞ The vision encoding process is decoupled from the user query (often in the form of an image-related question). As a result, the generated visual features may not be optimized for the query-specific elements of the image.

To address this, the paper proposes a Question Aware Vision Transformer approach (QA-ViT) for multimodal reasoning, which embeds question awareness directly inside the vision encoder. This integration yields dynamic visual features that focus on the image aspects relevant to the posed question.

QA-ViT is model-agnostic and can be incorporated efficiently into any VLM architecture.

Introduction

Figure 1. Question-Aware Vision Encoding. Comparative illustrations for VQAv2 (upper) and TextVQA (lower) predictions of ViT+T5 and QA-ViT+T5 models. Employing GradCAM highlights the focus areas with respect to key terms in the posed questions. This vividly demonstrates the motivation behind QA-ViT: enhancing ViT with the question enables it to focus on the relevant image aspects, resulting in more accurate predictions.

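A figure showing the core idea and effect of question-aware vision encoding.

Image → ViT → LLM: the usual VLM setup. The ViT extracts features from the image and passes them to the LLM, which generates the answer. Only the LLM sees the question.
Image → ViT → QA-ViT → LLM: the proposed setup. The question is also injected on the vision side (QA-ViT), so the visual features are adjusted to the question (= question-aware encoding)

Problem (existing): the vision encoder is decoupled from the question, so the visual features are not tuned to the question's keywords (nose, top blue sign..) → attention lands in the wrong place, wrong answer
Solution (proposed): QA-ViT feeds the question directly into the vision encoding stage → stronger attention on question-relevant regions → accurate predictions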

In recent years, VLM architectures have become a central research area and have driven substantial progress in multimodal reasoning. These architectures fundamentally aim to bridge the gap between visual and textual data, enabling a model to interpret, understand, and generate content based on both visual and textual information. This fusion of modalities supports a wide range of tasks.

At the core of multimodal VL architectures is the concept of vision-language modeling. Such models typically involve three essential steps.

1. A unimodal vision architecture extracts meaningful information from the image. The vision encoder is usually a frozen ViT, often based on CLIP.

2. A projection module bridges the gap between vision and language, converting the visual features into ones the language model can understand and process. This module is usually a simple linear layer, an MLP, or a cross-attention-based transformer architecture.

3. The projected visual information and the textual instruction are fed into the LLM, which completes the task.

The success of these models depends not only on the ability to understand visual content, but also on the ability to understand it through the lens of the accompanying textual instruction, which often requires focusing on fine-grained details within the image. Existing architectures are suboptimal in this respect: they perform vision encoding unaware of the posed question, and output visual features that are not optimally aligned with the user query.

Because the vision encoder outputs a fixed-size feature sequence, the amount of information it can encode is limited. Given the relatively high abstraction level, it is likely to ignore or overlook low-level details of the image. This is especially problematic when nuanced image understanding is essential to answering the question accurately. The authors therefore argue that the vision encoder should change from a single-input function into a conditional function.

To mitigate this limitation and obtain text-conditioned vision encoding, the paper proposes the Question Aware Vision Transformer (QA-ViT) for multimodal reasoning. If the model understands the posed question and its inherent context, it can extract visual features that correspond directly to the image aspects needed for the correct answer. To illustrate this, GradCAM is applied to both the vanilla CLIP-based ViT and QA-ViT, with textual prompts that correspond to distinct spatial locations. The baseline tends to favor high-abstraction-level features even when prompted with region-specific descriptions, whereas QA-ViT focuses far more on the relevant image parts.

The approach integrates textual representations directly into any vision encoder while keeping most of it frozen, which preserves its visual understanding capability. In practice, it reuses the self-attention mechanism already present in the ViT so that it also attends to the textual encodings representing the user query.

To demonstrate the effectiveness of QA-ViT, the authors exploit its model-agnostic nature and integrate it into top-performing systems such as BLIP2, InstructBLIP, and LLaVA-1.5. They also integrate QA-ViT into a simple ViT+T5 architecture, without pretraining, to show the benefit when training an unaligned VL system from scratch.

All architectures are trained on a combination of visual question answering and image captioning datasets that require visual and Optical Character Recognition (OCR) understanding, and are evaluated accordingly.

Method

Figure 2. Method overview. A high-level illustration of the QA-ViT (highlighted in orange) incorporated into a general VL architecture (depicted in blue). This is achieved by encoding the question Q into features FQ, which are fused into the vision encoder, resulting in question-aware visual features FVQ.

Initial input: image + question Q
Question Encoding
- Encode the question Q, then pass it through per-layer MLPs to project it to the same dimension as the vision tokens
- The result is a textual representation ready to be injected into the vision encoder
Question Fusing
- In the top-L self-attention layers of the vision encoder, concatenate the visual tokens and question tokens and pass them through the frozen self-attention
- The visual tokens attend to the question tokens and are adjusted, yielding intermediate question-aware visual features
- A parallel gated projection (head) blends them in stably to give the final question-aware visual features
Vision Encoder
- The base ViT itself stays frozen (preserving its original capability)
- Only the self-attention input of the upper layers is extended/injected
Projection Module
- Projects the question-aware visual features into the representation space the LLM understands
LLM
- Takes the projected question-aware visual features and the text instruction as input and generates answer tokens
- Produces the output

Figure 3. Textual representation fusing. Left: General scheme of the ViT encoder. Right: Zoom in to our fusing mechanism in one of the top-L self-attention layers. The M visual features from the previous layer Fv, are concatenated with K textual features FQ and fed into the frozen self-attention mechanism to obtain M text-attended visual representations F'VQ. Next, a parallel gated projection obtains the question-aware visual features of FVQ.

The process of injecting question tokens into the top-L self-attention layers of the ViT to build question-aware visual features.

Left gray box
- A standard ViT encoder: a stack of blocks, each consisting of self-attention followed by an FFN.
- The proposed method is a late fusion that fuses only in the top-L layers
Right blue box: a top-L layer
- Blue tokens: the M visual tokens from the previous layer (M x C)
- Orange tokens: the K question tokens projected into the vision space by Question Encoding (K x C)
- Attention: frozen self-attention
- Projection: the original frozen projection head
- Gated Projection: a newly added learnable residual projection + tanh
  - Learns how much, and in which direction, to adjust the question-aware visual features
  - The gate controls the magnitude
    - As training proceeds and β (a learnable parameter) grows, this path contributes more
  - Appending question tokens to the self-attention input risks changing the output substantially
  - The frozen P(.) acts as a safe anchor to the original representation
  - Pg(.) tanh( β ) injects only as much new information as needed, preventing a performance collapse
    - Pg(.): residual
    - tanh( β ) : gate
- F'VQ: the intermediate question-aware visual representation after attention
- FVQ: the final question-aware visual features

Overall order
- visual token + question token (concat) → self-attention's input
- frozen self-attention
- Keep only the attention outputs that correspond to the visual representation (take the first M as the intermediate representation)
- projection + gating

Overall Architecture

The method consists of two fundamental components.

First, the question, denoted Q, is fed into the Question Encoding module, which processes and projects the textual prompt to bridge the gap between the linguistic and visual feature domains. Second, the encoded textual features are integrated inside the frozen vision model through the Question Fusing module, producing text-aware visual features. Finally, these features are projected by the projection module, concatenated with the instruction embeddings, and fed into the LLM, which produces the overall system output. In general, QA-ViT modifies only the vision encoder and leaves the rest of the architecture unchanged.

Question Encoding

To introduce natural language prompts into a unimodal vision transformer, a two-step process is proposed.

1. Question Representation

Encode the natural language prompt (the question, ...) into a meaningful representation, using the encoder or embeddings of the existing LLM, or a designated language model.

2. Representation Projection

Project the textual representation into the vision model's feature space with an MLP. Because of the vision model's hierarchical structure, different layers have different abstraction levels, so a separate MLP per layer is used to obtain better alignment.

Question Fusing

Given the projected textual representations, a parameter-efficient fusing mechanism is proposed to integrate them into a frozen ViT architecture in a model-agnostic way. Keeping the vision encoder frozen enables text-conditioned encoding of the image while fully preserving the model's original capabilities.

Fusing Mechanism

The input to the self-attention layer is extended to the concatenation of the visual representations and the projected textual representations, giving a sequence of length K+M that contains both visual and question information. Applying the frozen self-attention mechanism to this sequence produces attention scores and outputs that also attend to the textual information, which enables cross-modal attention. Only the attention outputs that correspond to the input visual representations are kept.

In parallel with the existing fixed projection head, an additional projection and a learnable gating mechanism are introduced. This module compensates for the distribution shift caused by introducing question information into the fixed self-attention layer. The purpose of the gating is to blend the residual projection information gradually with the existing information, avoiding large feature changes and an overall performance drop. The gating multiplies the output of the additional projection layer by tanh( β ).
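
The fusing steps above (concatenate, frozen self-attention, keep the first M outputs, then projection plus gated residual) can be sketched with NumPy. This is a minimal single-head sketch under stated assumptions, not the authors' implementation: the weight matrices are random placeholders, the names `w_qkv`, `w_proj`, and `w_gate_proj` are hypothetical, and starting the gate at β = 0 is an assumption used here to show the "closed gate" case.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention_layer(f_v, f_q, w_qkv, w_proj, w_gate_proj, beta):
    """f_v: (M, C) visual tokens, f_q: (K, C) projected question tokens."""
    m = f_v.shape[0]
    tokens = np.concatenate([f_v, f_q], axis=0)           # (M + K, C)
    q, k, v = (tokens @ w for w in w_qkv)                 # frozen attention weights
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (M + K, M + K)
    attended = (attn @ v)[:m]                             # F'_VQ: keep the M visual rows
    frozen_path = attended @ w_proj                       # P(.): original frozen head
    gated_path = (attended @ w_gate_proj) * np.tanh(beta) # Pg(.) * tanh(beta)
    return frozen_path + gated_path                       # F_VQ

rng = np.random.default_rng(0)
M, K, C = 4, 3, 8
f_v, f_q = rng.normal(size=(M, C)), rng.normal(size=(K, C))
w_qkv = tuple(rng.normal(size=(C, C)) * 0.1 for _ in range(3))
w_proj, w_gate_proj = rng.normal(size=(C, C)), rng.normal(size=(C, C))

out_closed = fused_attention_layer(f_v, f_q, w_qkv, w_proj, w_gate_proj, beta=0.0)
out_open = fused_attention_layer(f_v, f_q, w_qkv, w_proj, w_gate_proj, beta=1.0)
print(out_closed.shape, np.abs(out_open - out_closed).max() > 0)
```

With β = 0 the gated path contributes nothing and only the frozen projection is used; as β moves away from 0 the residual path is blended in, which is the gradual blending the paragraph describes.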

Integration Point

An important design choice in the fusing mechanism is where the textual representations are integrated into the vision transformer layers. Specifically, late fusion is used: for an N-layer ViT, fusion is applied in the top L self-attention layers, where L < N.

Lower layers mainly capture low-level visual details, while higher layers focus on high-level concepts. The tendency to overlook fine-grained details is therefore expected to appear in the higher layers.

This paper also uses a question-aware concept.

The 'MovieChat+' paper says 'encode each modality separately, compute the similarity between the question and the frames, and focus on the frames related to the question', whereas this paper says 'provide the question information from the encoding stage onward, and use the resulting question-aware visual features'.

Both are ways to use question-aware visual information. Reading these two papers gave me some hints about how question-aware information can be obtained.

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

ynnnxxi — Wed, 24 Sep 2025 16:47:37 +0900

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering (TPAMI 2025)

Song, Enxin, et al. "Moviechat+: Question-aware sparse memory for long video question answering." IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).

Abstract

Integrating video foundation models with large language models to build video understanding systems can overcome the limitations of specific vision tasks. However, existing methods either use complex spatial-temporal modules or rely heavily on additional perception models to extract visual features for video understanding, and they perform well only on short videos. For long videos, the computational complexity and memory cost associated with long-term temporal connections increase significantly, posing additional challenges.

The paper takes advantage of the hierarchical memory structure of the Atkinson-Shiffrin memory model, using Transformer tokens, in combination, as the carriers of memory. ☞ The paper proposes MovieChat, with a training-free memory consolidation mechanism, to overcome these challenges.

It merges adjacent frames over time, turning dense frames from short-term memory into sparse tokens in long-term memory. It also uses a zero-shot approach with no trainable modules to extend a pretrained large multi-modal model to long video understanding.

The paper additionally proposes a benchmark, MovieChat-1K, and achieves SOTA on long video understanding.

Introduction

Fig. 1. Video random-access memory (VRAM) cost under gigabyte (GB) (y-axis) v.s. frame number (x-axis) comparison. We test the visual-only inference of all methods at a resolution of 224 x 224 without frame sampling. While the previous method can only support around 100 frames of inference, MovieChat can handle videos with > 10K frames on a 24GB graphics card. MovieChat has a 10000 x advantage over other methods in terms of the average increase in VRAM cost per frame (21.3KB to ~ 200MB per frame).

A graph comparing how VRAM usage (y-axis) grows as the number of frames (x-axis) increases

→ With MovieChat (this method), VRAM usage stays stable and barely increases as the number of frames grows
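
The per-frame figures quoted in the caption can be turned into a quick back-of-the-envelope check. This is only rough arithmetic under stated assumptions: decimal units (1 GB = 10^9 bytes), the whole 24 GB budget spent on frames, and no fixed cost for the model itself, which a real run would of course have.

```python
GB, MB, KB = 10**9, 10**6, 10**3

def max_frames(vram_budget_bytes, bytes_per_frame):
    """How many frames fit if VRAM grows linearly with the frame count."""
    return int(vram_budget_bytes // bytes_per_frame)

budget = 24 * GB
baseline_frames = max_frames(budget, 200 * MB)     # ~200 MB per frame
moviechat_frames = max_frames(budget, 21.3 * KB)   # 21.3 KB per frame
per_frame_ratio = (200 * MB) / (21.3 * KB)

print(baseline_frames)         # 120
print(moviechat_frames)        # a bit over 1.1 million
print(round(per_frame_ratio))  # on the order of 10^4
```

The 120-frame result matches the caption's "around 100 frames" for earlier methods, and the per-frame ratio lands near the quoted 10000x advantage.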

Introducing multiple modalities into LLMs extends them to Multimodal Large Language Models (MLLMs), which enables multimodal reasoning and understanding. MLLMs have shown remarkable capabilities on a variety of multimodal tasks. Compared with LLMs and other task-specific models, MLLMs offer more human-like interpretation of scenarios, a user-friendly interface, and a broader range of capabilities.

Existing vision-centric MLLMs follow a paradigm that combines a pre-trained LLM and a visual encoder with either additional trainable modules or a simple projection layer. Methods that build MLLMs under this paradigm, as well as work under other paradigms, rely heavily on complex spatial-temporal modules or additional perception tools to obtain visual information for video understanding. When processing long videos, the computational complexity and memory cost associated with long-term temporal connections increase significantly, posing additional challenges. Moreover, standardized benchmarks for evaluating such systems are lacking.

According to the authors, this work is the first to tackle the long video understanding task (>10K frames). They argue that computational complexity, memory cost, and long-term spatial-temporal connections are the main challenges in long video understanding. MovieChat uses a zero-shot approach, with no additional trainable temporal module, to let a pre-trained MLLM understand long videos.

The Atkinson-Shiffrin model describes human memory in three stages:

1. Sensory memory briefly holds raw input

2. Short-term memory manages a limited amount of information for processing

3. Long-term memory provides extensive, permanent storage

Inspired by this hierarchical structure, the paper proposes a memory mechanism for long video understanding.

It consolidates dense short-term tokens into long-term memory tokens by merging temporally adjacent frames according to their similarity to the encoded question. The mechanism consists of a rapidly updated short-term memory and a more compact long-term memory.

The updated version, MovieChat+, designs a vision-question matching-based memory consolidation mechanism to improve the compactness of the memory. This mechanism anchors the vision-language model's predictions to the relevant visual content.

MovieChat+ also outperforms other existing methods in terms of VRAM cost.

Contribution

Proposes 'MovieChat', the first framework designed to support long videos (> 10K frames), by combining a pre-trained MLLM with a zero-shot, training-free memory consolidation mechanism.
The updated version 'MovieChat+' uses a training-free vision-question matching-based memory consolidation technique to improve memory compactness.
Releases 'MovieChat-1K', the first long video understanding benchmark.

MovieChat

Fig. 3. Illustration of MovieChat, a completely training-free framework with question-aware memory consolidation mechanism. MovieChat extracts video features with a temporal sliding window and represents them in token form, which are then sequentially fed into the short-term memory frame by frame. When the fixed-length short-term memory reaches its preset limit, the earliest tokens are popped and consolidated into the long-term memory. Our approach incorporates two distinct inference modes: the global mode, which exclusively utilizes the long-term memory, and the breakpoint mode, which additionally incorporates the current short-term memory as part of the video representation. The breakpoint mode allows for understanding the video at a specific moment in time. After passing through a projection layer, the video representation is inputted into a large language model for interaction with the user.

Overview

The proposed method, MovieChat, consists of several key components: a frame-wise visual feature extractor, a short-term memory module, a long-term memory module with a question-aware consolidation strategy, a video projection layer, and a Large Language Model (LLM). (See Fig. 3.)

To avoid having to store many frames in GPU memory and RAM at once, a sliding window approach is used to process the video efficiently.

Two inference modes are supported: breakpoint mode and global mode.

Breakpoint mode is used to understand a specific moment in the video and answers based on that frame or scene.

Global mode is used to understand the video as a whole, enabling a comprehensive understanding of the overall content and context.

Visual Feature Extraction

To extract visual features, instead of using video-based foundation models, an image-based model is simply used to obtain frame-wise features in token form.

ViT-G/14 (EVA-CLIP) and the Q-former from BLIP-2 are used as the visual feature extractor. This choice is made because video foundation models that align well with text are rare, and because the proposed memory mechanism can capture temporal features effectively. Visual features are extracted with a sliding window.

Short-term Memory

Short-term memory keeps K frames in a fixed-length buffer that temporarily stores frame tokens. The visual features extracted over G sliding-window steps, with no further processing, make up the short-term memory.

When a new batch of visual tokens arrives and the short-term memory has reached its capacity, the currently stored frames are sent to the memory consolidation module and the short-term memory is cleared. The video features produced by the memory consolidation module augment the long-term memory and re-initialize the short-term memory. The re-initialization is done to carry information across different sliding windows. This process allows more efficient compression.
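
The buffer behaviour described above can be sketched as a small class. This is a sketch under stated assumptions rather than MovieChat's code: the `consolidate` callback here is a placeholder that averages neighbouring pairs, and re-initialising the buffer with the last consolidated feature is one hypothetical way to carry information across windows.

```python
import numpy as np

class ShortTermMemory:
    """Fixed-length frame buffer that flushes into long-term memory when full."""

    def __init__(self, capacity, consolidate):
        self.capacity = capacity        # K in the notes
        self.consolidate = consolidate  # maps a list of frames to fewer frames
        self.frames = []
        self.long_term = []

    def push(self, frame_feature):
        self.frames.append(frame_feature)
        if len(self.frames) == self.capacity:
            merged = self.consolidate(self.frames)
            self.long_term.extend(merged)
            # Assumption: restart the buffer from the last consolidated feature
            # so the next window still sees some of the previous one.
            self.frames = [merged[-1]]

def average_pairs(frames):
    """Placeholder consolidation: average each pair of neighbouring frames."""
    return [(frames[i] + frames[i + 1]) / 2.0 for i in range(0, len(frames) - 1, 2)]

memory = ShortTermMemory(capacity=4, consolidate=average_pairs)
for t in range(7):
    memory.push(np.full(2, float(t)))   # hypothetical 2-dim frame features

print(len(memory.long_term), len(memory.frames))
```

After seven pushes the buffer has been flushed twice, so the long-term memory holds four consolidated features and the short-term buffer holds the single carried-over feature.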

Question-aware Long-term Memory (MovieChat+)

Long-term memory helps avoid the catastrophic knowledge forgetting problem, which is important for handling long video understanding tasks. The features stored in short-term memory are dense tokens, but because of GPU memory and computation cost limits, it is not feasible to store every token leaving short-term memory in the long-term memory buffer. In addition, videos exhibit substantial temporal redundancy.

In practice, only part of a long video's content is relevant to a given question. For this reason, the updated version MovieChat+ merges adjacent frames based on their relevance to the specific question. This simplifies the video feature representation and improves encoding efficiency. The method converts dense tokens into a sparse memory centered on the relevant question, which is stored in long-term memory.

A pre-trained text encoder encodes the question Q into the same embedding space as the visual features.

Then the average cosine similarity between each frame feature in short-term memory and the encoded question is computed.

When watching a video with a question in mind, people generally skip the segments that are largely irrelevant.

If the visual features in short-term memory are highly relevant to the question, fewer of them are merged on the way to long-term memory; otherwise more are merged. Relevance to the question is assessed by comparing the average similarity against a threshold.
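
The relevance test can be sketched as follows. This is a simplified illustration, not the paper's exact rule: each frame is reduced to a single feature vector, and the `keep_relevant` / `keep_irrelevant` counts are hypothetical knobs that stand for "merge less" and "merge more".

```python
import numpy as np

def mean_question_similarity(frame_feats, question_feat):
    """Average cosine similarity between every frame feature and the question."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    return float((f @ q).mean())

def frames_to_keep(frame_feats, question_feat, threshold, keep_relevant, keep_irrelevant):
    """Relevant windows are compressed less (more frames survive the merge)."""
    sim = mean_question_similarity(frame_feats, question_feat)
    return keep_relevant if sim >= threshold else keep_irrelevant

question = np.array([1.0, 0.0])
relevant_window = np.array([[2.0, 0.0], [3.0, 0.0]])     # points along the question
irrelevant_window = np.array([[0.0, 1.0], [0.0, 4.0]])   # orthogonal to the question

kept_relevant = frames_to_keep(relevant_window, question, 0.5, keep_relevant=8, keep_irrelevant=2)
kept_irrelevant = frames_to_keep(irrelevant_window, question, 0.5, keep_relevant=8, keep_irrelevant=2)
print(kept_relevant, kept_irrelevant)
```

The returned count would then serve as the target length for the merge step, so question-relevant segments end up with a denser representation in long-term memory.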

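Afterwards, memory consolidation is performed periodically by merging the most similar tokens in adjacent frames. Here the average cosine similarity is computed over the N embedded tokens, since the tokens can effectively summarize the information of each frame.

The goal is to keep M frames after each merge operation, which also embed the rich information stored in long-term memory. The merge is repeated within each consolidation operation until the number of tokens reaches the predefined value M, and the result is the output video feature.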
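
A greedy version of this merge loop can be sketched with NumPy. It is a sketch under stated assumptions: each frame is an (N, C) array of tokens, the most similar adjacent pair is picked greedily, and merging by plain averaging is an illustrative choice that the notes do not specify.

```python
import numpy as np

def adjacent_similarity(a, b, eps=1e-8):
    """Cosine similarity per token position, averaged over the N tokens."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float((num / den).mean())

def consolidate(frames, target_len):
    """Merge the most similar adjacent frames until only `target_len` remain."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    while len(frames) > target_len:
        sims = [adjacent_similarity(frames[i], frames[i + 1])
                for i in range(len(frames) - 1)]
        i = int(np.argmax(sims))                       # most redundant neighbours
        frames[i:i + 2] = [(frames[i] + frames[i + 1]) / 2.0]
    return frames

# Four hypothetical frames with one 2-dim token each: two near-duplicate pairs.
frames = [np.array([[1.0, 0.0]]), np.array([[1.0, 0.01]]),
          np.array([[0.0, 1.0]]), np.array([[0.01, 1.0]])]
merged = consolidate(frames, target_len=2)
print(len(merged), merged[0], merged[1])
```

The two near-duplicate pairs collapse into one frame each, while the dissimilar boundary between the second and third frame is left alone, which is the temporal-redundancy removal the text describes.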

The dense tokens of the entire video are compressed to varying degrees, depending on their similarity to the question, and stored in long-term memory.

Inference

Previous methods always use a representation of the entire video for understanding and question answering. This gives a broad overview, but struggles to localize specific moments or details in a long video accurately. The paper therefore proposes two inference modes for long video understanding: global mode and breakpoint mode.

Global mode is defined as understanding and question answering over the entire video. The focus in this mode is on capturing details across the full length of the video, so only the long-term memory is used as the video representation.

Breakpoint mode is defined as understanding a specific moment within the video. Because events are continuous, one must consider not only the information directly related to that moment, stored in short-term memory, but also the indirectly related information stored in long-term memory. So when querying at a specific time, the video representation uses the short-term memory, the long-term memory, and the current video frame feature together. The authors observed that simply concatenating these elements gives excellent results.
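
The difference between the two modes comes down to which memories are concatenated. The sketch below illustrates that, with assumptions: features are (frames, C) arrays, and the concatenation order is a guess since the notes only say the three parts are concatenated.

```python
import numpy as np

def video_representation(long_term, short_term=None, current_frame=None, mode="global"):
    """Select the memories that form the video representation for a given mode."""
    if mode == "global":
        return long_term                                   # long-term memory only
    if mode == "breakpoint":
        # Assumed order: long-term context, short-term context, then the current frame.
        return np.concatenate([long_term, short_term, current_frame], axis=0)
    raise ValueError(f"unknown mode: {mode}")

long_term = np.zeros((6, 4))       # hypothetical 6 consolidated frames, C = 4
short_term = np.ones((3, 4))       # hypothetical 3 buffered frames
current_frame = np.full((1, 4), 2.0)

global_rep = video_representation(long_term)
breakpoint_rep = video_representation(long_term, short_term, current_frame, mode="breakpoint")
print(global_rep.shape, breakpoint_rep.shape)
```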

A New Benchmark: MovieChat-1K

A new benchmark for evaluating long-form video understanding.

Scale/composition:
- 1K video clips (movie, TV, etc.)
- 14K manual annotations (question/answer, caption, etc.)
- Update (+): 2K temporal grounding labels added (MovieChat+)
Temporal labels:
- Assigned only to breakpoint mode questions (not needed for global mode)
- Annotated only on the val/test sets (for evaluation, not training)
- Temporal segments (start/end) marked for 2K Q/A pairs across 200 videos
- Length of segments: mostly under 12s, 6.3s on average (the full videos are about 700s)

Limitation

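It is still an early-stage prototype and has several limitations.

1. Limited perception capacities: the approach's performance is bounded by the pre-trained short video understanding model.

2. Inadequate Time Processing: it gives only rough estimates of the proportion of time that events occupy within a long video, and lacks precision in temporal details.

3. Inefficient Reprocessing for New Questions: the entire video must be reprocessed every time a new question is asked.

Honestly, I picked up this paper just from the title, because the keywords 'Question-aware' and 'Video Question Answering' caught my interest. I found it while looking for question-aware concepts... but it did not quite live up to my expectations. The interesting points were the split of memory into short-term and long-term and the dynamic chunking based on similarity to the question, but having to reprocess the whole video for every new question, even on the same video, felt like a major limitation. Long videos are long by definition, so reprocessing each time would make inference slow. Maybe it is unavoidable?

Also, maybe because it has been so long since I read a journal paper.. it definitely feels different from a conference paper. Conference papers, probably because of the page limit, feel like they pack in only the condensed essentials, while journal papers feel like they spoon-feed you in more detail. On the other hand, journal papers also feel somewhat more complicated.

My two takeaways from this paper are that it makes question-awareness dynamic, chunking video frames by their similarity to the question, and that it splits memory into short and long term.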

Atkinson-Shiffrin

ynnnxxi — Wed, 24 Sep 2025 13:49:55 +0900

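Atkinson-Shiffrin model: a multi-store theory that describes human memory as an information-processing pipeline of 'input → stores → output'

** Key idea **

A stimulus passes through "sensory memory → short-term memory (working space) → long-term memory", and control processes regulate the flow between them

<The three stores>

1. Sensory memory

Very brief retention, per input channel
Captures a broad 'afterimage' of the scene; only information that receives attention moves on to the next stage

2. Short-term/Working memory

Control processes: rehearsal, chunking, etc. maintain/transform the information
Duration: a few seconds to tens of seconds (lost without rehearsal)

3. Long-term memory

Relatively durable, with a large capacity
Stored as semantic/episodic/procedural knowledge, etc.
On retrieval, it is brought back into short-term memory for use

<Control processes>

Selection: choosing what moves from 'sensory → short-term'
Rehearsal: maintenance in short-term memory, or transfer through meaningful association
Encoding: making meaning, visualizing, organizing (hierarchies/mind maps), etc...
Retrieval cues: designing the context/cue used at retrieval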

VRAM

ynnnxxi — Wed, 24 Sep 2025 13:38:48 +0900

VRAM: Video Random Access Memory ☞ Video RAM

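* System RAM: the CPU's main memory.

VRAM is dedicated memory on the graphics card.

It is used so that the GPU can read and write data quickly, without stalling, during graphics/video rendering and deep learning computation !!

More VRAM allows higher resolutions and graphics quality settings, and because the GPU accesses VRAM with far higher bandwidth than system memory (RAM), it improves performance.

Main consumers (in deep learning): model weights, optimizer state, intermediate tensors/caches, etc..
Avoiding OOM (Out Of Memory) errors
Monitoring: nvidia-smi shows total and per-process VRAM usage. (nvitop works too, with nicer visualization!)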
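
A rough way to see why model weights and optimizer state dominate VRAM is to multiply the parameter count by the bytes stored per parameter. The sketch below is only a rule of thumb under stated assumptions: a hypothetical 7B-parameter model, and for training with Adam in fp32, 16 bytes per parameter (weights, gradients, and two moment estimates). Activations and caches are ignored.

```python
GIB = 2**30

def memory_gib(num_params, bytes_per_param):
    """Memory needed for `num_params` values at `bytes_per_param` bytes each."""
    return num_params * bytes_per_param / GIB

params = 7e9                                  # hypothetical 7B-parameter model
inference_fp16 = memory_gib(params, 2)        # fp16 weights only
inference_fp32 = memory_gib(params, 4)        # fp32 weights only
training_adam_fp32 = memory_gib(params, 16)   # weights + grads + Adam m and v

print(f"{inference_fp16:.1f} GiB, {inference_fp32:.1f} GiB, {training_adam_fp32:.1f} GiB")
```

Under these assumptions, fp16 inference for a 7B model needs about 13 GiB for the weights alone, so it fits on a 24 GB card, while naive fp32 training with Adam does not come close.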

MEERKAT: Audio-Visual Large Language Model for Grounding in Space and Time

ynnnxxi — Tue, 23 Sep 2025 16:49:42 +0900

MEERKAT: Audio-Visual Large Language Model for Grounding in Space and Time (ECCV 2024)

Chowdhury, Sanjoy, et al. "Meerkat: Audio-visual large language model for grounding in space and time." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

Abstract

Leveraging the remarkable capabilities of LLMs (Large Language Models), recent MLLM (Multimodal Large Language Model) research extends them to other modalities such as vision and audio. However, progress in this direction has mostly focused on tasks that require only a coarse understanding of audio-visual semantics. The paper presents MEERKAT, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally.

With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, it can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking.

The authors build the 'AVFIT' dataset and introduce MEERKATBENCH.

Introduction

Fig. 1: We present MEERKAT, an audio-visual LLM that can effectively ground both spatially and temporally in image and audio. Our model is adept in tasks that require fine-grained understanding such as Audio Referred Image Grounding, Image Guided (IG) Audio Temporal Localization & Audio-Visual (AV) Fact-checking. It can also be extended to perform coarse-grained tasks like AVQA & AV Captioning.

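- Audio Referred Image Grounding (left)

Given an image together with audio, find the object producing the sound and mark its exact location with a bounding box

- Audio-Visual Fact-checking / IG Audio Temporal Localization (center)

Find the time interval in which a specific sound occurs in the audio. Verify the consistency between the audio and visual content to judge whether a statement is true

- AV Captioning / AV Question Answering (right)

Answer a question by looking at the audio and image together. Describe the scene in a sentence using audio+image information (captioning)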

Table 1: Comparison of MEERKAT with recent Audio-Visual LLMs. 'Convention' refers to a collection of publicly available data that has been transformed using templates, 'GPT-Prompted' signifies if the generated instructions are obtained/refined employing GPT, and 'Robustness' is the model's ability to tackle negative samples. We compare our method against these approaches in Sec. 5.

The table shows that Meerkat supports spatial and temporal grounding at the same time, is trained end-to-end, and is the most comprehensive framework.

LLMs have shown outstanding performance on a variety of NLP tasks, reaching human-level accuracy in understanding and reasoning. They can also be combined with other modalities, vision in particular. Audio often complements the associated visual scene, yet it remains largely unexplored in the context of LLMs. Building multimodal LLMs that can listen could enable new applications.

As Table 1 shows, prior work has introduced audio into MLLMs, but most of it focuses on coarse-grained tasks such as captioning and question answering.

The authors' goal is to harness the power of LLMs for fine-grained audio-visual understanding. This is difficult for the following reasons.

1. Input and output formats are inconsistent across tasks.

2. No large-scale dataset exists for training an audio-visual LLM with grounding capability.

Existing LLMs are limited to coarse-grained tasks and do not include cross-modality fusion, a key component for achieving fine-grained understanding and reasoning.

MEERKAT is proposed to address these challenges.

MEERKAT is the first unified audio-visual LLM framework that can ground effectively both spatially in images and temporally in audio. The model has two modules that are key to its fine-grained understanding.

Modality alignment module: learns cross-modal alignment between image and audio patches in a weakly-supervised way, based on optimal transport
Cross-modal attention module: a module that enforces consistency in the cross-attention heatmaps

Combining the two modules yields a better joint audio-visual representation, which in turn improves downstream tasks.

Table 2: Task-wise dataset distribution, dataset details, and metrics. We collect AVFIT, which is a collection of 12 datasets. We denote dataset-wise train/test usage. The visual grounding datasets contain spatial bounding box annotations while the audio temporal localization contains time-interval annotations. We consider audio-visual fact-checking as a fine-grained task as it requires an understanding of spatio-temporal grounding information (refer to Sec. 5.2 for more details). Here B@4: BLEU@4, M: METEOR, R: ROUGE, C: CIDEr. For all our experiments we consider F1@0.5. (dagger): We obtain the bounding box from the segmentation maps.

To support Meerkat, the authors introduce MeerkatBench, which unifies five different audio-visual tasks.

To enable training on the five tasks, they also curate AVFIT, a large-scale dataset of 3 million instruction tuning samples of varying difficulty for learning fine-grained audio-visual semantics.

Contribution

Proposes MEERKAT, an audio-visual LLM with fine-grained spatio-temporal understanding that can ground in images and audio.
Introduces MeerkatBench, which unifies five audio-visual learning tasks, and AVFIT, a new large-scale instruction tuning dataset that enables fine-grained audio-visual semantic learning.
Achieves SOTA on all five benchmark tasks.

Methodology

Fig. 2: Overview of MEERKAT. Our model is equipped with fine-grained audio-visual comprehension abilities. When fed with image I, audio A pairs, the Audio-Visual Optimal Transport alignment (AVOpT) module (B) learns the patch-wise image-audio association to facilitate weak alignment between the two modalities by minimizing the patch-level Wasserstein distance. Subsequently, the Audio-Visual Attention Consistency Enforcement (AVACE) module (A) maximizes the region-level alignment by confining the cross-modal attention maps around the objects of interest and minimizing the association with the background. After tokenizing the text instruction, the inputs are passed to the instruction tuned Llama 2 model which serves as a unified interface for the downstream tasks. We employ a LoRA-based fine-tuning of the LLM.

Multi-modal Feature Extraction

<image encoder>

Given input images with batch size k, a pretrained CLIP-ViT-B/16 encoder is used to extract image embeddings.

<audio encoder>

The raw audio input is converted into audio embeddings. The CLAP audio transformer backbone serves as the audio encoder, and this pretrained encoder is used to extract meaningful audio representations.

<LLM>

MEERKAT adopts the open-source Llama 2-Chat (7B) large language model as its backbone. The pretrained LLM's tokenizer projects the text sequence T into embeddings. Before the image and audio embeddings are passed to the LLM, they go through additional linear layers (projection) so that the embedding dimensions match across modalities. Since the LLM serves as a unified interface for audio-visual input, the model relies on language tokens to carry out the individual tasks.

Audio-Visual Feature Alignment

<Audio-Visual Optimal Transport Alignment Module (AVOpT)>

Earth Mover Distance based algorithms, which involve Optimal Transport (OT) methods, have recently been used in siamese networks for patch-level alignment between query and support images. In vision-language models, OT based algorithms have also been used for patch-word alignment.

Because the image (CLIP) and audio (CLAP) encoders were trained separately, their learned embeddings live in different semantic spaces. Patch-level alignment of this kind can improve the semantic consistency between vision and audio. The authors show experimentally that this patch-level weak guidance outperforms global supervision. (appendix)

From a given image I and audio A pair, patch-level (local) feature embeddings are obtained for each. To model the cross-modal relation using the rich inherent semantic structure of these feature representations, two discrete distributions are built, one for the image and one for the audio. The optimal transport plan is identified while matching the two distributions. The Wasserstein distance (WD) between the two probability distributions is computed while preserving topological information during cross-domain alignment.
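
To make the transport-plan idea concrete, here is a small sketch that computes an approximate Wasserstein distance between image and audio patch embeddings. The notes do not say which solver the paper uses, so this swaps in entropy-regularised Sinkhorn iterations as a standard approximation; uniform patch weights, a cosine cost, and the `reg` / `n_iters` values are all assumptions.

```python
import numpy as np

def sinkhorn_distance(img_feats, aud_feats, reg=0.1, n_iters=200):
    """Approximate OT cost between two sets of patch embeddings (rows are patches)."""
    x = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    y = aud_feats / np.linalg.norm(aud_feats, axis=1, keepdims=True)
    cost = 1.0 - x @ y.T                        # cosine cost, shape (n, m)
    a = np.full(x.shape[0], 1.0 / x.shape[0])   # uniform mass on image patches
    b = np.full(y.shape[0], 1.0 / y.shape[0])   # uniform mass on audio patches
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):                    # alternate scaling of rows and columns
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]     # transport plan, shape (n, m)
    return float((plan * cost).sum()), plan

# Hypothetical embeddings: the audio patches are a permutation of the image patches.
img = np.eye(3)
aud = np.eye(3)[[1, 2, 0]]
distance, plan = sinkhorn_distance(img, aud)
print(distance, plan.argmax(axis=1))
```

Because every image patch has an exact counterpart among the audio patches, the distance is close to zero and the plan recovers the permutation, which is the kind of patch-level association the module is meant to learn.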

<Audio-Visual Attention Consistency Enforcement Module (AVACE)>

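Cross-modal interaction is essential for aligning the audio and visual modalities, and region-level supervision can encourage efficient localization. Inspired by the success of recent methods, an adapter-based cross-attention strategy is used for efficient sound source localization. The modality-specific features from AVOpT lack awareness of the other modality, which can be injected through cross-modal attention. The AVACE module is therefore proposed to enable audio-visual cross-modal reciprocity.

~~(the part I should focus on)~~

In multimodal models, cross-attention-based feature fusion is effective at attending to the relevant objects in an image, but it can produce inconsistencies, such as the attended regions spreading across the image and including background objects. This can be attributed to the quality of the interaction between the feature embeddings.

For example, if the CLAP audio encoder was pretrained on examples such as 'a man playing the violin' paired with violin audio, the cross-modal knowledge in the audio representation pushes attention toward both the man and the violin in the image. To ensure better region-level alignment, the cross-modality attention map is therefore confined within the boundary given by the GT bounding box of the object of interest.

The region inside the bounding box is masked as 1 and everything outside it as 0. Attention inside the bounding box is maximized and attention outside it is minimized.

Used as a loss

The part containing the object of interest is masked as 1 and the rest as 0 (a binary mask).

The mean attention inside the box should be large and the mean outside should be small.

Large mean attention inside the box → minimize '1 - (mean)'

Small mean attention outside the box → minimize the 'mean'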
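
The two terms above can be combined into one small loss function. This is a sketch of the idea as written in these notes, not necessarily the paper's exact formulation: the attention map is assumed to lie in [0, 1], the box is given in pixel coordinates as a half-open range, and the two terms are simply added with equal weight.

```python
import numpy as np

def attention_consistency_loss(attn_map, box):
    """(1 - mean attention inside the box) + (mean attention outside the box)."""
    x0, y0, x1, y1 = box
    mask = np.zeros_like(attn_map)
    mask[y0:y1, x0:x1] = 1.0                       # binary mask: 1 inside the GT box
    inside_mean = (attn_map * mask).sum() / mask.sum()
    outside_mean = (attn_map * (1.0 - mask)).sum() / (1.0 - mask).sum()
    return float((1.0 - inside_mean) + outside_mean)

box = (2, 2, 6, 6)                                 # hypothetical GT box on an 8x8 map
perfect = np.zeros((8, 8))
perfect[2:6, 2:6] = 1.0                            # all attention inside the box
inverted = 1.0 - perfect                           # all attention on the background
uniform = np.full((8, 8), 0.5)

loss_perfect = attention_consistency_loss(perfect, box)    # -> 0.0
loss_inverted = attention_consistency_loss(inverted, box)  # -> 2.0
loss_uniform = attention_consistency_loss(uniform, box)    # -> 1.0
print(loss_perfect, loss_inverted, loss_uniform)
```

Note that the attention is still computed over the whole image; the loss only penalises mass that falls outside the box, which pulls the attention map toward the object of interest during training.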

Overall training objective

cross-entropy loss + weak AV alignment loss + attention consistency loss

Numerical Representation of Box Location and Time Segment

Representation of Box Location

Bounding box locations are embedded in the natural language sequence as numerical values. A box is represented intuitively by its top-left and bottom-right corners. The values are normalized, with the normalization factor determined by the size of the image the bounding box belongs to. Depending on the task, these coordinates can appear in the input sequence or the output sequence.

Audio Referred Image Grounding task: MEERKAT predicts the bounding box of the object of interest

Audio-Visual Fact-checking: the text input to MEERKAT can contain box coordinates

Time interval information is embedded in natural language expressions as numerical values. A time interval is represented intuitively by its start and end points ([start, end]), which denote an event or activity. As with boxes, this representation can appear in the input or output sequence depending on the task.

Image Guided Audio Temporal Localization: the model predicts the time interval in which the query is likely to have occurred

Audio-Visual Fact-checking: the input sequence can contain a reference time window.
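
As an illustration of how boxes and time segments could be written into a text sequence, here is a small formatting sketch. The exact textual format used by MEERKAT is described in the paper's Appendix and is not reproduced here, so the bracket layout, the rounding, and the function names below are hypothetical.

```python
def box_to_text(box, image_size, ndigits=2):
    """Normalise (x0, y0, x1, y1) pixel corners by image width/height and format as text."""
    x0, y0, x1, y1 = box
    width, height = image_size
    values = (x0 / width, y0 / height, x1 / width, y1 / height)
    return "[" + ", ".join(f"{v:.{ndigits}f}" for v in values) + "]"

def segment_to_text(start, end, ndigits=1):
    """Format a [start, end] time interval (in seconds) as text."""
    return f"[{start:.{ndigits}f}, {end:.{ndigits}f}]"

box_text = box_to_text((64, 48, 320, 240), image_size=(640, 480))
segment_text = segment_to_text(2.0, 7.5)
print(box_text)       # [0.10, 0.10, 0.50, 0.50]
print(segment_text)   # [2.0, 7.5]
```

Strings like these can then be placed in the prompt (for fact-checking) or expected in the answer (for grounding and temporal localization), matching the input/output roles listed above.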

~~(See the Appendix for more details on the instruction preparation format)~~

MEERKATBENCH: A Unified Benchmark Suite for Fine-grained Audio-Visual Understanding

A new unification of fine-grained audio-visual tasks is introduced. To this end, the paper presents MEERKATBENCH, which consists of three fine-grained tasks and two coarse-grained tasks.

fine1. audio referred image grounding

fine2. image guided audio temporal localization

fine3. audio-visual fact-checking

coarse1. audio-visual question answering

coarse2. audio-visual captioning

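What I found most impressive, or rather most interesting, while reading this paper is that it draws a bounding box only around the object of interest and then restricts the cross-attention to that region. Normally cross-attention would also be spent on parts that are not needed, but here attention can go only to the parts that are really needed and wanted, which struck me as very efficient.

I thought, 'wouldn't it be nice to make good use of this idea and apply it?'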