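This post gives an overview of the paper, touches on recent developments in the VQA field, and includes some of my own views. Where something is easy to misunderstand, I try to explain it in detail. If you need to go deeper, I recommend reading the paper itself: https://jingchenchen.github.io/files/papers/2020/AAAI_Decom_VQA.pdf
I sincerely hope this post is helpful for your research.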

Preface

First, let me introduce the language prior problem in Visual Question Answering (VQA for short). In the paper's words:

Most existing Visual Question Answering (VQA) models overly rely on superficial correlations between questions and answers. For example, they may frequently answer “white” for questions about color, “tennis” for questions about sports, no matter what images are given with the questions.

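In short, given the training questions and images, the model does not learn to answer from the image. It simply relies on the answer distribution. For example, if "white" is the answer to 80% of "what color" questions, the model answers "white" whenever it sees such a question, without consulting the image at all, and still achieves high accuracy.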
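To make the problem concrete, here is a minimal sketch of a "blind" model that only memorizes the most frequent answer per question type. The data are made up for illustration (chosen to mirror the 80% figure above), and `fit_prior` and `accuracy` are hypothetical helper names, not anything from the paper.

```python
from collections import Counter, defaultdict

# Toy (question_type, answer) pairs; the counts are invented for illustration.
train = [("what color", "white")] * 8 + [("what color", "red")] * 2
test = [("what color", "white")] * 4 + [("what color", "blue")]

def fit_prior(pairs):
    """Return the most frequent answer for each question type."""
    counts = defaultdict(Counter)
    for qtype, answer in pairs:
        counts[qtype][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def accuracy(prior, pairs):
    """Score a model that never looks at the image."""
    hits = sum(prior.get(qtype) == answer for qtype, answer in pairs)
    return hits / len(pairs)

prior = fit_prior(train)
print(prior["what color"])    # white
print(accuracy(prior, test))  # 0.8
```

The blind model scores well only because the test answers follow the same distribution as the training answers. Once the two distributions differ, this shortcut stops working.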
Related work: there is already a fair amount of work on language priors, for example Overcoming Language Priors in VQA with Adversarial Regularization (NeurIPS 2018) and Don’t Just Assume; Look and Answer: Overcoming Priors for VQA (CVPR), among others. There is also our team's SIGIR 2019 work: Quantifying and Alleviating the Language Prior Problem in VQA. If you are interested, have a look at those papers.

Paper Overview

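How to solve the language prior problem has long been a major difficulty in VQA. This paper approaches it from the question side and extends the work Don’t Just Assume; Look and Answer: Overcoming Priors for VQA: it decomposes the question representation, removes the language prior introduced by the interrogative words, and then predicts the answer from the image information. Notably, it differs from the earlier Neural Module Networks, and it can clearly expose how the model arrives at its answer. Below I describe how the model runs in two parts, Question decomposition and Answer prediction.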

Question decomposition:

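A question-answer pair usually contains three pieces of information: question type, referring object, and expected concept. The authors therefore decompose the question into these three parts.
Because a question may or may not contain the expected concept, the authors handle two cases separately: yes/no questions and non-yes/no questions.
The resulting question representation is shown in the example figure below.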

Answer prediction:
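Note that the question type is used only to determine the answer set, that is, the set of all answers that belong to this type of question. It does not take part directly in the final answer prediction, which is why the language prior is reduced.

If the question is a yes/no question, its answer set is {yes, no} and every other answer is discarded. q_obj and the image features are then passed through up-down attention to locate the relevant region, and the attended region is finally combined with q_con for binary classification.
If the question is not a yes/no question, q_type is first used to predict the answer set. q_obj then attends over the image with soft attention to produce the final image representation (the same step as in the yes/no case). Finally, a score is computed for each answer in the answer set.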

Method

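The overview above describes the model only in broad strokes. Here I try to explain each module the authors designed in more detail, along with how each module is trained. If anything is unclear, take a close look at the figure above.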

First, the authors' own description:

The proposed method includes four modules:
(a) a language attention module parses a question into the type representation, the object representation, and the concept representation;
(b) a question identification module uses the type representation to identify the question type and possible answers;
(c) an object referring module uses the object representation to attend to the relevant region of an image;
(d) a visual verification module measures the relevance between the attended region and the concept representation to infer the answer.

language attention module

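As the figure above shows, this module has three parts: Type attention, Object attention, and Concept attention.

Type att: $w_a,t$ can be understood as the vector representation of the question type, and e denotes the word embeddings of the question, which has T words in total. The formula in the figure below uses soft attention to compute the vector representation of the question type. Note: to pin down which words make up the question type (the larger $\alpha_i^{type}$ is, the more likely word i is an interrogative word), the authors train this part with the question type labels already available in the VQA dataset.
**Object att & Concept att:** these two parts are almost identical to each other and similar to the type attention above, so I will not describe them further.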
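Since the original formula is not reproduced here, the following is only a rough sketch of soft attention over word embeddings. It assumes the weights come from a softmax over dot products between a learned vector (standing in for the post's $w_a,t$) and each word embedding; the paper's exact parameterization may differ. The 2-d embeddings and the vector `w_type` are invented, and `soft_attention` is a hypothetical helper name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(embeddings, w):
    """Weight each word embedding e_i by softmax(w . e_i) and sum them up."""
    alphas = softmax([dot(w, e) for e in embeddings])
    dim = len(embeddings[0])
    pooled = [sum(a * e[k] for a, e in zip(alphas, embeddings)) for k in range(dim)]
    return alphas, pooled

# Invented 2-d embeddings for the T = 5 words of "what color is the dog".
words = ["what", "color", "is", "the", "dog"]
E = [[2.0, 0.0], [1.5, 0.5], [0.0, 0.1], [0.0, 0.0], [0.0, 2.0]]
w_type = [1.0, -1.0]  # stand-in for a learned type-attention vector

alpha_type, q_type = soft_attention(E, w_type)
print(words[alpha_type.index(max(alpha_type))])  # the word with the largest weight
```

Under the same assumption, the object and concept attentions would reuse `soft_attention` with their own learned vectors, producing q_obj and q_con.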

Question Identification module

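First, q_type is fed to a classifier trained with cross-entropy to decide whether the question is a yes/no question.
If it is not a yes/no question, the module computes the Q&A mask, that is, the mask over the answer set (denoted m_q), which hides the answers that are not needed. The mask is computed as follows, and the KL divergence between the mask and the ground-truth answer set is minimized:

$s_q = q_{type} \cdot a_j,\quad m_q = \mathrm{sigmoid}(s_q)$

Here $a_j$ is the embedding of answer j, and q_type is the type representation computed earlier.
If it is a yes/no question, the answer set is simply {yes, no}.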
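The mask formula can be sketched in a few lines of plain Python. The answer embeddings `A` and the vector `q_type` below are invented for illustration, `answer_mask` is a hypothetical helper name, and the KL training objective is not included.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def answer_mask(q_type, answer_embeddings):
    """m_q[j] = sigmoid(q_type . a_j): soft membership of answer j in the answer set."""
    return [sigmoid(dot(q_type, a)) for a in answer_embeddings]

# Invented embeddings: color answers point one way, unrelated answers the other.
answers = ["white", "red", "tennis", "2"]
A = [[3.0, 0.0], [2.5, 0.2], [-3.0, 1.0], [-2.0, -2.0]]
q_type = [1.5, 0.0]  # pretend this encodes "what color"

m_q = answer_mask(q_type, A)
for name, m in zip(answers, m_q):
    print(name, round(m, 3))
```

With these made-up vectors the mask is close to 1 for the color answers and close to 0 for "tennis" and "2", which is the behavior the KL objective is meant to encourage.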

Object Referring Module

This module is the commonly used up-down soft attention.
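As a rough illustration of soft attention over image regions guided by q_obj, here is a simplified sketch. It scores each region with a tanh of the elementwise product followed by a linear projection, which is a simplification I chose for brevity and not necessarily the exact form used in the paper or in the original up-down model. All vectors are invented and `attend` is a hypothetical helper name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, q_obj, w):
    """Score region i as w . tanh(v_i * q_obj), softmax the scores, and pool the regions."""
    scores = [
        sum(wk * math.tanh(vk * qk) for wk, vk, qk in zip(w, v, q_obj))
        for v in regions
    ]
    alphas = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(a * v[k] for a, v in zip(alphas, regions)) for k in range(dim)]
    return alphas, pooled

# Three invented region features; q_obj "asks about" the first dimension.
regions = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
q_obj = [2.0, 0.0]
w = [1.0, 1.0]  # stand-in for a learned projection

alphas, v = attend(regions, q_obj, w)
print([round(a, 3) for a in alphas])
```

The pooled vector `v` plays the role of the attended image representation that the visual verification module consumes.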

Visual Verification Module

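For yes/no questions, q_con and the attended image representation v are used to predict a score:

$score = \mathrm{sigmoid}(q_{con} \cdot v)$
For non-yes/no questions, the module computes a score for each answer j:

$s_v = a_j \cdot v,\quad s_{vqa} = m_q \circ s_v$

where m_q is the answer-set mask computed earlier.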
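The two branches can be sketched as follows. All vectors and mask values are invented for illustration, the 0.5 decision threshold is my assumption, and `verify_yes_no` and `score_answers` are hypothetical helper names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def verify_yes_no(q_con, v):
    """score = sigmoid(q_con . v); answer 'yes' when the score exceeds 0.5 (assumed threshold)."""
    score = sigmoid(dot(q_con, v))
    return ("yes" if score > 0.5 else "no"), score

def score_answers(answer_embeddings, v, m_q):
    """s_vqa[j] = m_q[j] * (a_j . v): masked visual score of each candidate answer."""
    return [m * dot(a, v) for a, m in zip(answer_embeddings, m_q)]

v = [1.0, 0.5]  # attended image representation (invented)

# Yes/no branch.
label, score = verify_yes_no([2.0, 0.0], v)

# Non-yes/no branch: "tennis" matches the image features best, but the mask removes it.
answers = ["white", "red", "tennis"]
A = [[1.0, 0.0], [2.0, 1.0], [5.0, 5.0]]
m_q = [0.99, 0.98, 0.01]
s_vqa = score_answers(A, v, m_q)
best = answers[max(range(len(answers)), key=s_vqa.__getitem__)]
print(label, best)  # yes red
```

In this toy setup the unmasked score of "tennis" is the highest, yet the mask suppresses it because it does not belong to the answer set of a color question.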

Summary and Reflections

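This paper identifies a gap in current VQA work, namely the question representation. I had a similar idea myself, though not from the language prior angle: I designed a multi-task learning scheme in which the model handles each question type separately, but I got stuck on model complexity. This paper simplifies the task by splitting questions into only two classes, yes/no and non-yes/no. It also neatly builds everything around the central idea of alleviating the language prior, so it is a solid piece of work.
The final experimental results are also quite good.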

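One thing still puzzles me, though: accuracy on number-type questions in VQA-CP remains low and lags far behind the other two types. I will look into this further and hope to make some progress.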

[Paper Notes, AAAI 2020] Overcoming Language Priors in VQA via Decomposed Linguistic Representations
