Multiclass (class)

Multiclass classification: classification task with more than two classes. Each sample can only be labelled as one class.

Multilabel (label)

Multilabel classification: classification task labelling each sample with x labels from n_classes possible classes, where x can be 0 to n_classes inclusive.
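As a minimal sketch of the difference, the two tasks can be told apart by how the target vector is encoded: one-hot for multiclass, multi-hot for multilabel. The helper names `one_hot` and `multi_hot` and the toy class list are assumptions made for illustration; they are not part of any library.

```python
def one_hot(label, classes):
    """Multiclass target: exactly one position is 1."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Multilabel target: any number of positions (0..n_classes) may be 1."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


classes = ["cat", "dog", "bird"]          # hypothetical class list
multiclass_target = one_hot("dog", classes)
multilabel_target = multi_hot(["cat", "bird"], classes)
empty_target = multi_hot([], classes)     # x = 0 labels is allowed
```

In the multiclass case the target always sums to 1, while a multilabel target may sum to anything from 0 to n_classes.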

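Extremely large-scale tasks

When the number of classes is too large, the classifier has to be trained on a subset of the classes; the first step is candidate sampling.
Note that the term "sample" is used loosely: depending on context it can mean a record in the dataset or the label associated with a record.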

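Candidate sampling

For details on candidate sampling, see reference [3].
For a given training sample with label $t_i$, we sample a set of classes $S$ from the full class set $L$ to serve as the negative labels for that sample.

The negative label set must not contain the sample's positive label, to avoid a conflict.
A log-uniform sampler is used.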
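The two rules above can be sketched in plain Python. This is an illustrative sketch, not the implementation used in [2] or [3]: it assumes classes are numbered 0 to `range_max - 1` in decreasing order of frequency, uses the log-uniform probability $P(c) = (\log(c+2) - \log(c+1)) / \log(\text{range\_max}+1)$, and draws distinct negatives by simple rejection. The names `log_uniform_prob` and `sample_negatives` are hypothetical.

```python
import math
import random


def log_uniform_prob(class_id, range_max):
    """Probability of class_id under a log-uniform (Zipfian) sampler.

    Assumes classes are numbered 0..range_max-1 in decreasing frequency.
    """
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


def sample_negatives(true_label, num_sampled, range_max, rng):
    """Draw num_sampled distinct negative classes, never the true label."""
    if num_sampled > range_max - 1:
        raise ValueError("not enough classes to sample from")
    negatives = set()
    while len(negatives) < num_sampled:
        u = rng.random()
        # Inverse-CDF draw: the CDF at class k is log(k + 2) / log(range_max + 1).
        class_id = int(math.exp(u * math.log(range_max + 1))) - 1
        class_id = min(class_id, range_max - 1)  # guard against float rounding
        if class_id != true_label:               # reject the positive label
            negatives.add(class_id)
    return sorted(negatives)


rng = random.Random(0)
negatives = sample_negatives(true_label=3, num_sampled=5, range_max=1000, rng=rng)
```

Because low class ids receive most of the probability mass, frequent classes appear as negatives far more often than rare ones.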

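Mathematical principle

Normally we compute $P(t_i=y|x)$; with sampling we instead compute $P(t_i=y|x,C_i)$, where $C_i = S \cup \{t_i\}$ is the candidate set for the sample (its positive label together with the sampled negatives). Following [3], $F(x,y)$ is the model's score (logit) for class $y$ and $Q(y|x)$ is the probability that the sampler selects $y$. Therefore
$\text{Training Softmax Input} = F(x,y)-\log(Q(y|x)) \tag{1}$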
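To make equation (1) concrete, here is a small sketch of a softmax cross-entropy computed over the candidate set only, with each logit corrected by $-\log Q(y|x)$. The function name `sampled_softmax_loss` and the numeric values are assumptions for illustration, not taken from [3].

```python
import math


def sampled_softmax_loss(logits, sample_probs, target_index=0):
    """Cross-entropy over a candidate set using corrected logits.

    logits[i]       : F(x, y_i) for candidate i (true class plus sampled negatives)
    sample_probs[i] : Q(y_i | x), the probability that the sampler picks y_i
    """
    corrected = [f - math.log(q) for f, q in zip(logits, sample_probs)]
    # Numerically stable log-sum-exp over the candidates only.
    m = max(corrected)
    log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
    return log_z - corrected[target_index]


# Hypothetical values: candidate 0 is the true class, the rest are negatives.
logits = [2.0, 1.0, 0.5]
sample_probs = [0.1, 0.3, 0.05]
loss = sampled_softmax_loss(logits, sample_probs)
```

When every candidate has the same $Q$, the correction shifts all logits by one constant and the loss reduces to an ordinary softmax cross-entropy over the candidates.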

Zipfian Distribution

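See reference [4].
Zipf's law, named after the American linguist George Kingsley Zipf, states that in a natural-language corpus the frequency of a word is inversely proportional to a constant power of its rank $r$:
$P(r) \sim r^{-\alpha}$, that is, $P(r)=\frac {C}{r^{\alpha}}$, where $C$ is a normalizing constant.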
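For a finite vocabulary the constant $C$ follows from requiring the probabilities to sum to 1. The sketch below assumes a vocabulary of `num_ranks` words and a chosen exponent `alpha`; both values and the function name `zipf_pmf` are illustrative assumptions.

```python
def zipf_pmf(num_ranks, alpha=1.0):
    """Return [P(1), ..., P(num_ranks)] with P(r) = C / r**alpha.

    C is chosen so that the probabilities sum to 1 over a finite vocabulary.
    """
    weights = [r ** -alpha for r in range(1, num_ranks + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


pmf = zipf_pmf(num_ranks=1000, alpha=1.0)   # hypothetical vocabulary size
ratio = pmf[0] / pmf[1]                      # rank 1 vs rank 2
```

With `alpha = 1`, the word at rank 1 is twice as likely as the word at rank 2, three times as likely as the word at rank 3, and so on.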

Probability mass function

Figure: A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015) in a log-log scale.

log-log plot

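See reference [5].
For a power function of the form $y=ax^k$, taking logarithms of both sides gives $\log y=\log a+k\log x$. Setting $Y=\log y$ and $X=\log x$ yields $Y=mX+b$ with slope $m=k$ and intercept $b=\log a$, which is a linear function. Plotting with both axes on logarithmic scales therefore gives the figure below.

Figure: illustration of a log-log scale plot.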
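The same linearization can be used to recover the exponent of a power law from data: fit a straight line to $(\log x, \log y)$ and read off the slope and intercept. The following is a minimal sketch on synthetic, noise-free data; the function name `fit_power_law` and the data values are assumptions for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, k) for y = a * x**k."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    slope = sxy / sxx                      # slope of the line, equal to k
    intercept = mean_y - slope * mean_x    # intercept of the line, equal to log a
    return math.exp(intercept), slope


xs = [1, 2, 4, 8, 16]                      # hypothetical data on y = 3 * x**-1.5
ys = [3.0 * x ** -1.5 for x in xs]
a, k = fit_power_law(xs, ys)
```

On real rank-frequency data the points only approximately follow a line, so the fitted slope is an estimate of $-\alpha$ rather than an exact value.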

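Practice in papers

For the recall (candidate retrieval) task in recommender systems, [2] takes this approach, which can be regarded as representative of common practice.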

References

[1] scikit-learn, Multiclass and multilabel algorithms
[2] SDM: Sequential Deep Matching Model for Online Large-scale Recommender System
[3] TensorFlow reference, What is Candidate Sampling
[4] Wikipedia, Zipf's law
[5] Wikipedia, Log-log plot
[6] Blog, Statistical distributions: the Zipf distribution
