Circle Loss: A Unified Perspective of Pair Similarity Optimization


Abstract

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity sp and minimize the between-class similarity sn. We find a majority of loss functions, including the triplet loss and the softmax plus cross-entropy loss, embed sn and sp into similarity pairs and seek to reduce (sn − sp). Such an optimization manner is inflexible, because the penalty strength on every single similarity score is restricted to be equal. Our intuition is that if a similarity score deviates far from the optimum, it should be emphasized. To this end, we simply re-weight each similarity to highlight the less-optimized similarity scores. It results in a Circle loss, which is named due to its circular decision boundary. The Circle loss has a unified formula for two elemental deep feature learning approaches, i.e., learning with class-level labels and pair-wise labels. Analytically, we show that the Circle loss offers a more flexible optimization approach towards a more definite convergence target, compared with the loss functions optimizing (sn − sp). Experimentally, we demonstrate the superiority of the Circle loss on a variety of deep feature learning tasks. On face recognition, person re-identification, as well as several fine-grained image retrieval datasets, the achieved performance is on par with the state of the art.



Figure 1: Comparison between the popular optimization manner of reducing (sn − sp) and the proposed optimization manner of reducing (αn sn − αp sp). (a) Reducing (sn − sp) is prone to inflexible optimization (A, B and C all have equal gradients with respect to sn and sp), as well as ambiguous convergence status (both T and T′ on the decision boundary are acceptable). (b) With (αn sn − αp sp), the Circle loss dynamically adjusts its gradients on sp and sn, and thus benefits from a flexible optimization process. For A, it emphasizes increasing sp; for B, it emphasizes reducing sn. Moreover, it favors a specified point T on the circular decision boundary for convergence, setting up a definite convergence target.


1. Introduction

This paper holds a similarity optimization view towards two elemental deep feature learning approaches, i.e., learning from data with class-level labels and from data with pair-wise labels. The former employs a classification loss function (e.g., Softmax plus cross-entropy loss [25, 16, 36]) to optimize the similarity between samples and weight vectors. The latter leverages a metric loss function (e.g., triplet loss [9, 22]) to optimize the similarity between samples. In our interpretation, there is no intrinsic difference between these two learning approaches. They both seek to minimize between-class similarity sn, as well as to maximize within-class similarity sp.
From this viewpoint, we find that many popular loss functions (e.g., triplet loss [9, 22], Softmax loss and its variants [25, 16, 36, 29, 32, 2]) share a similar optimization pattern. They all embed sn and sp into similarity pairs and seek to reduce (sn − sp). In (sn − sp), increasing sp is equivalent to reducing sn. We argue that this symmetric optimization manner is prone to the following two problems.
• Lack of flexibility for optimization. The penalty strength on sn and sp is restricted to be equal. Given the specified loss functions, the gradients with respect to sn and sp are of the same amplitude (as detailed in Section 2). In some corner cases, e.g., sp is small and sn already approaches 0 (“A” in Fig. 1 (a)), it keeps on penalizing sn with a large gradient. It is inefficient and irrational.


• Ambiguous convergence status. Optimizing (sn − sp) leads to a decision boundary of sp − sn = m (with m the margin). Every point on this boundary is an equally acceptable convergence status, e.g., both T and T′ in Fig. 1 (a), leaving the final gap between sp and sn ambiguous and compromising the separability of the feature space.


Being simple, Circle loss intrinsically reshapes the characteristics of deep feature learning from the following three aspects:

First, a unified loss function. From the unified similarity pair optimization perspective, we propose a unified loss function for two elemental learning approaches, learning with class-level labels and with pair-wise labels.
Second, flexible optimization. During training, the gradient back-propagated to sn (sp) will be amplified by αn (αp). Those less-optimized similarity scores will have larger weighting factors and consequently get larger gradients. As shown in Fig. 1 (b), the optimizations on A, B and C are different from each other.
Third, definite convergence status. On the circular decision boundary, Circle loss favors a specified convergence status (“T” in Fig. 1 (b)), as demonstrated in Section 3.3. Correspondingly, it sets up a definite optimization target and benefits the separability.
The main contributions of this paper are summarized as follows:


• We propose Circle loss, a simple loss function for deep feature learning. By re-weighting each similarity score under supervision, Circle loss benefits deep feature learning with flexible optimization and a definite convergence target.
• We present Circle loss with compatibility to both class-level labels and pair-wise labels. Circle loss degenerates to triplet loss or Softmax loss with slight modifications.

• We conduct extensive experiments on a variety of deep feature learning tasks, e.g., face recognition, person re-identification, car image retrieval and so on. On all these tasks, we demonstrate the superiority of Circle loss, with performance on par with the state of the art.


2. A Unified Perspective

Deep feature learning aims to maximize the within-class similarity sp, as well as to minimize the between-class similarity sn. Under the cosine similarity metric, for example, we expect sp → 1 and sn → 0.
To this end, learning with class-level labels and learning with pair-wise labels are two paradigms of approaches and are usually considered separately. Given class-level labels, the first one basically learns to classify each training sample to its target class with a classification loss, e.g., L2-Softmax [21], Large-margin Softmax [15], Angular Softmax [16], NormFace [30], AM-Softmax [29], CosFace [32], ArcFace [2]. In contrast, given pair-wise labels, the second one directly learns pair-wise similarity in the feature space in an explicit manner, e.g., contrastive loss [5, 1], triplet loss [9, 22], Lifted-Structure loss [19], N-pair loss [24], Histogram loss [27], Angular loss [33], Margin based loss [38], Multi-Similarity loss [34] and so on.
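To make the two quantities concrete, here is a minimal NumPy sketch (illustrative, not from the paper) that splits an anchor's cosine similarities into within-class scores sp and between-class scores sn:

```python
import numpy as np

def split_similarities(feats, labels, anchor=0):
    """Split an anchor's cosine similarities into within-class (sp) and
    between-class (sn) scores -- the two quantities optimized in the text.
    All names here are illustrative."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats[anchor]                 # cosine similarity to anchor
    same = labels == labels[anchor]
    not_self = np.arange(len(labels)) != anchor  # the anchor is neither sp nor sn
    return sims[same & not_self], sims[~same]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
sp, sn = split_similarities(feats, labels)
print(sp.shape, sn.shape)  # two within-class and five between-class scores
```

With class-level labels the same split applies, except sp and sn are similarities to class weight vectors rather than to other samples.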


Figure 2: The gradients of the loss functions. (a) Triplet loss. (b) AMSoftmax loss. (c) The proposed Circle loss. Both triplet loss and AMSoftmax loss lack flexibility for optimization. The gradients with respect to sp (left) and sn (right) are restricted to be equal and undergo a sudden decrease upon convergence (the similarity pair B). For example, at A, the within-class similarity score sp already approaches 1 and still incurs a large gradient. Moreover, the decision boundaries are parallel to sp = sn, which allows ambiguous convergence. In contrast, the proposed Circle loss assigns different gradients to the similarity scores, depending on their distances to the optimum. For A (both sn and sp are large), Circle loss lays emphasis on optimizing sn. For B, since sn significantly decreases, Circle loss reduces its gradient and thus enforces a mild penalty. Circle loss has a circular decision boundary and promotes an accurate convergence status.


Given a single sample x in the feature space, assume that there are K within-class similarity scores and L between-class similarity scores associated with x, denoted as {s_p^i} (i = 1, 2, …, K) and {s_n^j} (j = 1, 2, …, L), respectively. To minimize each s_n^j as well as to maximize each s_p^i, we consider the unified loss function:

L_uni = log[1 + Σ_{i=1}^{K} Σ_{j=1}^{L} exp(γ(s_n^j − s_p^i + m))]
      = log[1 + Σ_{j=1}^{L} exp(γ(s_n^j + m)) Σ_{i=1}^{K} exp(−γ s_p^i)],   (1)

in which γ is a scale factor and m is a margin for better similarity separation. Given class-level labels, Eq. 1 degenerates to AM-Softmax [29, 32], an important variant of Softmax loss:


L_am = log[1 + Σ_{j=1}^{N−1} exp(γ(s_n^j + m)) exp(−γ s_p)]
     = −log[ exp(γ(s_p − m)) / ( exp(γ(s_p − m)) + Σ_{j=1}^{N−1} exp(γ s_n^j) ) ],   (2)

in which N is the number of training classes.

Moreover, with m = 0, Eq. 2 further degenerates to NormFace [30]. By replacing the cosine similarity with the inner product and setting γ = 1, it finally degenerates to Softmax loss (i.e., softmax plus cross-entropy loss).
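The degeneration chain above can be checked numerically. The following sketch (a hedged illustration; function names are our own) evaluates the unified loss with a single sp and m = 0, and confirms it coincides with softmax cross-entropy on the same similarity scores:

```python
import numpy as np

def unified_loss(sp, sn, gamma=1.0, m=0.0):
    """Unified pair-similarity loss sketch:
    log(1 + sum_j exp(gamma*(sn_j + m)) * sum_i exp(-gamma*sp_i))."""
    sp = np.atleast_1d(np.asarray(sp, float))
    sn = np.asarray(sn, float)
    return np.log1p(np.exp(gamma * (sn + m)).sum() * np.exp(-gamma * sp).sum())

def softmax_ce(sp, sn, gamma=1.0):
    """Softmax cross-entropy over logits [gamma*sp, gamma*sn_1, ...] with the
    within-class score as the target class."""
    logits = gamma * np.concatenate(([sp], sn))
    return np.log(np.exp(logits).sum()) - logits[0]

# With a single sp and m = 0, the unified loss coincides with softmax
# cross-entropy -- the NormFace / Softmax degeneration described above.
print(unified_loss(0.9, [0.3, 0.2, 0.1]), softmax_ce(0.9, [0.3, 0.2, 0.1]))
```

The equality holds for any γ, since log(1 + Σ exp(γ sn)·exp(−γ sp)) is algebraically identical to −log softmax on the target logit.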


Given pair-wise labels, the similarity scores are computed among the samples in a mini-batch, and Eq. 1 degenerates to triplet loss with hard mining as γ → +∞:

L_tri = lim_{γ→+∞} (1/γ) L_uni = max[ s_n^j − s_p^i + m ]_+,   (3)

in which [·]_+ denotes the “cut-off at zero” operation.

Specifically, we note that in Eq. 3, the “Σ exp(·)” operation is utilized by Lifted-Structure loss [19], N-pair loss [24], Multi-Similarity loss [34], etc., to conduct “soft” hard mining among samples. Enlarging γ gradually reinforces the mining intensity and, when γ → +∞, it results in the canonical hard mining in [22, 8].
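The “soft” hard mining effect of γ can be verified numerically: the smoothed margin (1/γ)·log(1 + Σ exp(γ(sn − sp + m))) approaches the hard-mined margin as γ grows. A small sketch (all values are illustrative):

```python
import numpy as np

def soft_margin(sp_list, sn_list, gamma, m=0.25):
    """(1/gamma) * log(1 + sum_{i,j} exp(gamma*(sn_j - sp_i + m))):
    a smoothed stand-in for the hardest-triplet margin."""
    diffs = np.array([sn - sp + m for sp in sp_list for sn in sn_list])
    return np.log1p(np.exp(gamma * diffs).sum()) / gamma

sp_list, sn_list = [0.7, 0.9], [0.4, 0.5]
hard = max(max(sn - sp + 0.25 for sp in sp_list for sn in sn_list), 0.0)
for gamma in (1, 10, 100, 1000):
    print(gamma, soft_margin(sp_list, sn_list, gamma))
print("hard-mined margin:", hard)  # the gamma -> infinity limit, here 0.05
```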



Gradient analysis. Eq. 2 and Eq. 3 show that triplet loss, Softmax loss and its several variants can all be interpreted as specific cases of Eq. 1. In other words, they all optimize (sn − sp). Under the toy scenario where there is only a single sp and a single sn, we visualize the gradients of triplet loss and AMSoftmax loss in Fig. 2 (a) and (b), from which we draw the following observations:

• First, before the loss reaches its decision boundary (upon which the gradients vanish), the gradients with respect to sp and sn are equal to each other. The status A has {sn, sp} = {0.8, 0.8}, indicating good within-class compactness. However, A still receives a large gradient with respect to sp. It leads to a lack of flexibility during optimization.
• Second, the gradients stay (roughly) constant before convergence and undergo a sudden decrease upon convergence. The status B lies closer to the decision boundary and is better optimized, compared with A. However, the loss functions (both triplet loss and AM-Softmax loss) enforce approximately equal penalties on A and B. It is another evidence of inflexibility.



These problems originate from the optimization manner of minimizing (sn − sp), in which reducing sn is equivalent to increasing sp. In the following Section 3, we will transfer such an optimization manner into a more general one to facilitate higher flexibility.
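The equal-penalty behavior is easiest to see on the hinge triplet loss L = [sn − sp + m]_+: wherever the loss is active, the gradient magnitudes on sn and sp are both exactly 1. A minimal sketch (m = 0.25 assumed for illustration):

```python
def triplet_loss_grads(sn, sp, m=0.25):
    """Hinge triplet loss L = [sn - sp + m]_+ and its (sub)gradients.
    Whenever the loss is active, dL/dsn = +1 and dL/dsp = -1: the penalty
    strength on sn and sp is always equal in magnitude."""
    active = (sn - sp + m) > 0
    loss = max(sn - sp + m, 0.0)
    return loss, (1.0 if active else 0.0), (-1.0 if active else 0.0)

# Regardless of how well sp is already optimized, the two similarity
# scores always receive gradients of identical magnitude (or both zero).
for sn, sp in [(0.8, 0.8), (0.4, 0.6), (0.1, 0.9)]:
    print((sn, sp), triplet_loss_grads(sn, sp))
```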


3. A New Loss Function

3.1. Self-paced Weighting

We consider enhancing the optimization flexibility by allowing each similarity score to learn at its own pace, depending on its current optimization status. We first neglect the margin item m in Eq. 1 and transfer the unified loss function (Eq. 1) into the proposed Circle loss by:


L_circle = log[1 + Σ_{j=1}^{L} exp(γ α_n^j s_n^j) · Σ_{i=1}^{K} exp(−γ α_p^i s_p^i)],   (4)

in which α_n^j and α_p^i are non-negative weighting factors. During training, the gradient back-propagated to s_n^j (s_p^i) is multiplied by α_n^j (α_p^i). Let O_p be the optimum of s_p^i and O_n be the optimum of s_n^j. When a similarity score deviates far from its optimum, it should get a large weighting factor. To this end, we define

α_p^i = [O_p − s_p^i]_+,
α_n^j = [s_n^j − O_n]_+,   (5)

in which [·]_+ is the “cut-off at zero” operation, ensuring that α_p^i and α_n^j are non-negative.

Re-scaling the cosine similarity under supervision is a common practice in modern classification losses [21, 30, 29, 32, 39, 40]. Conventionally, all the similarity scores share an equal scale factor γ. The non-normalized weighting operation in Circle loss can also be interpreted as a specific scaling operation. Different from the other loss functions, Circle loss re-weights (re-scales) each similarity score independently and thus allows different learning paces. We empirically show that Circle loss is robust to various γ settings in Section 4.5.

Discussions. We notice another difference beyond the scaling strategy. The output of the softmax function in a classification loss is conventionally interpreted as the probability of a sample belonging to a certain class. Since the probabilities are based on comparing each similarity score against all the similarity scores, equal re-scaling is a prerequisite for fair comparison. Circle loss abandons such a probability-related interpretation and instead holds a similarity pair optimization perspective. Correspondingly, it gets rid of the constraint of equal re-scaling and allows more flexible optimization.
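The re-weighting idea can be sketched in a few lines. Assuming, for illustration, optima O_p = 1 + m and O_n = −m with m = 0.25 (the value used in the experiments), the weight of a score grows with its distance from its optimum:

```python
import numpy as np

# Self-paced weighting sketch: alpha_p = [O_p - sp]_+ and alpha_n = [sn - O_n]_+,
# with assumed optima O_p = 1 + m and O_n = -m (m = 0.25, as in the experiments).
m = 0.25
O_p, O_n = 1.0 + m, -m

def alpha_p(sp):
    # large when sp is still far below its optimum O_p
    return np.maximum(O_p - sp, 0.0)

def alpha_n(sn):
    # large when sn is still far above its optimum O_n
    return np.maximum(sn - O_n, 0.0)

# For the pair {sn, sp} = {0.8, 0.8} (status A in Fig. 2): sp is close to its
# optimum while sn is far from its own, so sn receives the larger weight.
print(alpha_p(0.8), alpha_n(0.8))  # 0.45 vs 1.05
```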


3.2. Within-class and Between-class Margins

In loss functions optimizing (sn − sp), adding a margin m reinforces the optimization [15, 16, 29, 32]. Since sn and −sp are in symmetric positions, a positive margin on sn is equivalent to a negative margin on sp. It thus only requires a single margin m. In Circle loss, sn and sp are in asymmetric positions. Naturally, it requires respective margins for sn and sp, which is formulated by:


L_circle = log[1 + Σ_{j=1}^{L} exp(γ α_n^j (s_n^j − Δ_n)) · Σ_{i=1}^{K} exp(−γ α_p^i (s_p^i − Δ_p))],   (6)

in which Δ_n and Δ_p are the between-class and within-class margins, respectively. Basically, Eq. 6 expects s_p^i > Δ_p and s_n^j < Δ_n. Under the binary classification case (a single s_p and a single s_n), the decision boundary is achieved at

α_n (s_n − Δ_n) − α_p (s_p − Δ_p) = 0.   (7)

Circle loss has five hyper-parameters, i.e., O_p, O_n, γ, Δ_p and Δ_n. We reduce them by setting O_p = 1 + m, O_n = −m, Δ_p = 1 − m and Δ_n = m. Then the decision boundary in Eq. 7 becomes

(s_n − 0)^2 + (s_p − 1)^2 = 2m^2,   (8)

i.e., an arc of a circle, which gives Circle loss its name.

With the decision boundary defined in Eq. 8, we have another intuitive interpretation of Circle loss. It aims to optimize sp → 1 and sn → 0. The parameter m controls the radius of the decision boundary and can be viewed as a relaxation factor. In other words, Circle loss expects sp > 1 − m and sn < m.



Hence there are only two hyper-parameters, i.e., the scale factor γ and the relaxation margin m. We will experimentally analyze the impacts of m and γ in Section 4.5.
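Putting the pieces together, here is a minimal NumPy sketch of Circle loss for a single anchor (our own illustrative implementation; the weighting factors are treated as plain constants, which is what matters when porting this to an autograd framework):

```python
import numpy as np

def circle_loss(sp, sn, gamma=256, m=0.25):
    """Circle loss sketch for one anchor.

    sp: within-class similarity scores; sn: between-class similarity scores.
    In an autograd framework the weights alpha_p / alpha_n should be detached
    (no gradient flows through them); in this NumPy sketch that is implicit.
    """
    sp, sn = np.asarray(sp, float), np.asarray(sn, float)
    O_p, O_n = 1.0 + m, -m          # optima for sp and sn
    delta_p, delta_n = 1.0 - m, m   # within-class / between-class margins
    alpha_p = np.maximum(O_p - sp, 0.0)
    alpha_n = np.maximum(sn - O_n, 0.0)
    logit_p = -gamma * alpha_p * (sp - delta_p)
    logit_n = gamma * alpha_n * (sn - delta_n)
    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    return np.log1p(np.exp(logit_n).sum() * np.exp(logit_p).sum())

# A well-separated pair incurs a near-zero loss; a poorly separated one
# (sp = sn = 0.5) incurs a large loss.
print(circle_loss([0.95], [0.05]), circle_loss([0.5], [0.5]))
```

A production implementation would additionally use a numerically stable log-sum-exp for large γ.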


3.3. The Advantages of Circle Loss

The gradients of Circle loss with respect to s_n^j and s_p^i are derived as follows:


With the weighting factors α_n^j and α_p^i treated as constants during back-propagation, differentiating Eq. 6 gives

∂L_circle/∂s_n^j = γ α_n^j · exp(γ α_n^j (s_n^j − Δ_n)) · Σ_{i=1}^{K} exp(−γ α_p^i (s_p^i − Δ_p)) / exp(L_circle),   (9)

∂L_circle/∂s_p^i = −γ α_p^i · exp(−γ α_p^i (s_p^i − Δ_p)) · Σ_{j=1}^{L} exp(γ α_n^j (s_n^j − Δ_n)) / exp(L_circle).   (10)

In both cases, the gradient is amplified by the corresponding weighting factor, so less-optimized similarity scores receive larger gradients.

Under the toy scenario of binary classification (or only a single sn and sp), we visualize the gradients under different settings of m in Fig. 2 (c), from which we draw the following three observations:
• Balanced optimization on sn and sp. We recall that the loss functions minimizing (sn − sp) always have equal gradients on sp and sn and are inflexible. In contrast, Circle loss presents dynamic penalty strength. Among a specified similarity pair {sn, sp}, if sp is better optimized in comparison to sn (e.g., A = {0.8, 0.8} in Fig. 2 (c)), Circle loss assigns a larger gradient to sn (and vice versa), so as to decrease sn with higher priority. The experimental evidence of balanced optimization is presented in Section 4.6.

• Gradually-attenuated gradients. At the start of training, the similarity scores deviate far from the optimum and gain large gradients (e.g., “A” in Fig. 2 (c)). As the training gradually approaches convergence, the gradients on the similarity scores correspondingly decay (e.g., “B” in Fig. 2 (c)), enforcing mild optimization. The experimental result in Section 4.5 shows that the learning effect is robust to various settings of γ (in Eq. 6), which we attribute to the automatically-attenuated gradients.


• A (more) definite convergence target. Circle loss has a circular decision boundary and favors T rather than T′ (Fig. 1) for convergence. It is because T has the smallest gap between sp and sn, compared with all the other points on the decision boundary. In other words, T′ has a larger gap between sp and sn and is inherently more difficult to maintain. In contrast, losses that minimize (sn − sp) have a homogeneous decision boundary, that is, every point on the decision boundary is of the same difficulty to reach. Experimentally, we observe that Circle loss leads to a more concentrated similarity distribution after convergence, as detailed in Section 4.6 and Fig. 5.


4. Experiment

We comprehensively evaluate the effectiveness of Circle loss under two elemental learning approaches, i.e., learning with class-level labels and learning with pair-wise labels. For the former approach, we evaluate our method on face recognition (Section 4.2) and person re-identification (Section 4.3) tasks. For the latter approach, we use the fine-grained image retrieval datasets (Section 4.4), which are relatively small and encourage learning with pair-wise labels. We show that Circle loss is competent under both settings. Section 4.5 analyzes the impact of the two hyper-parameters, i.e., the scale factor γ in Eq. 6 and the relaxation factor m in Eq. 8. We show that Circle loss is robust under reasonable settings. Finally, Section 4.6 experimentally confirms the characteristics of Circle loss.


4.1. Settings

Face recognition. We use the popular dataset MS-Celeb-1M [4] for training. The native MS-Celeb-1M data is noisy and has a long-tailed distribution. We clean the dirty samples and exclude the tail identities (≤ 3 images per identity), resulting in 3.6M images and 79.9K identities. For evaluation, we adopt the MegaFace Challenge 1 (MFC1) [12], IJB-C [17], LFW [10], YTF [37] and CFP-FP [23] datasets and the official evaluation protocols. We also polish the probe set and 1M distractors on MFC1 for more reliable evaluation, following [2]. For data pre-processing, we resize the aligned face images to 112 × 112 and linearly normalize the pixel values of RGB images to [−1, 1] [36, 15, 32]. We only augment the training samples by random horizontal flip. We choose the popular residual networks [6] as our backbones. All the models are trained with 182k iterations. The learning rate starts at 0.1 and is reduced by 10× at 50%, 70% and 90% of the total iterations, respectively. The default hyper-parameters of our method are γ = 256 and m = 0.25 if not specified. For all the model inference, we extract the 512-D feature embeddings and use cosine distance as the metric.


Person re-identification. Person re-identification (re-ID) aims to spot the appearance of the same person in different observations. We evaluate our method on two popular datasets, i.e., Market-1501 [41] and MSMT17 [35]. Market-1501 contains 1,501 identities, 12,936 training images and 19,732 gallery images captured with 6 cameras. MSMT17 contains 4,101 identities and 126,411 images captured with 15 cameras, and presents a long-tailed sample distribution. We adopt two network structures, i.e., a global feature learning model backboned on ResNet50 and a part-feature model named MGN [31]. We use MGN in consideration of its competitive performance and relatively concise structure. The original MGN uses a Softmax loss on each part feature branch for training. Our implementation concatenates all the part features into a single feature vector for simplicity. For Circle loss, we set γ = 256 and m = 0.25.


Fine-grained image retrieval. We use three datasets for evaluation on fine-grained image retrieval, i.e., CUB-200-2011 [28], Cars196 [14] and Stanford Online Products [19]. Cars196 contains 16,183 images belonging to 196 classes of cars. The first 98 classes are used for training and the last 98 classes are used for testing. CUB-200-2011 has 200 different classes of birds. We use the first 100 classes with 5,864 images for training and the last 100 classes with 5,924 images for testing. SOP is a large dataset consisting of 120,053 images belonging to 22,634 classes of online products. The training set contains 11,318 classes with 59,551 images, and the remaining 11,316 classes with 60,499 images are used for testing. The experimental setup follows [19]. We use BN-Inception [11] as the backbone to learn 512-D embeddings. We adopt the P-K sampling strategy [8] to construct mini-batches with P = 16 and K = 5. For Circle loss, we set γ = 80 and m = 0.4.
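The P-K sampling strategy mentioned above can be sketched as follows (a simplified illustration; a production sampler would also handle classes with fewer than K samples, e.g., by sampling with replacement):

```python
import random
from collections import defaultdict

def pk_batch(labels, P=16, K=5, seed=0):
    """P-K sampling sketch: draw P classes, then K distinct samples per
    class, so every mini-batch contains both within-class (positive) and
    between-class (negative) pairs. `labels` maps sample index -> class id."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)
    # only classes with at least K samples are eligible in this simple sketch
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= K]
    batch = []
    for c in rng.sample(eligible, P):
        batch.extend(rng.sample(by_class[c], K))
    return batch

toy_labels = {i: i // 10 for i in range(400)}  # 40 toy classes, 10 samples each
print(len(pk_batch(toy_labels)))  # P * K = 80 samples per batch
```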

Table 1: Identification rank-1 accuracy (%) on MFC1 dataset with different backbones and loss functions.



Table 2: Face verification accuracy (%) on LFW, YTF and CFP-FP with ResNet34 backbone.



Table 3: Comparison of true accept rates (%) on the IJB-C 1:1 verification task.



Table 4: Evaluation of Circle loss on re-ID task. We report R-1 accuracy (%) and mAP (%).



4.2. Face Recognition

For the face recognition task, we compare Circle loss against several popular classification loss functions, i.e., vanilla Softmax, NormFace [30], AM-Softmax [29] (or CosFace [32]) and ArcFace [2]. Following the original papers [29, 2], we set γ = 64, m = 0.35 for AM-Softmax and γ = 64, m = 0.5 for ArcFace.
We report the rank-1 accuracy on the MegaFace Challenge 1 dataset (MFC1) in Table 1. On all three backbones, Circle loss marginally outperforms the counterparts. For example, with ResNet34 as the backbone, Circle loss surpasses the most competitive one (ArcFace) by +0.13%. With ResNet100 as the backbone, while ArcFace achieves a high rank-1 accuracy of 98.36%, Circle loss still outperforms it by +0.14%.
Table 2 summarizes the face verification results on LFW [10], YTF [37] and CFP-FP [23]. We note that performance on these datasets is already near saturation. Specifically, ArcFace is higher than AM-Softmax by +0.05%, +0.03% and +0.07% on the three datasets, respectively. Circle loss remains the best one, surpassing ArcFace by +0.05%, +0.06% and +0.18%, respectively.
We further compare Circle loss with AM-Softmax on the IJB-C 1:1 verification task in Table 3. Our implementation of ArcFace is unstable on this dataset and achieves abnormally low performance, so we do not compare Circle loss against ArcFace here. With ResNet34 as the backbone, Circle loss significantly surpasses AM-Softmax by +1.30% and +4.92% on “TAR@FAR=1e-4” and “TAR@FAR=1e-5”, respectively. With ResNet100 as the backbone, Circle loss still maintains considerable superiority.


4.3. Person Re-identification

We evaluate Circle loss on the re-ID task in Table 4. MGN [31] is one of the state-of-the-art methods and is featured for learning multi-granularity part-level features. Originally, it uses both Softmax loss and triplet loss to facilitate a joint optimization. Our implementations of “MGN (ResNet50) + AMSoftmax” and “MGN (ResNet50) + Circle loss” only use a single loss function for simplicity.
We make three observations from Table 4. First, comparing Circle loss against the state of the art, we find that Circle loss achieves competitive re-ID accuracy with a concise setup (no auxiliary loss functions). We note that “JDGL” is slightly higher than “MGN + Circle loss” on MSMT17 [35]. JDGL [42] uses a generative model to augment the training data, and significantly improves re-ID on the long-tailed dataset. Second, comparing “Circle loss” with “AMSoftmax”, we observe the superiority of Circle loss, which is consistent with the experimental results on the face recognition task. Third, comparing “ResNet50 + Circle loss” against “MGN + Circle loss”, we find that part-level features bring incremental improvement to Circle loss. It implies that Circle loss is compatible with the part-model specially designed for re-ID.


Table 5: Comparison with state of the art on CUB-200-2011, Cars196 and Stanford Online Products. R@K(%) is reported.



Figure 3: Impact of two hyper-parameters. In (a), Circle loss presents high robustness on various settings of scale factor γ. In (b), Circle loss surpasses the best performance of both AMSoftmax and ArcFace within a large range of relaxation factor m.


Figure 4: The change of sp and sn values during training. We linearly lengthen the curves within the first 2k iterations to highlight the initial training process (in the green zone). During the early training stage, Circle loss rapidly increases sp, because sp deviates far from the optimum at initialization and thus attracts a higher optimization priority.


4.4. Fine-grained Image Retrieval

We evaluate the compatibility of Circle loss with pair-wise labeled data on three fine-grained image retrieval datasets, i.e., CUB-200-2011, Cars196 and Stanford Online Products. On these datasets, the majority of methods [19, 18, 3, 20, 13, 34] adopt the encouraged setting of learning with pair-wise labels. We compare Circle loss against these state-of-the-art methods in Table 5. We observe that Circle loss achieves competitive performance on all three datasets. Among the competing methods, LiftedStruct [19] and Multi-Simi [34] are specially designed with elaborate hard mining strategies for learning with pair-wise labels. HDC [18], ABIER [20] and ABE [13] benefit from model ensembles. In contrast, the proposed Circle loss achieves performance on par with the state of the art without any bells and whistles.


4.5. Impact of the Hyper-parameters

Figure 5: Visualization of the similarity distribution after convergence. The blue dots mark the similarity pairs crossing the decision boundary during the whole training process. The green dots mark the similarity pairs after convergence. (a) AMSoftmax seeks to minimize (sn − sp). During training, the similarity pairs cross the decision boundary through a wide passage. After convergence, the similarity pairs scatter in a relatively large region of the (sn, sp) space. In (b) and (c), Circle loss has a circular decision boundary. The similarity pairs cross the decision boundary through a narrow passage and gather into a relatively concentrated region.


We analyze the impact of two hyper-parameters, i.e., the scale factor γ in Eq. 6 and the relaxation factor m in Eq. 8, on face recognition tasks.
The scale factor γ determines the largest scale of each similarity score. The concept of the scale factor is critical in many variants of the softmax loss. We experimentally evaluate its impact on the Circle loss and compare it with several other loss functions involving scale factors. We vary γ from 32 to 1024 for both AMSoftmax and the Circle loss. For ArcFace, we only set γ to 32, 64 and 128, as it becomes unstable with larger γ in our implementation. The results are visualized in Fig. 3. Compared with AMSoftmax and ArcFace, the Circle loss exhibits high robustness on γ. The main reason for this robustness is the automatic attenuation of gradients: as the training progresses, the similarity scores approach the optimum, so the weighting scales, along with the gradients, automatically decay, maintaining a mild optimization.
The relaxation factor m determines the radius of the circular decision boundary. We vary m from −0.2 to 0.3 (with 0.05 as the interval) and visualize the results in Fig. 3 (b). It is observed that under all the settings from −0.1 to 0.25, the Circle loss surpasses the best performance of both ArcFace and AMSoftmax, presenting a considerable degree of robustness.
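For concreteness, the unified formula (Eq. 6) under the Eq. 8 settings (Op = 1 + m, On = −m, Δp = 1 − m, Δn = m) can be sketched in NumPy. This is a minimal illustration of how γ and m enter the loss, not the authors' released code; `circle_loss` and the log-sum-exp helper are our own naming:

```python
import numpy as np

def _logsumexp(x):
    """Numerically stable log(sum(exp(x))), needed because gamma is large."""
    mx = np.max(x)
    return mx + np.log(np.exp(x - mx).sum())

def circle_loss(sp, sn, gamma=256.0, m=0.25):
    """Circle loss for one sample: sp are its within-class similarity
    scores, sn its between-class scores (both in [-1, 1])."""
    # Eq. 8: optima and margins all derived from the single relaxation m.
    Op, On = 1.0 + m, -m        # optima for sp and sn
    Dp, Dn = 1.0 - m, m         # margins (Delta_p, Delta_n)
    # Self-paced weights: a score far from its optimum gets a large weight,
    # and the weight (hence the gradient) decays as the score improves.
    ap = np.clip(Op - sp, 0.0, None)
    an = np.clip(sn - On, 0.0, None)
    logit_p = -gamma * ap * (sp - Dp)
    logit_n = gamma * an * (sn - Dn)
    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i)), computed stably.
    return np.logaddexp(0.0, _logsumexp(logit_n) + _logsumexp(logit_p))
```

With well-separated scores (sp near 1, sn near 0) every weight is small and the loss is nearly zero, which is the automatic gradient attenuation discussed above.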


4.6. Investigation of the Characteristics

Analysis of the optimization process. To intuitively understand the learning process, we show the change of sn and sp during the whole training process in Fig. 4, from which we draw two observations:
First, at initialization, all the sn and sp scores are small, because in the high-dimensional feature space randomized features are prone to be far away from each other [40, 7]. Correspondingly, sp gets significantly larger weights than sn, and the optimization on sp dominates the training, incurring a fast increase in similarity values in Fig. 4. This phenomenon evidences that the Circle loss maintains a flexible and balanced optimization.
Second, at the end of training, the Circle loss achieves both better within-class compactness and between-class discrepancy (on the training set), compared with AMSoftmax. Given that the Circle loss also achieves higher performance on the testing set, we believe this indicates better optimization.
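The initial dominance of sp can be checked directly from the Eq. 8 weighting rule, αp = [Op − sp]+ and αn = [sn − On]+. The numbers below are a toy illustration of our own, not measurements from the paper:

```python
m = 0.25
Op, On = 1 + m, -m  # optima for sp and sn under the Eq. 8 settings

# At random initialization both similarities are small, since randomized
# high-dimensional features tend to be far apart.
sp, sn = 0.05, 0.05
alpha_p = max(Op - sp, 0.0)   # weight on the within-class score
alpha_n = max(sn - On, 0.0)   # weight on the between-class score
# alpha_p (1.20) is 4x alpha_n (0.30): optimization initially focuses on sp,
# matching the fast early rise of sp observed in Fig. 4.
```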


Analysis of the convergence. We analyze the convergence status of the Circle loss in Fig. 5. We investigate two issues: how the similarity pairs consisting of sn and sp cross the decision boundary during training, and how the similarity pairs distribute in the (sn, sp) space after convergence. The results are shown in Fig. 5. In Fig. 5 (a), the AMSoftmax loss adopts its optimal setting of m = 0.35. In Fig. 5 (b), the Circle loss adopts a compromised setting of m = 0.325. The decision boundaries of (a) and (b) are tangent to each other, allowing an intuitive comparison. In Fig. 5 (c), the Circle loss adopts its optimal setting of m = 0.25. Comparing Fig. 5 (b) and (c) against Fig. 5 (a), we find that the Circle loss presents a relatively narrower passage on the decision boundary, as well as a more concentrated distribution after convergence (especially when m = 0.25). This indicates that the Circle loss facilitates more consistent convergence for all the similarity pairs than the AMSoftmax loss, confirming that it has a more definite convergence target, which promotes better separability in the feature space.
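The tangency of the two boundaries in Fig. 5 (a) and (b) follows from their equations: under the Eq. 8 settings the Circle loss boundary is the circle sn² + (sp − 1)² = 2m², centered at (sn, sp) = (0, 1) with radius √2·m, while the AMSoftmax boundary is the line sp − sn = m. A small worked check (our own sketch, not the paper's code):

```python
import math

def boundaries_tangent(m_circle, m_am, tol=1e-9):
    """True when the line sp - sn = m_am is tangent to the circular
    boundary, i.e., when the distance from the circle's center (0, 1)
    to the line equals the circle's radius sqrt(2) * m_circle."""
    dist = abs(1.0 - m_am) / math.sqrt(2.0)   # point-to-line distance
    radius = math.sqrt(2.0) * m_circle
    return abs(dist - radius) < tol

def on_circle(sn, sp, m, tol=1e-9):
    """True when (sn, sp) lies on the boundary sn^2 + (sp - 1)^2 = 2 m^2."""
    return abs(sn**2 + (sp - 1.0)**2 - 2.0 * m**2) < tol

# Fig. 5 settings: AMSoftmax with m = 0.35 and Circle loss with m = 0.325
# are tangent, since 1 - 0.35 = 2 * 0.325. The favored convergence target
# T = (sn, sp) = (m, 1 - m) lies on the circular boundary.
```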


5. Conclusion

This paper provides two insights into the optimization process for deep feature learning. First, a majority of loss functions, including the triplet loss and popular classification losses, conduct optimization by embedding the between-class and within-class similarity into similarity pairs. Second, within a similarity pair under supervision, each similarity score favors a different penalty strength, depending on its distance to the optimum. These insights result in the Circle loss, which allows the similarity scores to learn at different paces. The Circle loss benefits deep feature learning with high flexibility in optimization and a more definite convergence target. It has a unified formula for two elemental learning approaches, i.e., learning with class-level labels and learning with pair-wise labels. On a variety of deep feature learning tasks, e.g., face recognition, person re-identification, and fine-grained image retrieval, the Circle loss achieves performance on par with the state of the art.


References

[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539–546. IEEE, 2005.
[2] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[3] W. Ge. Deep metric learning with hierarchical triplet loss. In The European Conference on Computer Vision (ECCV), September 2018.
[4] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, 2016.
[5] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742. IEEE, 2006.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] L. He, Z. Wang, Y. Li, and S. Wang. Softmax dissection: Towards understanding intra- and inter-class objective for embedding learning. CoRR, abs/1908.01281, 2019.
[8] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[9] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[10] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[13] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV), September 2018.
[14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[16] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[17] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165. IEEE, 2018.
[18] H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[19] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
[20] M. Opitz, G. Waltner, H. Possegger, and H. Bischof. Deep metric learning with bier: Boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
[21] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[22] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[23] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[24] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
[25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
[26] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In The European Conference on Computer Vision (ECCV), September 2018.
[27] E. Ustinova and V. S. Lempitsky. Learning deep embeddings with histogram loss. In NIPS, 2016.
[28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[29] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[30] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.
