Longgong: Xianyu's Data Analysis Platform for Product Understanding

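{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Introduction"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Xianyu is primarily a C2C platform. Unlike business (B-side) sellers, consumer (C-side) sellers prefer a lightweight listing flow of photos plus a description, and they tend to have neither the motivation nor the expertise to fill in structured product information, which makes product understanding much harder for us. To collect more structured information at listing time, we began adding optional key-attribute fields to the original minimalist photo-plus-text flow. In practice, a moderate number of structured attribute options does not hurt the listing experience, yet it greatly improves our ability to understand products. One problem remains: choosing which structured attribute options to offer depends heavily on the experience of the industry operations teams, who lack real-time, multi-dimensional data analysis tools. Offline data reports can track some key metrics, but they lack the extensibility and performance needed for fine-grained, customized queries. To address this, we built the Longgong data analysis platform."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Positioning and Overall Architecture of the Longgong Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unlike a data report, Longgong was designed around the following requirements:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time analysis. When operations launches a new strategy, or a service problem causes online data to fluctuate, we want to analyze the coverage of structured category attributes over that period in real time, to help operations decide what to do next."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiple dimensions. Xianyu currently has more than 8,000 leaf categories, and the operations team for each industry has a different focus, so the platform must support customized analysis."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data management. Xianyu's category attributes, SPU data, and operations strategies need a single place where they are managed."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our goal is for structured data to feed back into operations, closing the loop between producing structured product data and applying it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/30\/59\/30c1f89fec2529a2b5cddd5097685c59.jpeg","al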
t":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The overall layered architecture is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/1a\/0c\/1aa25247fbb2d8ab7f24410c156ede0c.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Building the Data Pipeline"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data pipeline is the core of a data analysis platform. At Xianyu, structured data falls into two groups: online data, which users enter directly through the listing and editing flows, and offline data, which downstream algorithm models extract from a product's images and text. Building the pipeline involves the following key challenges:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• Large data volume (more than 2 billion records in total), high access QPS (more than 15,000), and strict service stability requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• Many data sources (more than 10) with heterogeneous formats and duplicate or conflicting records, plus strict freshness requirements (latency measured in seconds)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• 
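Complex analysis workloads (low QPS but highly complex SQL) that an ordinary database struggles to serve."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To handle the data volume and QPS, we chose tableStore to store structured product information. It is a typical column-store database with good scalability, high availability, and the capacity to serve more than 10,000 QPS on a single machine, which makes it well suited as the storage layer for big data. Its availability reaches 99.99%, and it supports a primary/standby dual-database setup."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Online data is stored in the product table in MySQL; a Java application listens for changes to that table and writes them to the source table. Offline data is passed to the algorithm modules through ODPS plus MQ, and Blink writes the algorithm results to the source table. Because data from multiple online and offline sources can be duplicated or conflicting (for example, algorithm A identifies a product as an iPhone 12 while algorithm B identifies it as an iPhone 11), we use a source table to store all raw data and a final table to store the data after processing and merging. The merge strategy can be decided by product managers and operations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For analysis we use ADB, an analytical database. ADB falls far short of tableStore in storage capacity and single-machine query QPS, but in running complex SQL, building indexes in real time, and isolating hot and cold data it offers performance that other databases cannot match, so it is a good choice for the analysis store."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/3c\/95\/3c5386bf9aec7c4d7e8a8cacbd3efa95.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ingesting Heterogeneous Offline Data Sources"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At Xianyu, structured data does not come only from what sellers fill in at listing time. As noted earlier, Xianyu's C-side sellers are far less skilled and diligent at filling in structured attributes than sellers on Taobao and Tmall, so we use multimodal image-and-text algorithms in the post-listing pipeline to add many structured attributes to products (these cpv currently account for about half of overall coverage, varying by category). Ingesting this offline data raises the following challenges:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• 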
Each data set has its own structure and production schedule and is large in volume, so a single pattern is hard to reuse and onboarding a new data source is costly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• Data synchronization jobs are scattered, which makes unified monitoring difficult."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these challenges, we designed a unified ingestion scheme for heterogeneous offline data sources:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/76\/7669a440f9dcabe92abf905e348cbba8.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each algorithm stores its offline output in ODPS, with its own data format and partitioning. An ODPS synchronization job therefore first consolidates the data from every source into a single standardized structured-tag table, idle_kgraph_std_source. The table schema is as follows:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9e\/9ebce3cdd29a5bcfc98004de8c5910ba.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The key column holds primary-key information. Because the primary key differs from scene to scene, it is designed as an open-ended key stored as JSON, in which each key is a primary-key column name and each value is the corresponding primary-key value. A Blink job synchronizes idle_kgraph_std_source in real time to the per-scene data tables in tableStore. The job dispatches records by their scene and source fields and routes each key in data to a different column of the tableStore table. To improve efficiency and reduce the number of database writes made by the Blink job, records are merged first: data from multiple sources for the same scene (for example, structured attribute data) is combined into a single record before it is written."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"
This scheme solved two problems of multi-source ingestion: data had no single point of entry and could not be monitored uniformly. In addition, the open-ended data format of the standard table lets new data sources be onboarded quickly, greatly reducing duplicated development work."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Processing and Merging"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once data from multiple sources has been collected, it must be processed and merged. The merge strategy is decided jointly by product managers and operations, and whenever the strategy changes, existing product data must be reprocessed and merged again. The processing pipeline must therefore be stable for incremental processing and fast for full reprocessing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/48\/48c8e1a0759c2aaed84c8806959b0751.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For full reprocessing we use a distributed task scheduling system. The master task node splits the full data set into multiple parts based on the database shards and sends the data indexes to the subtask nodes, which pull the data themselves. Data pulling and processing are therefore not constrained by the database's physical partitions or channels, which greatly improves performance: a full run over 600 million records currently takes only 40 minutes. The task distribution strategy is as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a1\/a18bdd880a6ab1b0de2a795d30efbd5f.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In summary, this design mainly addresses the following:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distributed task dispatch, so that full jobs are completed in a distributed manner."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":nul
l},"content":[{"type":"text","text":"Idempotent operations: an operation can be repeated without affecting the final result."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Isolation between full and incremental processing, so that online services are not affected."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Design of the Data Analysis Module"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data analysis involves many frequent queries that sort in ascending or descending order or filter by a metric. Running a complete data query for every analysis request would put considerable pressure on the database."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When designing the data analysis module, we therefore split the conditions of a request into two kinds:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dimension conditions: different dimensions require different query logic. A Distributor routes each analysis request to the appropriate processor for execution."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Filtering and sorting conditions: these do not change the query logic and only sort or filter the query results. In this case we first try to fetch the results from the cache, then sort and filter them in memory to improve analysis performance."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c5\/49\/c5e9bdce59b36d2b521594938e005649.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This approach effectively reduces the cost of analysis queries and improves average query efficiency by more than 50%."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Results"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Longgong has been rolled out to industry operations, Xianyu search, Xianyu home-page recommendation, and other scenarios, with the following results so far:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• For industry operations, it provides attribute-level data analysis across more than 8,000 leaf categories and helps operations decide which structured options to expose in Xianyu's main listing flow, helping them contribute 80% of category coverage and half of core cpv coverage to Xianyu's overall structured data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• For search, recommendation, and similar scenarios, it provides a fast way to query data, helping developers and algorithm engineers locate online problems in real time with latency measured in seconds. This has greatly improved the efficiency of tracing bad cases to their root cause."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b9\/b9e60b07a69ffbe48e39d60f31a1d563.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/94\/94927892a6516b6b692c1cdaef692b95.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e8\/e8602d68927e29ef0d4969362455c3b6.png","alt":"圖片","title":"null","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Outlook"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We aim to make Longgong a comprehensive, flexible, and accurate data platform for product understanding. Next, we will focus on the following improvements:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• 
Connect with overall listing and transaction data, integrate product diagnostics, offer analysis along more dimensions, and extend to more scenarios, helping more product managers and operations staff make decisions quickly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• Add more, and more intuitive, ways to present data, and improve the interface and UI design."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"• Add user-level data analysis and connect with the algorithms, feeding analysis results back so that the models can predict accurate, personalized category attributes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from: 閒魚技術 (ID: XYtech_Alibaba)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/7dXjeYYwQ1GQkK5KHLB2HA","title":"xxx","type":null},"content":[{"type":"text","text":"Longgong: Xianyu's Data Analysis Platform for Product Understanding"}]}]}]}