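Business Metrics Analysis | Computing PV, UV, and the Second-Jump Rate with a Single SQL Query

Scenario setup

Suppose there is a table track_log with the fields url, guid, sessionid, and ds (URL, globally unique identifier, session id, and date). Sample data is shown below. One guid can have multiple sessionids. One sessionid should correspond to exactly one guid, but the data contains invalid records in which a sessionid maps to several guids. The url field may be empty.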

The data can be downloaded from the link below.
Link: data source download
Extraction code: 837r

url     guid    sessionid       ds
http://www.yhd.com/?union_ref=7&cp=0    PR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER    VFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P        2015-08-28
http://my.yhd.com/order/finishOrder.do?orderCode=5435446505152  YJ25S3QAVPAS31PHSB3HFGZ1E5AYMKX9XUTX    6W26QM41DM6HHND3R4FP42YYXXE1NKGA        2015-08-28
http://list.yhd.com/p/c5072-b-a-s1-v0-p1-price-d0-pid-pt1086211-pl1171565-m0-k?tp=44.1086211.0.0.0.Kxnn54p-11-FFJKr     JRBWWU6ECXN15Q2Z5QT4TETNHKY7QHE3Y8B3    5Z5JZMYUGK9TP3QWHDDTU6G5T6PHEQRZ    2015-08-28
http://list.yhd.com/p/c5996-b-a-s1-v0-p1-price-d0-pid21496-pt1074467-pl1157690-m0-k?tp=44.1074467.0.0.0.KxnlcrD-11-EnNUs        37G1MDD68UF8K9XYGVCUA9WFNNR7C1133W9S    5TMZXMUKJWK76FNZMVE2TCM4UQW7ZNJH    2015-08-28
        2EC97A32-7C27-4F53-A122-B60AA9A987F7    PAU63A8H6A21F81NHTG2X4O9M08Y6148        2015-08-28
http://t.yhd.com/detailBrand/21782?tp=4.174850.m3022912.0.2.KxnmxAS-11-8nDA8    X491CWDNMC1YTEK7WRVUTQMZMXF4X63U54SC    25UYKQWJ13GB2E23SC1HH64HXV12TX3E        2015-08-28
        7878E64C-B06B-4188-9F77-C4C77C38FDBD    823JXQ1J6BH4L0C68I78R83TK9ZQ6VND        2015-08-28
        E5FC9449-3777-4103-BF0E-F402F74E2C00    EW935DJP6RWJUCDR8REOQ1YA808F6LBF        2015-08-28
        00000000-4049-1cca-e842-1f5049b8efdf    W2T8GEGK1JGHNGVCQRKP4A9S681VNBBF        2015-08-28
http://item.m.yhd.com/item/73725?tp=5006.0.1756.0.11.Kxnlg48-11-4lXQc   3CUMUY385J8ZH1Y72EA8C1W3WT9PT3C2PN4C    HEZ1RNMMBV1D4RQ38CB27J9KN95ZFNF4        2015-08-28
        98363275-13EB-412C-AF69-FC20D7ADF622    4M9F757E3AN7PB7UMK75P40DLCWEJ55V        2015-08-28
        7569A971-FC5E-4AA1-98B4-7F1BFCB50C96    9H540927X4PVFDYLI9X9836S2AN4F1O4        2015-08-28

Business requirement

Compute pv, uv, and the second-jump rate, producing output like the following:

date pv uv second_rate
2015-08-28 35880 16065 0.4

SQL analysis

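  • pv: count(url), the number of rows with a non-empty url. It can be computed in two steps: first count the rows for each sessionid, then sum those counts
  • uv: count(distinct guid)
  • Second-jump rate: number of sessionids with pv >= 2 / total number of sessionids
    • sessionids with pv >= 2: use case when … then … else … end
    • count(distinct case when pv>=2 then sessionid else null end) / count(distinct sessionid)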
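The three definitions above can be checked by hand on a tiny data set before writing any SQL. The following is a minimal sketch in plain Python; the rows are made up for illustration and are not taken from the real track_log data, and the rate is divided by the number of sessions, following the formula in the last bullet.

```python
from collections import Counter

# Hypothetical rows: (url, guid, sessionid). Values are invented for illustration.
rows = [
    ("http://a", "g1", "s1"),
    ("http://b", "g1", "s1"),
    ("http://c", "g1", "s2"),
    ("",         "g2", "s3"),  # empty url: dropped by the cleaning step
    ("http://d", "g2", "s4"),
    ("http://e", "g2", "s4"),
    ("http://f", "g3", "s5"),
]

# Data cleaning: keep only rows with a non-empty url.
valid = [r for r in rows if r[0]]

# pv, step 1: page views per session; step 2: sum them.
session_pv = Counter(sessionid for _, _, sessionid in valid)
pv = sum(session_pv.values())

# uv: number of distinct guids.
uv = len({guid for _, guid, _ in valid})

# Second-jump sessions: sessions with at least two page views.
second_count = sum(1 for n in session_pv.values() if n >= 2)
second_rate = round(second_count / len(session_pv), 2)

print(pv, uv, second_count, second_rate)
```

Here sessions s1 and s4 each have two page views, so two of the four remaining sessions count as second-jump sessions.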

Each metric can therefore be computed on its own as follows:

pv:
SELECT SUM(single_session_pv) AS pv
FROM (
	SELECT COUNT(*) AS single_session_pv
	FROM track_log
	WHERE length(url) > 0
	GROUP BY sessionid
) t;

uv:
SELECT COUNT(*)
FROM (
	SELECT guid
	FROM track_log
	WHERE length(url) > 0
	GROUP BY guid
) t;

Number of second-jump sessions:
SELECT COUNT(CASE 
		WHEN single_session_pv >= 2 THEN single_session_pv
		ELSE NULL
	END) AS second_count
FROM (
	SELECT COUNT(*) AS single_session_pv
	FROM track_log
	WHERE length(url) > 0
	GROUP BY sessionid
) t;

Combined into a single SQL query:

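SELECT ds, SUM(sid_pv) AS pv, COUNT(DISTINCT guid) AS uv
	-- sessions with pv >= 2, divided here by COUNT(DISTINCT guid) (uv);
	-- to divide by the number of sessions as in the analysis above, use COUNT(DISTINCT sid)
	, round(COUNT(CASE 
		WHEN sid_pv >= 2 THEN sid
		ELSE NULL
	END) / COUNT(DISTINCT guid), 2) AS second_rate
FROM (
	SELECT ds, MIN(guid) AS guid, sessionid AS sid
		, COUNT(*) AS sid_pv
	FROM track_log
	WHERE length(url) > 0	-- data cleaning: drop rows with an empty url
	GROUP BY ds, sessionid	-- one row per session per day
) t
GROUP BY ds;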
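To try the combined query without the real data set, it can be run against an in-memory SQLite database. The sketch below is only an approximation of the original environment: the rows are invented, SQLite is an assumption (the article does not name its SQL engine), and `1.0 *` is added because SQLite would otherwise perform integer division. The denominator is COUNT(DISTINCT guid), as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE track_log (url TEXT, guid TEXT, sessionid TEXT, ds TEXT)"
)
# Hypothetical rows, invented for illustration.
conn.executemany(
    "INSERT INTO track_log VALUES (?, ?, ?, ?)",
    [
        ("http://a", "g1", "s1", "2015-08-28"),
        ("http://b", "g1", "s1", "2015-08-28"),
        ("http://c", "g1", "s2", "2015-08-28"),
        ("",         "g2", "s3", "2015-08-28"),  # empty url, filtered out
        ("http://d", "g2", "s4", "2015-08-28"),
        ("http://e", "g2", "s4", "2015-08-28"),
        ("http://f", "g3", "s5", "2015-08-28"),
    ],
)

query = """
SELECT ds,
       SUM(sid_pv) AS pv,
       COUNT(DISTINCT guid) AS uv,
       -- 1.0 * forces floating-point division in SQLite
       ROUND(1.0 * COUNT(CASE WHEN sid_pv >= 2 THEN sid END)
             / COUNT(DISTINCT guid), 2) AS second_rate
FROM (
    SELECT ds, MIN(guid) AS guid, sessionid AS sid, COUNT(*) AS sid_pv
    FROM track_log
    WHERE length(url) > 0
    GROUP BY ds, sessionid
) t
GROUP BY ds
"""
result = conn.execute(query).fetchall()
print(result)
```

With these rows the inner query yields four sessions (s1 and s4 with two page views each), so pv is 6, uv is 3, and the rate is 2 / 3 rounded to two decimals.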

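Notes

  • Because the data may contain invalid records in which one session has several guids, MIN is applied to guid. In any case, without an aggregate on guid, most SQL engines would reject the query, since guid does not appear in the GROUP BY clause
  • Because one guid can have multiple sessions, guid has to be deduplicated with COUNT(DISTINCT guid) when computing uv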
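Both notes can be observed on a small example. The sketch below again assumes SQLite and uses invented rows: session s1 carries two different guids (the invalid case), and guid g1 owns two sessions. It compares uv counted directly on the raw rows with uv counted after collapsing each session to MIN(guid).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE track_log (url TEXT, guid TEXT, sessionid TEXT, ds TEXT)"
)
day = "2015-08-28"
# Hypothetical rows, invented for illustration.
conn.executemany(
    "INSERT INTO track_log VALUES (?, ?, ?, ?)",
    [
        ("http://a", "g1", "s1", day),
        ("http://b", "g9", "s1", day),  # invalid: same session, different guid
        ("http://c", "g1", "s2", day),  # g1 has a second session
        ("http://d", "g2", "s3", day),
    ],
)

per_session_sql = """
    SELECT sessionid, MIN(guid) AS guid, COUNT(*) AS sid_pv
    FROM track_log
    WHERE length(url) > 0
    GROUP BY ds, sessionid
"""

# One row per session; s1 keeps only the smallest of its guids.
per_session = conn.execute(per_session_sql + " ORDER BY sessionid").fetchall()

# uv on the raw rows counts the stray guid g9 as a separate visitor.
uv_raw = conn.execute(
    "SELECT COUNT(DISTINCT guid) FROM track_log WHERE length(url) > 0"
).fetchone()[0]

# uv after collapsing sessions, with and without DISTINCT.
uv_dedup, guid_rows = conn.execute(
    f"SELECT COUNT(DISTINCT guid), COUNT(guid) FROM ({per_session_sql}) t"
).fetchone()

print(per_session, uv_raw, uv_dedup, guid_rows)
```

Without DISTINCT, g1 would be counted once per session, which is why the outer query deduplicates guid.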