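Scenario setup
Suppose there is a table track_log with the fields url, guid, sessionid, and ds (page URL, globally unique visitor identifier, session ID, and date). Sample data is shown below. One guid can have multiple sessionids. A sessionid should map to exactly one guid, but the data contains invalid rows where one sessionid maps to several guids. The url field may be empty.
The data can be downloaded from the following link
Link: data source download
Extraction code: 837r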
url guid sessionid ds
http://www.yhd.com/?union_ref=7&cp=0 PR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER VFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P 2015-08-28
http://my.yhd.com/order/finishOrder.do?orderCode=5435446505152 YJ25S3QAVPAS31PHSB3HFGZ1E5AYMKX9XUTX 6W26QM41DM6HHND3R4FP42YYXXE1NKGA 2015-08-28
http://list.yhd.com/p/c5072-b-a-s1-v0-p1-price-d0-pid-pt1086211-pl1171565-m0-k?tp=44.1086211.0.0.0.Kxnn54p-11-FFJKr JRBWWU6ECXN15Q2Z5QT4TETNHKY7QHE3Y8B3 5Z5JZMYUGK9TP3QWHDDTU6G5T6PHEQRZ 2015-08-28
http://list.yhd.com/p/c5996-b-a-s1-v0-p1-price-d0-pid21496-pt1074467-pl1157690-m0-k?tp=44.1074467.0.0.0.KxnlcrD-11-EnNUs 37G1MDD68UF8K9XYGVCUA9WFNNR7C1133W9S 5TMZXMUKJWK76FNZMVE2TCM4UQW7ZNJH 2015-08-28
2EC97A32-7C27-4F53-A122-B60AA9A987F7 PAU63A8H6A21F81NHTG2X4O9M08Y6148 2015-08-28
http://t.yhd.com/detailBrand/21782?tp=4.174850.m3022912.0.2.KxnmxAS-11-8nDA8 X491CWDNMC1YTEK7WRVUTQMZMXF4X63U54SC 25UYKQWJ13GB2E23SC1HH64HXV12TX3E 2015-08-28
7878E64C-B06B-4188-9F77-C4C77C38FDBD 823JXQ1J6BH4L0C68I78R83TK9ZQ6VND 2015-08-28
E5FC9449-3777-4103-BF0E-F402F74E2C00 EW935DJP6RWJUCDR8REOQ1YA808F6LBF 2015-08-28
00000000-4049-1cca-e842-1f5049b8efdf W2T8GEGK1JGHNGVCQRKP4A9S681VNBBF 2015-08-28
http://item.m.yhd.com/item/73725?tp=5006.0.1756.0.11.Kxnlg48-11-4lXQc 3CUMUY385J8ZH1Y72EA8C1W3WT9PT3C2PN4C HEZ1RNMMBV1D4RQ38CB27J9KN95ZFNF4 2015-08-28
98363275-13EB-412C-AF69-FC20D7ADF622 4M9F757E3AN7PB7UMK75P40DLCWEJ55V 2015-08-28
7569A971-FC5E-4AA1-98B4-7F1BFCB50C96 9H540927X4PVFDYLI9X9836S2AN4F1O4 2015-08-28
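To try the queries that follow, the table can be declared along these lines. This is only a sketch: the source gives no schema, so the column types and lengths are assumptions, and the empty url is stored here as an empty string although it may be NULL in the real file. The two inserted rows are copied from the sample above.

```sql
-- Assumed schema: every field is stored as text; the lengths are guesses.
CREATE TABLE track_log (
    url       VARCHAR(2048),  -- page URL, may be empty
    guid      VARCHAR(64),    -- globally unique visitor identifier
    sessionid VARCHAR(64),    -- session ID
    ds        VARCHAR(10)     -- date as 'YYYY-MM-DD'
);

-- Two rows taken from the sample data; the second one has an empty url.
INSERT INTO track_log (url, guid, sessionid, ds) VALUES
    ('http://www.yhd.com/?union_ref=7&cp=0', 'PR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER', 'VFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P', '2015-08-28'),
    ('', '2EC97A32-7C27-4F53-A122-B60AA9A987F7', 'PAU63A8H6A21F81NHTG2X4O9M08Y6148', '2015-08-28');
```

The filter length(url) > 0 used in the queries drops both empty strings and NULLs, so either representation of an empty url is handled.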
Business requirement
Compute pv, uv, and the second-jump rate, in the following form:
date | pv | uv | second_rate |
---|---|---|---|
2015-08-28 | 35880 | 16065 | 0.4 |
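SQL analysis
- pv: count(url), i.e. the number of rows left after dropping rows with an empty url. It can be computed in two steps: first group by sessionid and count the rows of each session, then sum the per-session counts
- uv: count(distinct guid)
- Second-jump rate: number of sessionids with pv >= 2 / total number of sessionids
  - sessionids with pv >= 2: pick them out with case when … then … else … end
  - count(distinct case when pv>=2 then sessionid else null end) / count(distinct sessionid)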
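The second-jump expression can be checked on a handful of made-up rows before running it on the real table. In this sketch the three sessions s1, s2, s3 and their pv values are hypothetical; two of the three sessions have pv >= 2, so the ratio is 2 out of 3.

```sql
-- Hypothetical per-session page-view counts: s1 = 3, s2 = 1, s3 = 2.
SELECT COUNT(DISTINCT CASE WHEN pv >= 2 THEN sessionid ELSE NULL END) AS second_sessions,
       COUNT(DISTINCT sessionid) AS total_sessions,
       COUNT(DISTINCT CASE WHEN pv >= 2 THEN sessionid ELSE NULL END)
           / COUNT(DISTINCT sessionid) AS second_rate
FROM (
    SELECT 's1' AS sessionid, 3 AS pv
    UNION ALL SELECT 's2', 1
    UNION ALL SELECT 's3', 2
) t;
```

The CASE expression returns NULL for sessions with a single page view, and COUNT ignores NULLs, so only the qualifying sessions are counted. In dialects where dividing two integers yields an integer, one operand would need a cast to a decimal type first.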
Each metric can therefore be computed on its own as follows:
pv:
SELECT SUM(single_session_pv) AS pv
FROM (
    SELECT COUNT(*) AS single_session_pv
    FROM track_log
    WHERE length(url) > 0
    GROUP BY sessionid
) t;
uv:
SELECT COUNT(*)
FROM (
    SELECT guid
    FROM track_log
    WHERE length(url) > 0
    GROUP BY guid
) t;
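For comparison, uv can also be written directly in the count(distinct guid) form named in the analysis. This is a sketch of an alternative, not a replacement for the query above:

```sql
-- Same filter as above; duplicates are removed by DISTINCT instead of GROUP BY.
SELECT COUNT(DISTINCT guid) AS uv
FROM track_log
WHERE length(url) > 0;
```

The two forms agree as long as guid is never NULL. If NULL guids exist, the GROUP BY version counts the NULL group as one visitor, while COUNT(DISTINCT guid) skips it.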
Number of second-jump sessions:
SELECT COUNT(CASE
           WHEN single_session_pv >= 2 THEN single_session_pv
           ELSE NULL
       END) AS second_count
FROM (
    SELECT COUNT(*) AS single_session_pv
    FROM track_log
    WHERE length(url) > 0
    GROUP BY sessionid
) t;
Combined into a single SQL statement:
SELECT ds,
       SUM(sid_pv) AS pv,
       COUNT(DISTINCT guid) AS uv,
       -- sessions with pv >= 2 / total number of sessions
       round(COUNT(CASE
                 WHEN sid_pv >= 2 THEN sid
                 ELSE NULL
             END) / COUNT(DISTINCT sid), 2) AS second_rate
FROM (
    SELECT ds,
           MIN(guid) AS guid,
           sessionid AS sid,
           COUNT(*) AS sid_pv
    FROM track_log
    WHERE length(url) > 0       -- data cleaning: drop rows with an empty url
    GROUP BY ds, sessionid      -- one row per session per day
) t
GROUP BY ds;
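Notes
- Because the data may contain invalid rows where one session has several guids, the inner query takes MIN(guid) so that each session contributes exactly one guid. Some aggregate is needed in any case: guid is not in the GROUP BY clause, so selecting it without aggregation would raise an error
- Because one guid can have multiple sessions, uv has to deduplicate guids with COUNT(DISTINCT guid)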
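To see how much of the invalid data mentioned in the first note is actually present, the affected sessions can be listed with a query along these lines. This is a sketch; the alias guid_cnt is introduced here for illustration only.

```sql
-- Sessions that map to more than one guid (the invalid case).
SELECT sessionid,
       COUNT(DISTINCT guid) AS guid_cnt
FROM track_log
WHERE length(url) > 0
GROUP BY sessionid
HAVING COUNT(DISTINCT guid) > 1;
```

If this returns no rows, MIN(guid) in the combined query simply returns each session's only guid. If it does return rows, those sessions are attributed to the smallest guid, which is an arbitrary but deterministic choice.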