阿里雲RDS PostgreSQL時序數據的優化
- 2008年6月8日
- 版權聲明:本文爲博主chszs的原創文章,未經博主允許不得轉載。
PostgreSQL是最流行的開源數據庫之一。PostgreSQL的一個優勢是它可以通過多種方式進行優化,例如數據合併和數據清理。而數據合併和數據清理在一些應用場景下是必需的。
例如:
- 當數據表的具體內容被更新(插入、更新、刪除)時,我們需要先合併,然後迅速得到每個主鍵的最新值。
- 當我們有大量連續不斷地報告數據的傳感器時,我們需要及時收集每個傳感器的最新讀數。
- 我們可以使用窗口查詢進行這種操作,但我們需要快速檢索批量數據。
通常有四種優化時序數據的方法:
- 當只有很少的唯一值和一個未知的範圍時,我們可以使用遞歸(recursion)。
- 當只有很少的唯一值並且它們的範圍已經確定,我們可以使用子查詢(subquery)。
- 當有許多唯一值時,窗口查詢(Window query)比上述方法更合適。
- 流計算(Stream computing)是所有場景中最好的。
本文只會比較前三種方法。流計算不需要進行比較,因爲它是所有場景中最強大的方法。
遞歸 vs. 子查詢 vs. 窗口查詢
在比較中,將使用一個包含500萬條唯一值的數據庫作爲數據源,並在以下情況下比較這些方法。
遞歸
情景1、有大量有效的唯一值(100萬個唯一值)
第1步:創建一個表
\timing
drop table test;
create unlogged table test(id int , info text, crt_time timestamp);
第2步:構建數據
insert into test select ceil(random()*1000000), md5(random()::text), clock_timestamp() from generate_series(1,5000000);
第3步:創建一個索引
create index idx_test_1 on test (id, crt_time desc);
第4步:遞歸查詢效率
explain (analyze,verbose,timing,costs,buffers) with recursive skip as (
(
select test as v from test where id in (select id from test where id is not null order by id,crt_time desc limit 1) limit 1
)
union all
(
select (
select t as v from test t where t.id>(s.v).id and t.id is not null order by id,crt_time desc limit 1
) from skip s where (s.v).id is not null
) -- The "where (s.v).id is not null" must be included. Else you will be stuck in an infinite loop.
)
select (t.v).id, (t.v).info, (t.v).crt_time from skip t where t.* is not null;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on skip t (cost=54.35..56.37 rows=100 width=44) (actual time=0.042..6626.084 rows=993288 loops=1)
Output: (t.v).id, (t.v).info, (t.v).crt_time
Filter: (t.* IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=3976934
CTE skip
-> Recursive Union (cost=0.91..54.35 rows=101 width=69) (actual time=0.034..6006.615 rows=993289 loops=1)
Buffers: shared hit=3976934
-> Limit (cost=0.91..0.93 rows=1 width=69) (actual time=0.033..0.033 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
-> Nested Loop (cost=0.91..10.19 rows=500 width=69) (actual time=0.032..0.032 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
-> HashAggregate (cost=0.48..0.49 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=1)
Output: test_1.id
Group Key: test_1.id
Buffers: shared hit=4
-> Limit (cost=0.43..0.47 rows=1 width=12) (actual time=0.016..0.016 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Buffers: shared hit=4
-> Index Only Scan using idx_test_1 on public.test test_1 (cost=0.43..173279.36 rows=5000002 width=12) (actual time=0.015..0.015 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Index Cond: (test_1.id IS NOT NULL)
Heap Fetches: 1
Buffers: shared hit=4
-> Index Scan using idx_test_1 on public.test (cost=0.43..9.64 rows=6 width=73) (actual time=0.009..0.009 rows=1 loops=1)
Output: test.*, test.id
Index Cond: (test.id = test_1.id)
Buffers: shared hit=4
-> WorkTable Scan on skip s (cost=0.00..5.14 rows=10 width=32) (actual time=0.006..0.006 rows=1 loops=993289)
Output: (SubPlan 1)
Filter: ((s.v).id IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=3976926
SubPlan 1
-> Limit (cost=0.43..0.49 rows=1 width=81) (actual time=0.005..0.005 rows=1 loops=993288)
Output: t_1.*, t_1.id, t_1.crt_time
Buffers: shared hit=3976926
-> Index Scan using idx_test_1 on public.test t_1 (cost=0.43..102425.17 rows=1666667 width=81) (actual time=0.005..0.005 rows=1 loops=993288)
Output: t_1.*, t_1.id, t_1.crt_time
Index Cond: ((t_1.id > (s.v).id) AND (t_1.id IS NOT NULL))
Buffers: shared hit=3976926
Planning time: 0.354 ms
Execution time: 6706.105 ms
(45 rows)
情景二、只有很少有效的唯一值(1,000個唯一值)
第1步:創建一個表
\timing
drop table test;
create unlogged table test(id int , info text, crt_time timestamp);
第2步:構建數據
insert into test select ceil(random()*1000), md5(random()::text), clock_timestamp() from generate_series(1,5000000);
第3步:創建一個索引
create index idx_test_1 on test (id, crt_time desc);
第4步:遞歸查詢效率
Query statement stays unchanged
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on skip t (cost=55.09..57.11 rows=100 width=44) (actual time=0.046..8.859 rows=1000 loops=1)
Output: (t.v).id, (t.v).info, (t.v).crt_time
Filter: (t.* IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=4007
CTE skip
-> Recursive Union (cost=0.91..55.09 rows=101 width=69) (actual time=0.039..8.203 rows=1001 loops=1)
Buffers: shared hit=4007
-> Limit (cost=0.91..1.67 rows=1 width=69) (actual time=0.038..0.038 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
-> Nested Loop (cost=0.91..6335.47 rows=8333 width=69) (actual time=0.038..0.038 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
-> HashAggregate (cost=0.48..0.49 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=1)
Output: test_1.id
Group Key: test_1.id
Buffers: shared hit=4
-> Limit (cost=0.43..0.47 rows=1 width=12) (actual time=0.016..0.017 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Buffers: shared hit=4
-> Index Only Scan using idx_test_1 on public.test test_1 (cost=0.43..173279.55 rows=5000002 width=12) (actual time=0.015..0.015 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Index Cond: (test_1.id IS NOT NULL)
Heap Fetches: 1
Buffers: shared hit=4
-> Index Scan using idx_test_1 on public.test (cost=0.43..6284.98 rows=5000 width=73) (actual time=0.015..0.015 rows=1 loops=1)
Output: test.*, test.id
Index Cond: (test.id = test_1.id)
Buffers: shared hit=4
-> WorkTable Scan on skip s (cost=0.00..5.14 rows=10 width=32) (actual time=0.008..0.008 rows=1 loops=1001)
Output: (SubPlan 1)
Filter: ((s.v).id IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=3999
SubPlan 1
-> Limit (cost=0.43..0.49 rows=1 width=81) (actual time=0.007..0.007 rows=1 loops=1000)
Output: t_1.*, t_1.id, t_1.crt_time
Buffers: shared hit=3999
-> Index Scan using idx_test_1 on public.test t_1 (cost=0.43..102425.80 rows=1666667 width=81) (actual time=0.007..0.007 rows=1 loops=1000)
Output: t_1.*, t_1.id, t_1.crt_time
Index Cond: ((t_1.id > (s.v).id) AND (t_1.id IS NOT NULL))
Buffers: shared hit=3999
Planning time: 0.353 ms
Execution time: 8.980 ms
(45 rows)
子查詢
情景1、有大量有效的唯一值(100萬個唯一值)
第1步:子查詢查詢效率
如果ID的值範圍過寬,子查詢效率會很低下。
需要維護一個唯一的ID表。這裏我們使用generate_series作爲測試的替代。
explain (analyze,verbose,timing,costs,buffers) select (select test from test where id=t.id order by crt_time desc limit 1) from generate_series(1,1000000) t(id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series t (cost=0.00..1976.65 rows=1000 width=32) (actual time=70.682..2835.109 rows=1000000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 1000000)
Buffers: shared hit=3997082
SubPlan 1
-> Limit (cost=0.43..1.97 rows=1 width=77) (actual time=0.002..0.002 rows=1 loops=1000000)
Output: test.*, test.crt_time
Buffers: shared hit=3997082
-> Index Scan using idx_test_1 on public.test (cost=0.43..9.64 rows=6 width=77) (actual time=0.002..0.002 rows=1 loops=1000000)
Output: test.*, test.crt_time
Index Cond: (test.id = t.id)
Buffers: shared hit=3997082
Planning time: 0.119 ms
Execution time: 2892.712 ms
(14 rows)
情景二、只有很少有效的唯一值(1,000個唯一值)
第1步:子查詢查詢效率
查詢語句更改爲
explain (analyze,verbose,timing,costs,buffers) select (select test from test where id=t.id order by crt_time desc limit 1) from generate_series(1,1000) t(id);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series t (cost=0.00..1699.41 rows=1000 width=32) (actual time=0.107..7.041 rows=1000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 1000)
Buffers: shared hit=4000
SubPlan 1
-> Limit (cost=0.43..1.69 rows=1 width=77) (actual time=0.006..0.007 rows=1 loops=1000)
Output: test.*, test.crt_time
Buffers: shared hit=4000
-> Index Scan using idx_test_1 on public.test (cost=0.43..6284.98 rows=5000 width=77) (actual time=0.006..0.006 rows=1 loops=1000)
Output: test.*, test.crt_time
Index Cond: (test.id = t.id)
Buffers: shared hit=4000
Planning time: 0.131 ms
Execution time: 7.126 ms
(14 rows)
窗口查詢
情景1、有大量有效的唯一值(100萬個唯一值)
第1步:窗口查詢效率
explain (analyze,verbose,timing,costs,buffers) select id,info,crt_time from (select row_number() over (partition by id order by crt_time desc) as rn, * from test) t where rn=1;
postgres=# explain (analyze,verbose,timing,costs,buffers) select id,info,crt_time from (select row_number() over (partition by id order by crt_time desc) as rn, * from test) t where rn=1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..310779.41 rows=25000 width=45) (actual time=0.027..6398.308 rows=993288 loops=1)
Output: t.id, t.info, t.crt_time
Filter: (t.rn = 1)
Rows Removed by Filter: 4006712
Buffers: shared hit=5018864
-> WindowAgg (cost=0.43..248279.39 rows=5000002 width=53) (actual time=0.026..5973.497 rows=5000000 loops=1)
Output: row_number() OVER (?), test.id, test.info, test.crt_time
Buffers: shared hit=5018864
-> Index Scan using idx_test_1 on public.test (cost=0.43..160779.35 rows=5000002 width=45) (actual time=0.019..4058.476 rows=5000000 loops=1)
Output: test.id, test.info, test.crt_time
Buffers: shared hit=5018864
Planning time: 0.121 ms
Execution time: 6446.901 ms
(13 rows)
情景二、只有很少有效的唯一值(1,000個唯一值)
第1步:窗口查詢效率
查詢語句保持不變
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..310779.61 rows=25000 width=45) (actual time=0.027..6176.801 rows=1000 loops=1)
Output: t.id, t.info, t.crt_time
Filter: (t.rn = 1)
Rows Removed by Filter: 4999000
Buffers: shared hit=4744850 read=18157
-> WindowAgg (cost=0.43..248279.58 rows=5000002 width=53) (actual time=0.026..5822.576 rows=5000000 loops=1)
Output: row_number() OVER (?), test.id, test.info, test.crt_time
Buffers: shared hit=4744850 read=18157
-> Index Scan using idx_test_1 on public.test (cost=0.43..160779.55 rows=5000002 width=45) (actual time=0.020..4175.082 rows=5000000 loops=1)
Output: test.id, test.info, test.crt_time
Buffers: shared hit=4744850 read=18157
Planning time: 0.108 ms
Execution time: 6176.924 ms
(13 rows)
橫向效率比較圖
結論
隨着物聯網的興起,越來越多的業務會使用時序數據,在必須根據這些數據提供服務的情況下,計算數據中的最新值和滑動窗口中的值是至關重要的。
PostgreSQL是開源數據庫的最佳選擇,因爲它爲相同的問題提供了幾種解決方案。參考數據優化方法,我們可以得出結論:
- 遞歸適用於只有少量唯一值但範圍未知的情況。
- 子查詢適用於唯一值很少且範圍已確定的情況。
- 當有大量唯一值時,窗口查詢比子查詢更合適。
- 流計算是所有場景的最佳選擇。