SQLAlchemy in 查詢空列表問題分析

問題場景

有model Account，SQLAlchemy 查詢語句如下：

query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())

這裏 uids 如果爲空，執行查詢會有如下警告：

/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:35: SAWarning: The IN-predicate on "account.id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
  return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)

這裏的意思是使用一個空的列表會花費較長的時間，需要優化以提高性能。

爲什麼會有這個提示呢？一個空列表爲什麼會影響性能呢？

首先打印 query 可得到如下 sql 語句：

SELECT *   // 字段使用 “*” 代替
FROM account
WHERE account.id != account.id ORDER BY account.date_created DESC

會發現生成的語句中過濾條件是 WHERE account.id != account.id，使用 PostgreSQL Explain ANALYZE 命令，

EXPLAIN：顯示PostgreSQL計劃程序爲提供的語句生成的執行計劃。
ANALYZE：收集有關數據庫中表的內容的統計信息。

分析查詢成本結果如下：

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE account.id != account.id ORDER BY account.date_created DESC;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Sort  (cost=797159.14..808338.40 rows=4471702 width=29) (actual time=574.002..574.002 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=573.991..573.991 rows=0 loops=1)
         Filter: (id <> id)
         Rows Removed by Filter: 4494173
 Planning time: 0.162 ms
 Execution time: 574.052 ms
(8 rows)

先看Postgresql提供的語句生成的執行計劃，通過結果可以看到，雖然返回值爲空，但是查詢成本卻還是特別高，執行計劃部分幾乎所有的時間都耗費在排序上，但是和執行時間相比，查詢計劃的時間可以忽略不計。（結果是先遍歷全表，查出所有數據，然後再使用 Filter: (id <> id) 把所有數據過濾。）

按照這個思路，有兩種查詢方案：

如果 account_ids 爲空，那麼直接返回空列表不進行任何操作，查詢語句變爲：

if account_ids:
    query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())

如果 account_ids 爲空，那麼過濾方式，查詢語句變爲：

query = Account.query
if account_ids:
    query = query.filter(Account.id.in_(account_ids))
else:
    query = query.filter(False)
    
query = query.order_by(Account.date_created.desc())

如果 account_ids 爲空，此時生成的 SQL 語句結果爲：

SELECT *
FROM account
WHERE 0 = 1 ORDER BY account.date_created DESC

分析結果爲：

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE 0 = 1 ORDER BY account.date_created DESC;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Sort  (cost=77987.74..77987.75 rows=1 width=29) (actual time=0.011..0.011 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Result  (cost=0.00..77987.73 rows=1 width=29) (actual time=0.001..0.001 rows=0 loops=1)
         One-Time Filter: false
         ->  Seq Scan on account  (cost=0.00..77987.73 rows=1 width=29) (never executed)
 Planning time: 0.197 ms
 Execution time: 0.061 ms
(8 rows)

可以看到，查詢計劃和執行時間都有大幅提高。

一個測試

如果只是去掉方案1排序，查看一下分析結果

使用 PostgreSQL Explain ANALYZE 命令分析查詢成本結果如下：

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE account.id != account.id;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=550.999..550.999 rows=0 loops=1)
   Filter: (id <> id)
   Rows Removed by Filter: 4494173
 Planning time: 0.134 ms
 Execution time: 551.041 ms

可以看到，時間和有排序時差別不大。

如何計算查詢成本

執行一個分析，結果如下：

postgres=> explain select * from account where date_created ='2016-04-07 18:51:30.371495+08';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00..127716.33 rows=1 width=211)
   Filter: (date_created = '2016-04-07 18:51:30.371495+08'::timestamp with time zone)
(2 rows)

EXPLAIN引用的數據是：

0.00 預計的啓動開銷(在輸出掃描開始之前消耗的時間，比如在一個排序節點裏做排續的時間)。
127716.33 預計的總開銷。
1 預計的該規劃節點輸出的行數。
211 預計的該規劃節點的行平均寬度(單位：字節)。

這裏開銷(cost)的計算單位是磁盤頁面的存取數量，如1.0將表示一次順序的磁盤頁面讀取。其中上層節點的開銷將包括其所有子節點的開銷。這裏的輸出行數(rows)並不是規劃節點處理/掃描的行數，通常會更少一些。一般而言，頂層的行預計數量會更接近於查詢實際返回的行數。
這裏表示的就是在只有單 CPU 內核的情況下，評估成本是127716.33;

計算成本，Postgresql 首先看錶的字節數大小

這裏 account 表的大小爲：

postgres=> select pg_relation_size('account');

pg_relation_size
------------------
        737673216
(1 row)

查看塊的大小

Postgresql 會爲每個要一次讀取的快添加成本點，使用 show block_size查看塊的大小：

postgres=> show block_size;

block_size
------------
 8192
(1 row)

計算塊的個數

可以看到每個塊的大小爲8kb，那麼可以計算從表從讀取的順序塊成本值爲：

blocks = pg_relation_size/block_size = 90048

90048 是account 表所佔用塊的數量。

查看每個塊需要的成本

postgres=> show seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)

這裏的意思是 Postgresql 爲每個塊分配一個成本點，也就是說上面的查詢需要從90048個成本點。

處理每條數據 cpu 所需時間

cpu_tuple_cost：處理每條記錄的CPU開銷（tuple：關係中的一行記錄）
cpu_operator_cost：操作符或函數帶來的CPU開銷。

postgres=> show cpu_operator_cost;
 cpu_operator_cost
-------------------
 0.0025
(1 row)

postgres=> show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

計算

cost 計算公式爲：

cost = 磁盤塊個數 塊成本（1） + 行數 cpu_tuple_cost（系統參數值）+ 行數 * cpu_operator_cost

現在用所有值來計算explain 語句中得到的值：

number_of_records = 3013466  # account 表 count

block_size = 8192  # block size in bytes

pg_relation_size=737673216

blocks = pg_relation_size/block_size = 90048

seq_page_cost = 1
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

cost = blocks * seq_page_cost + number_of_records * cpu_tuple_cost + number_of_records * cpu_operator_cost

如何降低查詢成本？

直接回答，使用索引。

postgres=> explain select * from account where id=20039;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Index Scan using account_pkey on account  (cost=0.43..8.45 rows=1 width=211)
   Index Cond: (id = 20039)
(2 rows)

通過這個查詢可以看到，在使用有索引的字段查詢時，查詢成本顯著降低。

索引掃描的計算比順序掃描的計算要複雜一些。它由兩個階段組成。
PostgreSQL會考慮random_page_cost和cpu_index_tuple_cost 變量，並返回一個基於索引樹的高度的值。

SQLAlchemy in 查詢空列表問題分析

問題場景

一個測試

如何計算查詢成本

計算成本，Postgresql 首先看錶的字節數大小

查看塊的大小

計算塊的個數

查看每個塊需要的成本

處理每條數據 cpu 所需時間

計算

如何降低查詢成本？

參考鏈接

druid數據源 xml配置

學習單元測試，告別祈禱式編程

如何通過MD5反查身份證號？

Redis 選擇hash還是string 存儲數據？

PostgreSQL JSONB 使用入門

併發模型：線程與鎖

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結