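Background:
The database contains duplicate rows. They need to be cleaned up so that only one row from each set of duplicates remains.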
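Conclusion:
Optimization: on a table with about a million rows, finding and deleting the duplicates went from 5423 seconds to about 2 seconds.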
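Optimization process:
From material found by searching:
Delete the redundant duplicate records in a table (duplicates judged on multiple fields), keeping only the record with the smallest rowid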
delete from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq
from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae
group by peopleId,seq having count(*)>1)
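To see what the quoted pattern does on a handful of rows, here is a minimal sketch using Python's built-in sqlite3 module. It is only a stand-in: SQLite's implicit rowid plays the role of Oracle's ROWID, the table contents and the `note` column are invented for illustration, and nothing here says anything about Oracle's performance.

```python
import sqlite3

# Hypothetical miniature of the vitae table; SQLite's implicit rowid
# stands in for Oracle's ROWID (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitae (peopleId INTEGER, seq INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO vitae (peopleId, seq, note) VALUES (?, ?, ?)",
    [
        (1, 1, "first copy"),
        (1, 1, "second copy"),
        (1, 1, "third copy"),
        (2, 1, "unique row"),
        (3, 7, "first copy"),
        (3, 7, "second copy"),
    ],
)

# Same shape as the quoted statement: pick the duplicated keys,
# then spare the smallest rowid within each duplicated key.
conn.execute(
    """
    DELETE FROM vitae
    WHERE (peopleId, seq) IN (
              SELECT peopleId, seq FROM vitae
              GROUP BY peopleId, seq HAVING COUNT(*) > 1)
      AND rowid NOT IN (
              SELECT MIN(rowid) FROM vitae
              GROUP BY peopleId, seq HAVING COUNT(*) > 1)
    """
)

remaining = conn.execute(
    "SELECT peopleId, seq, note FROM vitae ORDER BY peopleId, seq"
).fetchall()
print(remaining)
```

Each duplicated key keeps the row that was inserted first, and the row with a unique key is left alone.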
Based on that material, I wrote the first version of the SQL statement:
delete from lcfyjttz where
(fdate, ffjdm, flcdm, ffytype, fgsbz) in(
select fdate,ffjdm,flcdm, ffytype, fgsbz
from lcfyjttz group by fdate,ffjdm,flcdm, ffytype, fgsbz having count(1) > 1)
and rowid not in(
select min(rowid) as rid from lcfyjttz
group by fdate,ffjdm,flcdm, ffytype, fgsbz having count(1) > 1 )
With about a million rows, this statement took 5423 seconds to run, the figure quoted in the conclusion.
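That result was alarming. As someone who expects an answer within about 3 seconds, I could not put up with it, so I set about optimizing the statement. After a painful stretch of trial and error, I finally got there.
Optimized SQL: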
DELETE
FROM
LCFYJTTZ c
WHERE
EXISTS (
SELECT
a.ROWID
FROM
LCFYJTTZ a,
(
SELECT
fdate,
ffjdm,
flcdm,
ffytype,
fgsbz,
MIN( ROWID ) rid
FROM
lcfyjttz
GROUP BY
fdate,
ffjdm,
flcdm,
ffytype,
fgsbz
HAVING
count( 1 ) > 1
) b
WHERE
a.FDATE = b.FDATE
AND a.FFJDM = b.FFJDM
AND a.FLCDM = b.FLCDM
AND a.ffytype = b.FFYTYPE
AND a.fgsbz = b.fgsbz -- fgsbz is part of the GROUP BY key, so it has to be matched too
AND a.ROWID != b.rid
AND c.ROWID = a.ROWID
)
This version ran in about 2 seconds, the figure quoted in the conclusion.
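I learned a lot along the way, for example how the in and exists keywords are used. I also tried the with...as syntax, but it did not make it into the final statement.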
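As a small check that an IN-based delete and an EXISTS-based delete can remove the same rows, the sketch below runs both forms against identical toy copies of lcfyjttz in SQLite (Python's built-in sqlite3). The row values are invented, SQLite's rowid stands in for Oracle's ROWID, and the EXISTS form here joins on all five grouping columns and correlates directly on the target table rather than through a second alias. It only compares results and says nothing about timings in Oracle.

```python
import sqlite3

KEY = "fdate, ffjdm, flcdm, ffytype, fgsbz"

# Invented sample rows: rows 1-3 share a key, rows 4-5 share a key that
# differs from the first only in fgsbz, and row 6 is unique.
ROWS = [
    ("2020-01", "A", "L1", "T", "0"),
    ("2020-01", "A", "L1", "T", "0"),
    ("2020-01", "A", "L1", "T", "0"),
    ("2020-01", "A", "L1", "T", "1"),
    ("2020-01", "A", "L1", "T", "1"),
    ("2020-02", "B", "L2", "T", "0"),
]

DELETE_WITH_IN = f"""
    DELETE FROM lcfyjttz
    WHERE ({KEY}) IN (
              SELECT {KEY} FROM lcfyjttz GROUP BY {KEY} HAVING COUNT(1) > 1)
      AND rowid NOT IN (
              SELECT MIN(rowid) FROM lcfyjttz GROUP BY {KEY} HAVING COUNT(1) > 1)
"""

DELETE_WITH_EXISTS = f"""
    DELETE FROM lcfyjttz
    WHERE EXISTS (
        SELECT 1
        FROM (SELECT {KEY}, MIN(rowid) AS rid
              FROM lcfyjttz GROUP BY {KEY} HAVING COUNT(1) > 1) b
        WHERE lcfyjttz.fdate = b.fdate
          AND lcfyjttz.ffjdm = b.ffjdm
          AND lcfyjttz.flcdm = b.flcdm
          AND lcfyjttz.ffytype = b.ffytype
          AND lcfyjttz.fgsbz = b.fgsbz
          AND lcfyjttz.rowid != b.rid)
"""


def surviving_rowids(delete_sql):
    """Build a fresh toy table, run one delete statement, return the rowids left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lcfyjttz "
        "(fdate TEXT, ffjdm TEXT, flcdm TEXT, ffytype TEXT, fgsbz TEXT)"
    )
    conn.executemany("INSERT INTO lcfyjttz VALUES (?, ?, ?, ?, ?)", ROWS)
    conn.execute(delete_sql)
    return [r[0] for r in conn.execute("SELECT rowid FROM lcfyjttz ORDER BY rowid")]


print(surviving_rowids(DELETE_WITH_IN))
print(surviving_rowids(DELETE_WITH_EXISTS))
```

Rows 4 and 5 are the interesting case: because they differ from rows 1 to 3 only in fgsbz, a join that left that column out would match them against the wrong group.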