Database Kernel Talk (18): Self-Driving Database Systems

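{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is part of the database kernel series."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this installment we jump on the bandwagon and talk about a hot topic: self-driving. Major internet companies have recently been announcing that they are going all in on self-driving and starting to build cars. Wait, isn't this a series about database internals? It is. Starting with this installment, the series introduces a database management system that can drive itself: self-driving database management, one of the research areas of Andy Pavlo, a database professor at CMU."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The project is called NoisePage ("},{"type":"link","attrs":{"href":"https:\/\/noise.page\/about\/?fileGuid=GjGyPpKQKkJdDHvt","title":"","type":null},"content":[{"type":"text","text":"https:\/\/noise.page\/about\/"}]},{"type":"text","text":"). NoisePage is a database system built from scratch with autonomous management as its goal. It uses machine learning to tune system parameters and optimize the system. The aspects it currently covers include, but are not limited to, physical database design (how to build indexes and materialized views, how to shard data), system parameter tuning, SQL statement tuning, and hardware scaling strategy. The research focuses on building the system that supports this kind of self-driving tuning and on forecasting future query workloads so that it can plan ahead. The end goal is a database system that needs as little human intervention as possible (er... sounds like it is coming for the DBA's job)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Several academic papers on the project have already been published. In addition, Lin Ma, one of Professor Pavlo's PhD students, recently gave an up-to-date presentation in the Vaccination Database Talks series that Professor Pavlo organizes (https:\/\/www.youtube.com\/watch?v=YqW9Pq5488s&t=2047s). It is well worth watching."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this installment we study the first paper, Self-Driving Database Management Systems, published at the Conference on Innovative Data Systems Research (CIDR) 
2017. It gives an overview of the whole self-driving database system and reports the progress made at that time."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Introduction"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Databases were born, and SQL became popular, to hide the complex logic of storing and managing data so that programmers could use SQL, a declarative language, to describe what data they want. Over decades of development, however, database systems have grown more and more complex, with more and more things to configure and optimize. This gave rise to a dedicated profession, the DBA (Database Administrator). A DBA's main responsibility is to make sure the database system supports the workloads of the existing business logic well. Everything from storage and node configuration down to setting a single GUC (PostgreSQL's term for a configuration parameter) to optimize a SQL statement falls within the DBA's remit (when I was in college, DBA was a hot job, and Oracle DBAs were rumored to be very well paid). With the arrival of distributed and cloud-native databases, though, the new trend is a return to simplicity: make the database system as foolproof as possible so that it runs business logic well without complex configuration. There are two reasons. First, systems are becoming more complex and iterating faster, so becoming a good DBA is increasingly difficult. Second, with advances in artificial intelligence, we believe machines can optimize operations algorithmically and do it better than humans. Humans are lazy; pursuing automation and maximum efficiency is in our nature."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A lot of research is already under way in this area. The paper points out, however, that most earlier work focuses on a single aspect, such as better physical database design, index design, storage format design, or partition schemes; other work aims to recommend optimal GUC settings to make SQL statements run faster. The paper also identifies shortcomings of existing work: most tuning tools are not native to the database system (they are built externally); they mostly optimize reactively and cannot proactively adapt to changes in the queries; and most optimizations are local and do not consider what is best from a global perspective."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The research in this paper, by contrast, looks at how to manage the entire database system autonomously from a holistic perspective. It also proposes a new architecture in which the self-driving components are built inside the database system. The system used in the paper is Peloton, a database system that the CMU database group built itself for trying out and supporting its research."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What problems must be solved to make a database self-driving"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first challenge is forecasting future workloads, that is, which SQL statements will be executed. If these statements can be predicted accurately, what does that buy us? 
Consider that even just knowing whether the workload is online transaction processing (OLTP) or online analytical processing (OLAP) already lets the database optimize: for an OLTP workload the system can choose a row store to optimize writes, whereas for an OLAP workload it can choose a column store to optimize reads. If the workload is mixed, one workable approach is to deploy two systems, using the row store to optimize writes and then syncing the data to a column-store database to serve reads. Another approach is to deploy an HTAP (hybrid transaction-analytical processing) database that handles both reads and writes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond forecasting future workloads, what other information helps optimize the database configuration? The second thing that comes to mind is the resource usage of those statements, for example how much CPU and memory a SQL statement needs. If this can be predicted accurately, the database can prepare in advance so that performance does not suffer. This is similar to how a DBA usually schedules database cleanup or upgrade operations for when business demand is lowest, such as late at night."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given forecasts of future workloads and resource utilization, what optimizations can the database system support? The paper gives three broad categories:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Physical design: for example, adding or dropping indexes and materialized views to speed up queries or inserts, or choosing a row store or a column store to optimize writes or reads."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Data 
design (data storage): separating hot and cold data or sharding the data according to what the queries need."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Runtime (runtime optimization): for example, optimizing SQL statements, tuning knobs, or deciding from the query volume whether new compute nodes need to be added."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/49\/491360dc76c6038527c55d11c664c5ff.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper also stresses a very important point: a self-optimizing database system must support changing its configuration dynamically so that an optimized configuration takes effect immediately. If a database system had to restart to apply a configuration change, the benefit of self-driving optimization would be greatly reduced."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To summarize, the paper argues that a self-driving database 
should be able to forecast workloads and resource usage, offer a range of optimization actions, and support dynamic configuration updates."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Self-driving architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do you build a system that does all of the above? The architecture the paper describes is based on Peloton, the system built in-house by the CMU database group. The figure below shows the architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/63\/630c4bf492fa8d9e25b148a3c54660f4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Self-driving vehicles today are driven by artificial intelligence, and a self-driving database is no exception. What matters most for AI is data, so the whole architecture is built around data. Let us walk through the architecture diagram one component at a time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first component is the workload monitor, which collects data. It gathers information about every workload that runs: besides the SQL statements themselves, it records the amount of data touched, resource usage, and so on. It also periodically samples the database system's hardware metrics over time, such as CPU, memory, and I\/O utilization."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the raw data is collected, the second component handles workload classification. The paper uses unsupervised 
learning to group statements with similar characteristics together; the machine learning term for this is clustering. Clustering acts as deduplication: only one model needs to be kept for each group of similar statements, which makes forecasting easier. The method used in the paper is DBSCAN, a well-established clustering algorithm."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The challenge in clustering is deciding which features to take into account. They fall into two categories: a statement's runtime features and its semantic logic. The advantage of the former is that similar statements can be grouped without even having to understand the SQL. But such modeling is sensitive to any change in the database: any physical change can make the predictions inaccurate. Another problem is that similar statements run at different levels of concurrency can have completely different runtime features, which also makes the data inaccurate. The other approach starts from the semantics of the SQL statement, such as its logical execution plan, which tables it reads, which indexes it uses, and so on. This information tends to be independent of the database's configuration. The paper does not conclude which is better; this is presumably still being explored. My view is that everything should be handed to the ML model to decide for itself. As I understand it, albeit superficially, one thing deep learning aims to do is reduce feature engineering and let the model decide which features matter more (corrections welcome)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the workload clusters in hand, the next task is to forecast when each cluster will show up again. Forecasting helps the database system cope with future workloads by tuning its configuration or scaling out ahead of time. The technique in the paper is to tag every executed statement with the ID of the cluster it is assigned to, record when each cluster occurs, and use that history to forecast when it will occur again (for a database deployed in production, most workloads should be periodic, for example scripts scheduled at fixed times, so forecasting should not be especially hard). The paper mentions that recurrent neural networks (RNNs), and long short-term memory (LSTM) networks in particular, are well suited to predicting periodic and repeating patterns, and that multiple RNNs can be used for forecasting."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With a model that forecasts future workloads, the next step is to apply optimizations that prepare the database in advance. The paper describes a management module that makes optimization decisions based on the forecast models and carries them out. Its first component is the action generator, which searches for actions that can improve performance. The objective function here is to reduce statement latency, but the paper notes that it can be extended to throughput, resource consumption, and so on. How does the system do this? First, it stores a catalog of possible optimization actions and records the effect each had on the database, akin to a step function."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now everything is in place. With the action catalog, the workload forecast models, and the current database configuration, the system can generate an action plan that moves step by step toward the target defined by the objective function. The paper's approach draws on control 
theory. It mentions the receding-horizon control model (RHCM), which is widely used in actual self-driving cars. Put simply, RHCM works like this: in each time epoch, the system uses the models to forecast the workloads likely to arrive and then searches the action catalog for a sequence of actions that minimizes the objective function (moving toward the optimization target). However, it applies only the first action in that sequence, waits for the database to respond to it, and repeats the process in the next epoch. The whole optimization can be likened to a tree search, with the root node as the first time epoch. The authors also explain why self-driving requires a high-performance database that can update its configuration dynamically: data has to be collected again from the updated system to judge whether the applied action had any effect."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My biggest takeaway from the paper is this: at the outset I had no idea how one would make a database system drive itself, but once the authors break the problem down into steps, the logic flows naturally, and it really does seem achievable by working through those steps one at a time. I think this is what first-rate researchers do: think a complex problem through, untangle it thread by thread, and then find a solution."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thanks for reading! In the next installment we continue with the first module, workload forecasting."}]}]}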
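The receding-horizon loop described in the architecture section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's or Peloton's implementation: the action names, the `cost` model, the `ACTION_COST` constant, and the use of the next few actual workloads as a stand-in for a forecasting model are all invented for this example. It shows the core idea only: plan a sequence of actions over the horizon, apply just the first one, then re-plan in the next epoch.

```python
from itertools import product

# Hypothetical action catalog (names and effects are illustrative only).
# Each action maps a configuration dict to a new configuration dict.
ACTIONS = {
    "noop": lambda cfg: dict(cfg),
    "add_index": lambda cfg: {**cfg, "index": True},
    "drop_index": lambda cfg: {**cfg, "index": False},
    "to_column_store": lambda cfg: {**cfg, "layout": "column"},
    "to_row_store": lambda cfg: {**cfg, "layout": "row"},
}
ACTION_COST = 5.0  # assumed one-off cost of applying any action except noop


def cost(cfg, workload):
    """Toy cost model: estimated latency of one epoch's (reads, writes)."""
    reads, writes = workload
    read_cost = 1.0 if cfg["layout"] == "column" else 3.0
    write_cost = 3.0 if cfg["layout"] == "column" else 1.0
    if cfg["index"]:
        read_cost *= 0.5   # an index speeds up reads
        write_cost *= 1.5  # and slows down writes
    return reads * read_cost + writes * write_cost


def plan(cfg, forecast):
    """Exhaustively search action sequences; return the cheapest and its cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=len(forecast)):
        state, total = cfg, 0.0
        for name, workload in zip(seq, forecast):
            state = ACTIONS[name](state)
            total += cost(state, workload)
            if name != "noop":
                total += ACTION_COST
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost


def receding_horizon(cfg, workloads, horizon=2):
    """In each epoch: plan over the horizon, apply only the first action."""
    applied = []
    for t in range(len(workloads)):
        forecast = workloads[t:t + horizon]  # stand-in for a forecast model
        seq, _ = plan(cfg, forecast)
        cfg = ACTIONS[seq[0]](cfg)
        applied.append(seq[0])
    return cfg, applied


if __name__ == "__main__":
    start = {"layout": "row", "index": False}
    epochs = [(100, 10), (100, 10), (10, 100)]  # (reads, writes) per epoch
    print(receding_horizon(start, epochs))
```

With the sample workloads, the first plan is `to_column_store` followed by `add_index`, but only `to_column_store` is applied. One epoch later the horizon includes a write-heavy epoch, so re-planning skips the index, and the system switches back to a row store in the final epoch. The exhaustive search grows exponentially with the horizon, which is why a real planner would need pruning or a smarter tree search.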