A Practice Summary of Big Data Collection and Storage Based on Kubernetes

**1. Introduction**

Our department's e-commerce big data platform has recently wrapped up its first phase. In a short time the team not only built a big data platform from zero to one, but also delivered something that satisfies the needs of multiple business parties. I was fortunate to take part in the build, and besides applauding my excellent teammates I also took some time to organize my notes. Today I'd like to walk through how we implemented data collection and storage with Kubernetes, covering the design, the principles, and the process. Below is the big data architecture diagram we referenced from Alibaba during the early design phase:

![Big data architecture diagram](https://static001.geekbang.org/infoq/89/891e47430533feac72d4cdd78aa0c476.png)

This article focuses on the "data collection" part of the diagram; "data computation" and "data services" are out of scope. During collection, a cleaning service running in Kubernetes continuously consumes business data that scheduled crawler jobs push into Kafka, and log collectors such as Fluent Bit and Fluentd compress the data the containers print to standard output and store it in AWS S3. If that sounds interesting, let's get started.

---

**2. Basics**

**2.1 Docker Log Management**

Our application services all run in Docker containers. Docker produces two kinds of logs: the engine logs of the dockerd runtime, and the container logs produced by the services inside the containers. We can ignore the engine logs here. Container logs are whatever reaches standard output (stdout) and standard error (stderr); logs from any other source are not managed by Docker. Docker takes everything containers write to stdout and stderr and redirects it to some destination through a logging driver. Docker supports many logging drivers, such as local, json-file, syslog, and journald; different drivers redirect logs to different destinations, and drivers can be swapped in a pluggable fashion, which makes log management flexible. The default driver is json-file, which stores logs on the local disk in JSON form, at /var/lib/docker/containers/&lt;container-id&gt;/&lt;container-id&gt;-json.log. Here is a simple sketch of the flow:

![Docker log flow](https://static001.geekbang.org/infoq/16/16884353fdc3139b5a4999ba80f810b3.jpeg)

There are many officially supported drivers; see Docker Containers Logging (https://docs.docker.com/config/containers/logging/configure) for details. You can check the configured driver with `docker info | grep Logging`, and configure the driver either with the `--log-driver` flag or in /etc/docker/daemon.json:

```json
{
  "log-driver": "syslog"
}
```

This practice uses Docker's default driver, json-file, so a basic understanding of the log flow above is all you need here. One thing worth paying attention to is that every logging driver has corresponding log-rotation options, for example rotating by single-file size and by file count. The json-file driver supports the following options:

- max-size: the maximum size a log file may reach before it is rotated; valid units are k, m, and g; defaults to -1 (unlimited);
- max-file: the maximum number of log files that may exist; if rotation would exceed this threshold, the oldest file is deleted; only effective when max-size is set; defaults to 1;
- labels: a comma-separated list of logging-related labels accepted when starting the Docker daemon;
- env: a comma-separated list of logging-related environment variables accepted when starting the Docker daemon;
- compress: whether rotated logs are compressed; defaults to disabled.

See: https://docs.docker.com/config/containers/logging/json-file
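Putting the options above together, a daemon.json that rotates logs at 10 MB per file, keeps at most three files per container, and compresses rotated files might look like this (the values are illustrative, not what we ran in production):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
```

Note that every value under log-opts must be a string (including numbers and booleans), otherwise dockerd will refuse to start.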
**2.2 Kubernetes Log Management**

Kubernetes likewise has two kinds of logs:

- components that run in containers, such as kube-scheduler and kube-proxy;
- components that do not run in containers, namely the kubelet and the container runtime (e.g. Docker);

On servers that use systemd, the kubelet and the container runtime write their logs to journald; without systemd, they write to .log files under the /var/log directory. System components inside containers also write their logs to /var/log; in a kubeadm-installed cluster they run as static Pods, so their logs usually live under /var/log/pods.

One point worth stressing: for application Pod logs, **Kubernetes does not manage the log-rotation policy, and log storage follows Docker's log-management policy**. With the default logging driver, the kubelet creates a symlink for each container's log under /var/log/containers/. The symlink points to the corresponding container log in the Pod's directory under /var/log/pods/, which is itself another symlink that finally resolves to the Docker engine's log storage, i.e. the container's log under /var/lib/docker/containers. The symlink file names carry Kubernetes metadata such as the Pod ID, the namespace and the container ID, which is a great convenience for log collection. A simple sketch:

![Kubernetes log symlink chain](https://static001.geekbang.org/infoq/6d/6d94aa38d054f686bda834a9b0b9e544.jpeg)

---

**3. Advanced**

The most popular log-collection stack on Kubernetes is Elasticsearch, Fluentd and Kibana, which is also the approach the official documentation currently favors. Here we only use the F of the EFK stack: Fluentd, together with its offshoot Fluent Bit. Both projects aim to collect, process and deliver log data, but a few key differences make them suitable for different jobs:

- Fluentd: designed to aggregate logs from many inputs, process the data, and route it to different outputs. Its engine has high-performance queue-processing threads that can consume and route large batches of logs quickly, and it has a rich ecosystem of input and output plugins (more than 650);
- Fluent Bit: designed to run in distributed environments where compute power is highly constrained and overhead (memory and CPU) is a primary concern, so it is extremely lightweight (on the order of kilobytes of footprint) and fast. It is a good fit for collecting, processing and forwarding logs, but not for log aggregation;

An official chart comparing the two:

![Fluentd vs Fluent Bit comparison](https://static001.geekbang.org/infoq/c4/c4199e2ccdb47a2c08f7748c4b5ee97f.png)

Fluentd and Fluent Bit process data in a similar pipeline, with Input, Parser, Filter, Buffer, Routing and Output stages, as the official diagram shows:

![Data pipeline stages](https://static001.geekbang.org/infoq/c1/c18892ff02bf7e96f7783ce8516c3eab.png)

1. Input: a variety of input plugins collect information from different sources, such as log files or operating-system information;
2. Parser: parsers turn raw strings into structured records (e.g. JSON);
3. Filter: filters modify or drop events before they are routed;
4. Buffer: a caching layer for event data, preferring memory and falling back to the file system;
5. Router: sends different classes of data to different outputs, usually via Tag and Match;
6. Output: output plugins deliver data to remote services, local files, standard output and so on;

In this practice we use Fluent Bit as the log forwarder, responsible for collecting and forwarding data, and Fluentd as the log aggregator, responsible for aggregating and storing it. They cooperate within the system, each playing to its strengths. For more documentation, see the Fluentd (https://docs.fluentd.org) and Fluent Bit (https://docs.fluentbit.io/manual) sites.
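The Tag/Match routing mentioned above is easiest to see in a small, purely illustrative Fluent Bit config (not our production setup): the input tags every record it emits, and each output only receives records whose tag matches its Match pattern:

```
[INPUT]
    Name  tail
    Path  /var/log/containers/*.log
    Tag   kube.*

[OUTPUT]
    Name   stdout
    Match  kube.*

[OUTPUT]
    Name   null
    Match  host.*
```

Here everything tailed from the container log directory is tagged kube.&lt;path&gt;, so it flows to the stdout output, while records tagged host.* (none in this sketch) would be discarded by the null output.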
---

**4. Architecture**

The native capabilities provided by a container engine or runtime are usually not enough for a complete logging solution: when a container crashes, a Pod is evicted, or a node goes down, we still want access to the application logs, so the logs need storage and a lifecycle independent of the node. We exploit the fact that anything a containerized application writes to stdout and stderr is captured by the container engine and redirected somewhere: the log forwarder Fluent Bit collects the logs and pushes them to the log aggregator Fluentd, which then aggregates the data and stores it in AWS S3. Because the forwarder must run on every node, we deploy it as a DaemonSet; the aggregator can scale in and out on demand, so we deploy it as a Deployment. My rough architecture sketch:

![Collection and storage architecture](https://static001.geekbang.org/infoq/b8/b8c53884b64f626939c56e6a4e446934.jpeg)

---

**5. Practice**

After all that theory and architecture, it is finally time to practice. We need a basic environment, including:

- a Docker Hub account, for hosting Docker images;
- a Kubernetes cluster, for orchestrating containers and deploying the applications;

Next I have prepared code samples for three services: the service that receives and cleans business data, the Fluent Bit that collects and forwards the logs, and the Fluentd that aggregates the data and stores it compressed.

**5.1 The Cleaning Service**

We use zap as the logging library and print data in a loop to simulate the cleaning service handling business logic. The sample code:

```go
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync() // flushes buffer, if any
	sugar := logger.Sugar()

	for {
		sugar.Infow("just a example",
			"author", "tony",
			"today", time.Now().Format("2006-01-02 15:04:05"),
			"yesterday", time.Now().AddDate(0, 0, -1).Format("2006-01-02 15:04:05"),
		)
		time.Sleep(time.Duration(5) * time.Second)
	}
}
```

Next we write the build script that packages it into an image, for the Deployment in the Kubernetes cluster to use later. The Dockerfile:

```dockerfile
# build stage
FROM golang:latest as builder
LABEL stage=gobuilder
WORKDIR /build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o example

# final stage
FROM scratch
COPY --from=builder /build/example /
EXPOSE 8080
ENTRYPOINT ["/example"]
```

Run the following to build the image and push it to Docker Hub (replace &lt;username&gt; with your Docker Hub account):

```shell
# build
docker build -t <username>/logging:latest . && docker image prune -f
# push
docker push <username>/logging:latest
```

Finally we deploy the code into the Kubernetes cluster to simulate our running cleaning service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: logging
  labels:
    app: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: <username>/logging:latest
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
          requests:
            cpu: 10m
            memory: 20Mi
        ports:
        - containerPort: 8080
      terminationGracePeriodSeconds: 30
```
**5.2 Log Forwarder: Fluent Bit**

As the log forwarder, Fluent Bit is responsible for collecting and forwarding the data. It needs three pieces of preparation: the basic authorization manifest, the project configuration, and the DaemonSet deployment manifest.

**Authorization**

```yaml
# fluentbit_rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentbit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluentbit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentbit-read
subjects:
- kind: ServiceAccount
  name: fluentbit
  namespace: logging
```

**Project Configuration**

```yaml
kind: ConfigMap
metadata:
  name: fluentbit-config
  namespace: logging
apiVersion: v1
data:
  fluent-bit.conf: |-
    [SERVICE]
        Flush             1
        Daemon            Off
        Log_Level         info
        Parsers_File      parsers.conf
        HTTP_Server       On
        HTTP_Listen       0.0.0.0
        HTTP_Port         2020
    [INPUT]
        Name              tail
        Tag               kube.*
        # Path            /var/log/containers/*.log
        Path              /var/log/containers/*logging_example*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Ignore_Older      24h
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
    [OUTPUT]
        Name            forward
        Match           *
        Host            ${FLUENTD_HOST}
        Port            ${FLUENTD_PORT}
        Time_as_Integer True
  parsers.conf: |-
    [PARSER]
        Name        apache
        Format      regex
        Regex       ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z
```
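To complete the picture from the architecture section, the aggregator side receives the forward traffic from Fluent Bit and ships compressed chunks to S3. A minimal Fluentd sketch of that role, assuming the fluent-plugin-s3 output plugin is installed and using placeholder bucket, region and paths rather than our real settings:

```
# fluentd.conf (sketch; bucket/region/paths are placeholders)
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match kube.**>
  @type s3
  s3_bucket my-log-bucket
  s3_region us-east-1
  path logs/
  store_as gzip              # compress objects before upload
  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600             # flush one chunk per hour
    timekey_wait 10m
    chunk_limit_size 256m
  </buffer>
</match>
```

Buffering to the local file system before upload is what lets the aggregator batch and compress logs instead of writing one object per record.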