Apache Kafka Log Storage Format




Log Format




A log for a topic named "my_topic" with two partitions consists of two directories (namely my_topic_0 and my_topic_1) populated with data files containing the messages for that topic.


The format of the log files is a sequence of "log entries"; each log entry is a 4-byte integer N storing the message length, which is followed by the N message bytes.
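A minimal sketch of parsing this layout, assuming a plain FileChannel and ignoring partial reads (illustrative only, not Kafka's actual code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    class LogEntryReader {
        // Read one log entry at the given file position: a 4-byte
        // big-endian length N followed by N message bytes.
        static byte[] readEntry(FileChannel ch, long position) throws IOException {
            ByteBuffer lenBuf = ByteBuffer.allocate(4);
            ch.read(lenBuf, position);       // simplification: assumes a full read
            lenBuf.flip();
            int n = lenBuf.getInt();
            ByteBuffer payload = ByteBuffer.allocate(n);
            ch.read(payload, position + 4);  // simplification: assumes a full read
            return payload.array();
        }
    }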


Each message is uniquely identified by a 64-bit integer offset giving the byte position of the start of this message in the stream of all messages ever sent to that topic on that partition.


The on-disk format of each message is given below.
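Roughly, the on-disk layout the linked documentation describes for the pre-0.11 message format (field widths per that documentation version, reproduced here for reference; treat as illustrative, not current Kafka):

    offset         : 8 bytes
    message length : 4 bytes
    crc            : 4 bytes
    magic value    : 1 byte
    attributes     : 1 byte
    timestamp      : 8 bytes (only when magic value > 0)
    key length     : 4 bytes
    key            : K bytes
    value length   : 4 bytes
    value          : V bytes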


Each log file is named with the offset of the first message it contains. So the first file created will be 00000000000.kafka, and each additional file will have an integer name roughly S bytes from the previous file, where S is the max log file size given in the configuration.
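A minimal sketch of the naming scheme (segmentFileName is a hypothetical helper; modern Kafka pads offsets to 20 digits and uses a .log suffix, while this older doc shows 11 digits and .kafka):

    class SegmentNaming {
        // Build a segment file name from the offset of the first
        // message the file will contain; 11 digits to match the
        // 00000000000.kafka example above.
        static String segmentFileName(long baseOffset) {
            return String.format("%011d.kafka", baseOffset);
        }

        public static void main(String[] args) {
            System.out.println(segmentFileName(0L));          // 00000000000.kafka
            System.out.println(segmentFileName(1073741824L)); // next file, ~1GB later
        }
    }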
  
  
  
The use of the message offset as the message id is unusual.
  
Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker.
   
   
But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value.
   
   
   
Furthermore, the complexity of maintaining the mapping from a random id to an offset requires a heavyweight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure.
    
    
Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However, once we settled on a counter, the jump to directly using the offset seemed natural; after all, both are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail, and we went with the more efficient approach.
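A minimal sketch of the per-partition counter idea, assuming nothing beyond the JDK (names are illustrative, not Kafka's internals):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    class OffsetAssigner {
        // One monotonically increasing counter per partition; the
        // assigned value serves directly as the message's offset.
        private final Map<Integer, AtomicLong> nextOffset = new ConcurrentHashMap<>();

        long assign(int partitionId) {
            return nextOffset
                .computeIfAbsent(partitionId, p -> new AtomicLong(0))
                .getAndIncrement();
        }
    }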
    
    
    Writes
    
The log allows serial appends which always go to the last file.

This file is rolled over to a fresh file when it reaches a configurable size (say 1GB).

The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced.

This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
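For reference, a sketch of how these knobs look in today's broker configuration, assuming the current property names (note log.flush.interval.ms is in milliseconds, whereas the text above speaks in seconds):

    import java.util.Properties;

    class FlushConfigExample {
        public static void main(String[] args) {
            Properties brokerProps = new Properties();
            brokerProps.put("log.flush.interval.messages", "10000"); // M: messages between forced flushes
            brokerProps.put("log.flush.interval.ms", "1000");        // S: time between forced flushes
            brokerProps.put("log.segment.bytes", "1073741824");      // roll to a fresh file at 1GB
            System.out.println(brokerProps);
        }
    }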
   
   
   Reads
   
Reads are done by giving the 64-bit logical offset of a message and an S-byte max chunk size. This will return an iterator over the messages contained in the S-byte buffer.

S is intended to be larger than any single message, but in the event of an abnormally large message, the read can be retried multiple times, each time doubling the buffer size, until the message is read successfully.
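A minimal sketch of that retry loop, reusing the 4-byte length prefix from the Log Format section (illustrative, not Kafka's code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    class RetryingReader {
        // Read the message starting at filePosition, doubling the chunk
        // size until the whole message fits or maxChunk is reached.
        static byte[] readMessage(FileChannel ch, long filePosition,
                                  int initialChunk, int maxChunk) throws IOException {
            int chunk = initialChunk;
            while (true) {
                ByteBuffer buf = ByteBuffer.allocate(chunk);
                ch.read(buf, filePosition);
                buf.flip();
                if (buf.remaining() >= 4) {
                    int n = buf.getInt();          // the 4-byte message length
                    if (buf.remaining() >= n) {    // the whole message fit
                        byte[] msg = new byte[n];
                        buf.get(msg);
                        return msg;
                    }
                }
                if (chunk >= maxChunk) {
                    throw new IOException("message exceeds max chunk size");
                }
                chunk = Math.min(chunk * 2, maxChunk);
            }
        }
    }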
    
    
    A maximum message and buffer size can be specified to make the server reject messages larger than some size, and to give a bound to the client on the maximum it needs to ever read to get a complete message.
    
It is likely that the read buffer ends with a partial message; this is easily detected by the size delimiting.
   
   
   
The actual process of reading from an offset requires first locating the log segment file in which the data is stored, calculating the file-specific offset from the global offset value, and then reading from that file offset.

The search is done as a simple binary search variation against an in-memory range maintained for each file.
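A minimal sketch of that lookup, using a sorted map of base offsets in place of a hand-rolled binary search (TreeMap.floorEntry finds the greatest base offset <= the requested offset; names are illustrative, not Kafka's internals):

    import java.util.Map;
    import java.util.TreeMap;

    class SegmentIndex {
        // base offset of each segment -> its file name
        private final TreeMap<Long, String> segments = new TreeMap<>();

        void addSegment(long baseOffset, String fileName) {
            segments.put(baseOffset, fileName);
        }

        // Locate the segment holding `offset`; the file-specific
        // position is then derived from offset - entry.getKey().
        Map.Entry<Long, String> locate(long offset) {
            return segments.floorEntry(offset);
        }
    }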
   
The log provides the capability of getting the most recently written message to allow clients to start subscribing as of "right now".
   
   This is also useful in the case the consumer fails to consume its data within its SLA-specified number of days. 
   
    In this case when the client attempts to consume a non-existent offset it is given an OutOfRangeException and can either reset itself or fail as appropriate to the use case. 
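For what it's worth, in the current consumer API this "reset or fail" choice is usually expressed with the auto.offset.reset setting; a minimal sketch:

    import java.util.Properties;

    class ResetPolicyExample {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            // "earliest" / "latest" reset automatically on an out-of-range
            // offset; "none" surfaces the error to the application instead.
            consumerProps.put("auto.offset.reset", "latest");
            System.out.println(consumerProps);
        }
    }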
    
    
    
    
    http://kafka.apache.org/documentation/#log
   
    
     
    
    
       
    
