Analysis of the lightLDA dump_binary format

Original comment:
/*
* Output file format:
* 1, the first 4 byte indicates the number of docs in this block
* 2, the 4 * (doc_num + 1) bytes indicate the offset of reach doc
* an example
* 3 // there are 3 docs in this block
* 0 // the offset of the 1-st doc
* 10 // the offset of the 2-nd doc, with this we know the length of the 1-st doc is 5 = 10/2
* 16 // the offset of the 3-rd doc, with this we know the length of the 2-nd doc is 3 = (16-10)/2
* 24 // with this, we know the length of the 3-rd doc is 4 = (24 - 16)/2
* w11 t11 w12 t12 w13 t13 w14 t14 w15 t15 // the token-topic list of the 1-st doc
* w21 t21 w22 t22 w23 t23 // the token-topic list of the 2-nd doc
* w31 t31 w32 t32 w33 t33 w34 t34 // the token-topic list of the 3-rd doc

 * the class block_stream helps generate such binary format file, usage:
 * int doc_num = 3;
 * int64_t* offset_buf = new int64_t[doc_num + 1];
 *
 * block_stream bs;
 * bs.open("block");
 * bs.write_empty_header(offset_buf, doc_num);
 * ...
 * // update offset_buf and doc_num...

 * bs.write_doc(doc_buf, doc_idx);
 * ...
 * bs.write_real_header(offset_buf, doc_num);
 * bs.close();
 */
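
The offset arithmetic in the comment's example can be checked with a short Python sketch. The variable names are illustrative and are not part of lightLDA; the only assumption is the one the comment states, that each token takes two slots (word, topic).

```python
# Offsets from the example in the comment: doc_num + 1 entries for 3 docs.
offsets = [0, 10, 16, 24]

# Each token occupies two slots (word id, topic id), so the token count of
# a document is half the distance between consecutive offsets.
lengths = [(end - start) // 2 for start, end in zip(offsets, offsets[1:])]

print(lengths)  # [5, 3, 4]
```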

Analysis:
With two documents, the file contents (apparently printed as 32-bit integers) look like this:
2, 0, 0, 0, 59, 0, 130, 0, 0, 2270, 0, 2865, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0, 0, 2270, 0, 2865, 0, 6357, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0

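Every value is followed by a 0. In the header, this is consistent with the document count and the offsets each being stored as a 64-bit value (the usage example declares offset_buf as int64_t) rather than in the 4 bytes the comment mentions. In the document body, the 0 sits in the topic slot of each word-topic pair described in the comment. The offsets appear to count 32-bit values: document 1 occupies 59 = 1 + 2 × 29 of them (one leading 0 plus 29 word-topic pairs), and document 2 occupies 130 - 59 = 71 = 1 + 2 × 35.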

2, 0, ------------- 2 documents
0, 0, 59, 0, 130, 0, ----------- document 1 spans offsets 0 to 59, document 2 spans offsets 59 to 130
0, ------------- each document begins with a single 0
2270, 0, 2865, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0, --------- each word id in the document is followed by a 0
0,
2270, 0, 2865, 0, 6357, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0
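
The layout read off the dump above can be sketched as a small Python parser. This is a minimal sketch under assumptions inferred from the sample, not taken from the lightLDA source: little-endian byte order, a 64-bit document count, 64-bit offsets measured in 32-bit units, and one leading 32-bit value per document before its (word, topic) pairs. The function name parse_block is hypothetical.

```python
import struct

def parse_block(data: bytes):
    """Parse a dumped block using the layout inferred from the sample dump."""
    # Assumption: little-endian, document count stored as one 64-bit integer.
    (doc_num,) = struct.unpack_from("<q", data, 0)
    # Assumption: doc_num + 1 64-bit offsets follow, counted in 32-bit units.
    offsets = struct.unpack_from(f"<{doc_num + 1}q", data, 8)
    body_start = 8 + 8 * (doc_num + 1)

    docs = []
    for start, end in zip(offsets, offsets[1:]):
        ints = struct.unpack_from(f"<{end - start}i", data, body_start + 4 * start)
        if not ints:
            docs.append((None, []))  # empty document: no leading value
            continue
        # First value is the leading 0 seen in the dump; the rest are pairs.
        lead, rest = ints[0], ints[1:]
        docs.append((lead, list(zip(rest[0::2], rest[1::2]))))
    return docs

# Tiny synthetic block with 2 documents, reusing word ids from the sample.
doc1 = [0, 2270, 0, 2865, 0]  # leading 0, then (word, topic) pairs
doc2 = [0, 6357, 0]
header = struct.pack("<q", 2) + struct.pack("<3q", 0, len(doc1), len(doc1) + len(doc2))
body = struct.pack(f"<{len(doc1) + len(doc2)}i", *doc1, *doc2)
sample = header + body

print(parse_block(sample))  # [(0, [(2270, 0), (2865, 0)]), (0, [(6357, 0)])]
```

Running the same function on a real dump file (for example, the bytes read with open(path, "rb").read()) would show whether these assumptions hold for a given lightLDA build.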
