Detecting Whether a File Is Binary

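At work I deal with STL files, which sometimes arrive as binary and sometimes as ASCII, so I wanted a function that detects the format first and then lets me choose how to open the file.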

Without further ado, here is the code!

#include <cstddef>  // size_t
#include <cstdio>   // FILE, fopen, fread, fclose

enum FileTypeEnum
  {
    FileTypeUnknown,
    FileTypeBinary,
    FileTypeText
  };

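// Classify a file as binary or text by sampling at most 'length' bytes
// from its start. 'percent_bin' is a fraction in [0, 1] (0.05 means 5%),
// not a percentage: the file is reported as binary when at least that
// fraction of the sampled bytes is non-textual.
FileTypeEnum
DetectFileType(const char *filename,
               unsigned long length,
               double percent_bin)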
{
  if (!filename || percent_bin < 0)
    {
    return FileTypeUnknown;
    }

  // Open in binary mode so the bytes are read exactly as stored
  FILE *fp = fopen(filename, "rb");
  if (!fp)
    {
    return FileTypeUnknown;
    }

  // Allocate buffer and read bytes

  unsigned char *buffer = new unsigned char [length];
  size_t read_length = fread(buffer, 1, length, fp);
  fclose(fp);
  if (read_length == 0)
    {
    // Empty or unreadable file: free the buffer before bailing out
    delete [] buffer;
    return FileTypeUnknown;
    }

  // Loop over contents and count

  size_t text_count = 0;

  const unsigned char *ptr = buffer;
  const unsigned char *buffer_end = buffer + read_length;

  while (ptr != buffer_end)
    {
    // Printable ASCII [0x20, 0x7E]; 0x7F (DEL) is a control character
    if ((*ptr >= 0x20 && *ptr <= 0x7E) ||
        *ptr == '\n' ||
        *ptr == '\r' ||
        *ptr == '\t')
      {
      text_count++;
      }
    ptr++;
    }

  delete [] buffer;

  double current_percent_bin =
    (static_cast<double>(read_length - text_count) /
     static_cast<double>(read_length));

  if (current_percent_bin >= percent_bin)
    {
    return FileTypeBinary;
    }

  return FileTypeText;
}

Example call:

DetectFileType(filename, 256, 0.05);
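
Since the goal is to pick the open mode from the detection result, that last step can be sketched as follows. This is only an illustration: OpenModeFor is a hypothetical helper that does not appear in the original code, and the enum is repeated so the snippet compiles on its own.

```cpp
// Repeated from the listing above so this sketch compiles on its own.
enum FileTypeEnum
{
  FileTypeUnknown,
  FileTypeBinary,
  FileTypeText
};

// Hypothetical helper (not in the original post): map a detection result
// to an fopen() mode string. Returns nullptr when the type is unknown.
inline const char* OpenModeFor(FileTypeEnum type)
{
  switch (type)
  {
    case FileTypeBinary: return "rb";
    case FileTypeText:   return "r";
    default:             return nullptr;
  }
}
```

A caller would write something like OpenModeFor(DetectFileType(filename, 256, 0.05)) and only call fopen when the result is not nullptr. A binary STL file begins with an 80-byte header that often contains readable text, so a sample size that reaches well past the header, such as 256, seems a reasonable choice.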

The idea behind the algorithm is simple:

  Up to 'length' bytes are read from the file. If at least the fraction
  'percent_bin' of those bytes are non-textual, the file is considered
  binary, otherwise textual. Textual bytes are those in the ASCII range
  [0x20, 0x7E], plus \n, \r, and \t.

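In other words, a chunk of bytes is read from the file and the non-textual bytes in it are counted. If their share of the bytes actually read reaches percent_bin (a fraction, so 0.05 means 5%), the file is treated as binary.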

Textual bytes here are \n, \r, \t, and any byte whose ASCII value lies in the range [0x20, 0x7E].
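
To make the counting rule concrete, here is a minimal sketch that applies it to a buffer already in memory. IsTextByte and NonTextFraction are hypothetical names introduced for illustration; they are not part of the original function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when a byte counts as "textual" under the
// rule described above (printable ASCII plus \n, \r, \t).
inline bool IsTextByte(unsigned char c)
{
  return (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r' || c == '\t';
}

// Hypothetical helper: fraction of non-textual bytes in data[0, size).
// The caller must pass a non-empty buffer.
inline double NonTextFraction(const unsigned char* data, std::size_t size)
{
  assert(data != nullptr && size > 0);
  std::size_t non_text = 0;
  for (std::size_t i = 0; i < size; ++i)
  {
    if (!IsTextByte(data[i]))
    {
      ++non_text;
    }
  }
  return static_cast<double>(non_text) / static_cast<double>(size);
}
```

For the 11 bytes of "solid cube\n" the fraction is 0.0, so the data counts as text for any positive threshold. For the four bytes 0x00, 0x01, 'a', 'b' it is 0.5, which is well above a threshold of 0.05 and therefore binary.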

The file never has to be read into memory in full; only its first 'length' bytes are sampled.
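
As an illustration of sampling only the start of a file, the sketch below reads a bounded prefix. It deliberately swaps in std::ifstream and std::vector for the FILE* and new[] used in the listing, so there is no buffer to free by hand. ReadPrefix is a hypothetical name, not something from the original code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read at most max_bytes from the start of a file.
// Returns an empty vector if the file cannot be opened, is empty, or
// max_bytes is zero.
inline std::vector<unsigned char> ReadPrefix(const std::string& path,
                                             std::size_t max_bytes)
{
  std::ifstream in(path, std::ios::binary);
  if (!in || max_bytes == 0)
  {
    return {};
  }

  std::vector<unsigned char> buffer(max_bytes);
  in.read(reinterpret_cast<char*>(buffer.data()),
          static_cast<std::streamsize>(max_bytes));

  // gcount() reports how many bytes were actually read, which may be
  // fewer than requested when the file is shorter than max_bytes.
  buffer.resize(static_cast<std::size_t>(in.gcount()));
  return buffer;
}
```

However large the file is, memory use is bounded by max_bytes, and a short file simply yields a shorter buffer.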
