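At work I often have to process STL files. Sometimes the file I receive is binary and sometimes it is ASCII, so
I wanted a function that detects which one it is before choosing how to open the file.
Without further ado, here is the code!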
#include <cstdio>   // FILE, fopen, fread, fclose

enum FileTypeEnum
{
  FileTypeUnknown,
  FileTypeBinary,
  FileTypeText
};
FileTypeEnum
DetectFileType(const char *filename,
               unsigned long length,
               double percent_bin)
{
  // Reject invalid arguments
  if (!filename || percent_bin < 0)
  {
    return FileTypeUnknown;
  }
  // Open in binary mode so that no newline translation alters the bytes
  FILE *fp = fopen(filename, "rb");
  if (!fp)
  {
    return FileTypeUnknown;
  }
  // Allocate buffer and read up to 'length' bytes
  unsigned char *buffer = new unsigned char [length];
  size_t read_length = fread(buffer, 1, length, fp);
  fclose(fp);
  if (read_length == 0)
  {
    delete [] buffer;  // free the buffer on this early return as well
    return FileTypeUnknown;
  }
  // Loop over contents and count the textual bytes
  size_t text_count = 0;
  const unsigned char *ptr = buffer;
  const unsigned char *buffer_end = buffer + read_length;
  while (ptr != buffer_end)
  {
    // Printable ASCII is [0x20, 0x7E]; 0x7F (DEL) is a control character
    if ((*ptr >= 0x20 && *ptr <= 0x7E) ||
        *ptr == '\n' ||
        *ptr == '\r' ||
        *ptr == '\t')
    {
      text_count++;
    }
    ptr++;
  }
  delete [] buffer;
  // Fraction of the bytes read that are non-textual
  double current_percent_bin =
    (static_cast<double>(read_length - text_count) /
     static_cast<double>(read_length));
  if (current_percent_bin >= percent_bin)
  {
    return FileTypeBinary;
  }
  return FileTypeText;
}
Usage example:
DetectFileType(filename, 256, 0.05);
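The principle behind the algorithm is simple:
Up to 'length' bytes are read from the file. If the fraction of non-textual bytes among them is at least 'percent_bin', the file is considered binary; otherwise it is considered text. Textual bytes are those in the ASCII range [0x20, 0x7E], plus \n, \r and \t.
Note that percent_bin is a fraction, not a percentage: 0.05 means 5%. With the call above, if all 256 bytes are read, 13 or more non-text bytes (13/256, about 5.1%) are enough to classify the file as binary.
Because only the first 'length' bytes are sampled, the whole file never has to be read into memory.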
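To make the counting rule concrete, here is a small sketch that applies the same test to a buffer already in memory. IsTextByte and NonTextFraction are hypothetical names introduced only for illustration, and returning -1.0 for an empty buffer is an assumption of this sketch rather than behaviour of the original function.

```cpp
#include <cstddef>

// Hypothetical helper: true for bytes the heuristic treats as text,
// i.e. printable ASCII [0x20, 0x7E] plus \n, \r and \t.
bool IsTextByte(unsigned char c)
{
  return (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r' || c == '\t';
}

// Hypothetical helper: fraction of non-text bytes in data[0..size).
// Returns -1.0 for a null or empty buffer, where no ratio exists.
double NonTextFraction(const unsigned char *data, std::size_t size)
{
  if (!data || size == 0)
  {
    return -1.0;
  }
  std::size_t text_count = 0;
  for (std::size_t i = 0; i < size; ++i)
  {
    if (IsTextByte(data[i]))
    {
      ++text_count;
    }
  }
  return static_cast<double>(size - text_count) /
         static_cast<double>(size);
}
```

For the six bytes of "solid\n" every byte is textual, so the fraction is 0 and a threshold of 0.05 would classify the data as text. For the four bytes 'a', 0x00, 'b', 'c' one byte in four is non-textual, so the fraction is 0.25 and the same threshold would classify the data as binary.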