UCS, Unicode, and UTF-8 Encoding

An encoding that uses four octets (8-bit bytes) for every character is called UCS-4; one that uses two octets for every character is called UCS-2.
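As a rough illustration of the two fixed-width forms, the sketch below serializes a code point as UCS-4 and as UCS-2. The helper names EncodeUcs4BE and EncodeUcs2BE and the choice of big-endian byte order are assumptions made for this example, not part of the original text.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: write a code point as UCS-4 (four octets, big-endian).
std::array<std::uint8_t, 4> EncodeUcs4BE(std::uint32_t cp)
{
    assert(cp <= 0x7FFFFFFFu); // UCS-4 values use at most 31 bits
    return { static_cast<std::uint8_t>(cp >> 24),
             static_cast<std::uint8_t>((cp >> 16) & 0xFF),
             static_cast<std::uint8_t>((cp >> 8) & 0xFF),
             static_cast<std::uint8_t>(cp & 0xFF) };
}

// Hypothetical helper: write a code point as UCS-2 (two octets, big-endian).
// Only values up to 0xFFFF fit in two octets.
std::array<std::uint8_t, 2> EncodeUcs2BE(std::uint32_t cp)
{
    assert(cp <= 0xFFFFu);
    return { static_cast<std::uint8_t>(cp >> 8),
             static_cast<std::uint8_t>(cp & 0xFF) };
}
```

Every character costs the same number of octets in these forms, which is what the variable-length UTF-8 scheme avoids for small values.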

 

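Definition of UTF-8:
In UTF-8, a character is encoded as a sequence of 1 to 6 octets. In a sequence of a single octet, the high-order bit is 0 and the remaining 7 bits encode the character value. In a sequence of n octets (n > 1), the first octet has its n high-order bits set to 1, followed by one bit set to 0; the remaining bits of that octet hold bits of the character value. Every following octet has its high-order bit set to 1 and the next bit set to 0, leaving 6 bits per octet for bits of the character value. (This is the original 31-bit definition; the current UTF-8 standard, RFC 3629, limits sequences to 4 octets, covering values up to 10FFFF.)
The table below summarizes these octet formats. The letter x marks a bit taken from the UCS-4 character value being encoded.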

UCS-4 range (hex)        UTF-8 octet sequence (binary)
0000 0000 - 0000 007F    0xxxxxxx
0000 0080 - 0000 07FF    110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
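The table can be turned into a general encoder fairly directly. The following is a minimal sketch that covers all six rows as written above; the names Utf8Length and EncodeUtf8 are hypothetical, and the code follows the 31-bit table rather than the 4-octet limit of current UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of octets the table assigns to a UCS-4 value (1 to 6).
std::size_t Utf8Length(std::uint32_t cp)
{
    assert(cp <= 0x7FFFFFFFu); // the table stops at 7FFF FFFF
    if (cp <= 0x7Fu)      return 1;
    if (cp <= 0x7FFu)     return 2;
    if (cp <= 0xFFFFu)    return 3;
    if (cp <= 0x1FFFFFu)  return 4;
    if (cp <= 0x3FFFFFFu) return 5;
    return 6;
}

// Encode cp into out, which must have room for 6 octets.
// Returns the number of octets written.
std::size_t EncodeUtf8(std::uint32_t cp, unsigned char *out)
{
    const std::size_t n = Utf8Length(cp);
    if (n == 1) {
        out[0] = static_cast<unsigned char>(cp); // 0xxxxxxx
        return 1;
    }
    // Fill the continuation octets from last to second: 10xxxxxx each.
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
        cp >>= 6;
    }
    // Lead octet: n high bits set to 1, then a 0, then the remaining bits.
    const unsigned lead = (0xFF00u >> n) & 0xFFu;
    out[0] = static_cast<unsigned char>(lead | cp);
    return n;
}
```

Filling the sequence from the last octet backwards means each step only needs the low six bits of the value, so one loop serves every row of the table.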

 

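#include <string.h>  // memset
#include <windows.h> // WCHAR (a 16-bit character type on Windows)

// Encodes one UCS-2 value as UTF-8.
// buffer must have room for at least 4 bytes; it is zero-filled first.
// Surrogate values (0xd800..0xdfff) are not treated specially.
// Returns the number of bytes written (1 to 3).
int UnicodeToUTF8(WCHAR ucs2, unsigned char *buffer)
{
    memset(buffer, 0, 4);
    if (ucs2 <= 0x007f) // one byte of UTF-8
    {
        buffer[0] = (unsigned char)ucs2;
        return 1;
    }
    if (ucs2 <= 0x07ff) // two bytes of UTF-8
    {
        buffer[0] = (unsigned char)(0xc0 | ((ucs2 >> 6) & 0x1f));
        buffer[1] = (unsigned char)(0x80 | (ucs2 & 0x3f));
        return 2;
    }
    // 0x0800..0xffff: three bytes of UTF-8 (the lead byte carries 4 bits)
    buffer[0] = (unsigned char)(0xe0 | ((ucs2 >> 12) & 0x0f));
    buffer[1] = (unsigned char)(0x80 | ((ucs2 >> 6) & 0x3f));
    buffer[2] = (unsigned char)(0x80 | (ucs2 & 0x3f));
    return 3;
}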

The following converts UTF-8 back to Unicode:

 

// Decodes the first UTF-8 sequence in buffer into one UCS-2 value.
// Continuation bytes are not validated; the caller must supply a complete sequence.
WCHAR UTF8ToUnicode(const unsigned char *buffer)
{
    WCHAR temp = 0;
    if (buffer[0] < 0x80) // one byte of UTF-8
    {
        temp = buffer[0];
    }
    else if (buffer[0] < 0xc0) // continuation byte, not the first byte of a character
    {
        // 0xfeff serves as the error marker. It is also the byte order mark,
        // so a caller cannot tell this error apart from a real decoded U+FEFF.
        return 0xfeff;
    }
    else if (buffer[0] < 0xe0) // two bytes of UTF-8
    {
        temp = (WCHAR)(((buffer[0] & 0x1f) << 6) | (buffer[1] & 0x3f));
    }
    else if (buffer[0] < 0xf0) // three bytes of UTF-8
    {
        temp = (WCHAR)(((buffer[0] & 0x0f) << 12) | ((buffer[1] & 0x3f) << 6) | (buffer[2] & 0x3f));
    }
    return temp; // sequences longer than 3 bytes return 0
}
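The decoder above trusts its input. A caller that cannot make that assumption might first classify the lead byte and check the continuation bytes. The sketch below shows one way to do this; Utf8SequenceLength and HasValidContinuations are hypothetical helpers, not part of the original code.

```cpp
#include <cassert>

// Hypothetical helper: sequence length implied by the first octet.
// Returns 0 for a continuation octet (10xxxxxx) and for 0xFE / 0xFF,
// which never start a sequence.
int Utf8SequenceLength(unsigned char first)
{
    if (first < 0x80) return 1; // 0xxxxxxx
    if (first < 0xC0) return 0; // 10xxxxxx, continuation octet
    if (first < 0xE0) return 2; // 110xxxxx
    if (first < 0xF0) return 3; // 1110xxxx
    if (first < 0xF8) return 4; // 11110xxx
    if (first < 0xFC) return 5; // 111110xx
    if (first < 0xFE) return 6; // 1111110x
    return 0;
}

// True when octets p[1] .. p[n-1] all have the form 10xxxxxx.
bool HasValidContinuations(const unsigned char *p, int n)
{
    assert(p != nullptr);
    for (int i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80) return false;
    }
    return true;
}
```

Returning a length of 0 for bad lead bytes keeps the error signal separate from the decoded value, which sidesteps the ambiguity of using a real code point such as 0xfeff as the marker.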

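Example: Unicode 0x678 = 110 0111 1000 (binary) = UTF-8 1101 1001 1011 1000 (0xD9 0xB8). The low six bits (111000) go into the second byte after the prefix 10, and the remaining five bits (11001) go into the first byte after the prefix 110.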
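The arithmetic of this example can be checked mechanically. Here is a small sketch restricted to the two-byte case; the function names TwoByteLead, TwoByteTrail, FromTwoBytes, and CheckWorkedExample are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Lead octet of a two-octet sequence: 110 followed by the top five bits.
constexpr std::uint8_t TwoByteLead(std::uint16_t cp)
{
    return static_cast<std::uint8_t>(0xC0 | ((cp >> 6) & 0x1F));
}

// Continuation octet: 10 followed by the low six bits.
constexpr std::uint8_t TwoByteTrail(std::uint16_t cp)
{
    return static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
}

// Inverse: strip the prefixes and rejoin the 5 + 6 payload bits.
constexpr std::uint16_t FromTwoBytes(std::uint8_t lead, std::uint8_t trail)
{
    return static_cast<std::uint16_t>(((lead & 0x1F) << 6) | (trail & 0x3F));
}

// Runs the 0x678 example in both directions.
void CheckWorkedExample()
{
    assert(TwoByteLead(0x678) == 0xD9);  // 110 11001
    assert(TwoByteTrail(0x678) == 0xB8); // 10 111000
    assert(FromTwoBytes(0xD9, 0xB8) == 0x678);
}
```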
