Internet Emoticon NLP (2) | Special Emoticon Symbols + Emoji Recognition

Original post

2020-06-06 23:06

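This is a small research project, carried out with a perfectly straight face and perhaps a little dull.
The internet keeps producing new writing styles, such as danmaku (bullet comments), Xiaohongshu's product-recommendation ("zhongcao") posts, and usernames. These ultra-short texts have few character features to begin with, yet emoticons make up a large share of them, so the emoticons carry important information.
In past competitions, emoticons were usually treated as stop words and deleted outright. For ultra-short texts like these, that no longer works.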

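The code and data are in my GitHub repository: py-yanwenzi
Related articles:
Internet Emoticon NLP (1) | Kaomoji Entity Recognition, Attribute Detection, and New Kaomoji Discovery
Internet Emoticon NLP (2) | Special Emoticon Symbols + Emoji Recognition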


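Three kinds of symbols are involved: kaomoji, emoji, and special symbols.
Emoji and special symbols can both be split off cleanly by a tokenizer,
but kaomoji span many characters, so they swallow a lot of surrounding content during tokenization and are hard to segment.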

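The "Symbol Collection" (符號大全) website lists a large number of single-character special symbols.

The data folder of py-yanwenzi contains an xlsx file, pecial_symbols.xlsx, which collects and organizes some of them.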

1 Emoji recognition

GitHub: https://github.com/carpedm20/emoji

Installation:

$ pip install emoji

Usage:

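import emoji
emoji_str = "python is 👍"

# Convert emoji into their text aliases (roughly "decoding")
strs = emoji.demojize(emoji_str)
print(strs)

# Convert text aliases back into emoji (roughly "encoding")
emoji_str = emoji.emojize(strs)
print(emoji_str)

# Count the emoji in the string
print(emoji.emoji_count(emoji_str))

# List the emoji and where they occur
# (emoji_lis is the name used by older releases of the package;
#  newer releases provide emoji_list instead, as far as I can tell)
print(emoji.emoji_lis(emoji_str))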

Output:

python is :thumbs_up:
python is 👍
1
[{'location': 10, 'emoji': '👍'}]

2 Detection with regular expressions

This part mainly draws on EmojiHandle; thanks to its author.

2.1 Checking whether a character is an emoji

# Check whether a single character is an emoji
def isEmoji(content):
    # Only a single character can be classified; longer strings are rejected
    if not content or len(content) != 1:
        return False
    # Part of the Private Use Area
    if u"\uE000" <= content <= u"\uE900":
        return True
    # Supplementary-plane symbol blocks. This range already covers
    # U+1F600-U+1F64F, U+1F300-U+1F5FF, U+1F680-U+1F6FF and U+1F1E0-U+1F1FF,
    # so those ranges need no separate checks.
    if u"\U0001F000" <= content <= u"\U0001FA99":
        return True
    return False


content = "👍"
isEmoji(content)
# True
content = "python is 👍"
isEmoji(content)
# False

Note that this check works on a single character only.
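
Since the check handles one character at a time, a whole sentence has to be scanned character by character. Below is a minimal sketch of that idea, assuming the same two code-point ranges as above (which only approximate the full emoji set). The names `in_emoji_range` and `find_emoji` are hypothetical helpers, not part of py-yanwenzi.

```python
# Ranges copied from isEmoji; an approximation, not the complete Unicode emoji set.
EMOJI_RANGES = [(0xE000, 0xE900), (0x1F000, 0x1FA99)]

def in_emoji_range(ch):
    """Return True if a single character falls inside one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)

def find_emoji(text):
    """Return (index, character) pairs for every character inside the ranges."""
    return [(i, ch) for i, ch in enumerate(text) if in_emoji_range(ch)]

print(find_emoji("python is 👍"))  # [(10, '👍')]
```

The index is counted in code points, so it lines up with the location reported by the emoji package in section 1.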

2.2 Encoding mappings for special symbols

The data is available in my GitHub repository: py-yanwenzi

'''
Build the SoftBank and WeChat emoji mapping tables.
'''

from collections import defaultdict

# Substrings that mark an emoji data line in emoji-test.txt style files
MARKERS = ('fully-qualified     # ', '; non-fully-qualified # ')

def getReflactTbl(filename):
    # Two whitespace-separated columns per line: key, then value
    frequencies = defaultdict(int)
    with open(filename, 'r', encoding='utf-8-sig') as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:  # skip blank or incomplete lines
                frequencies[parts[0]] = parts[1]
    print(frequencies)
    return frequencies

def extractEmoji(line):
    # Return the token right after '# ' on a data line, or None for any other line
    if not any(marker in line for marker in MARKERS):
        return None
    startpos = line.find('# ') + 2
    endpos = line.find(' ', startpos + 1)
    return line[startpos:endpos]

def getStandordTbl(filename):
    # Standard table: emoji character -> upper-case hex code, e.g. '😁' -> '0001F601'
    frequency1 = defaultdict(int)
    with open(filename, 'r', encoding='utf-8-sig') as f:
        for line in f:
            emoji_char = extractEmoji(line)
            if emoji_char is not None:
                code = emoji_char.encode('unicode-escape').decode('utf-8')
                frequency1[emoji_char] = code.replace('\\U', '').upper()
    print(frequency1)
    return frequency1

def getWechatTbl(filename):
    # WeChat table: upper-case hex code -> emoji character
    frequency2 = defaultdict(int)
    with open(filename, 'r', encoding='utf-8-sig') as f:
        for line in f:
            emoji_char = extractEmoji(line)
            if emoji_char is not None:
                code = emoji_char.encode('unicode-escape').decode('utf-8')
                frequency2[code.replace('\\u', '').upper()] = emoji_char
    print(frequency2)
    return frequency2

# Forward slashes avoid accidental escape sequences such as '\e' in the paths
frequency = getReflactTbl('data/emoji.txt')
frequency1 = getStandordTbl('data/emoji-test.txt')
frequency2 = getWechatTbl('data/emoji-wechat.txt')

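The printed dictionaries give the mapping relationships (the illustration from the original post is not preserved here).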
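
For files laid out like Unicode's emoji-test.txt, the code points can also be read directly from the first column instead of being recovered from the emoji character. The sketch below assumes that layout and runs on a few in-memory sample lines rather than the repository data. `parse_emoji_test` and `SAMPLE` are hypothetical names used only for illustration.

```python
import io

# Sample lines in the layout of emoji-test.txt (older releases, no E-version field).
SAMPLE = (
    "# group: Smileys & Emotion\n"
    "\n"
    "1F600                                      ; fully-qualified     # 😀 grinning face\n"
    "263A FE0F                                  ; fully-qualified     # ☺️ smiling face\n"
)

def parse_emoji_test(lines):
    """Map each emoji string to its list of upper-case hex code points."""
    table = {}
    for line in lines:
        if line.startswith('#') or ';' not in line:
            continue  # skip comments and blank lines
        # The field before ';' holds the space-separated hex code points.
        codepoints = line.split(';', 1)[0].split()
        emoji_char = ''.join(chr(int(cp, 16)) for cp in codepoints)
        table[emoji_char] = codepoints
    return table

table = parse_emoji_test(io.StringIO(SAMPLE))
for emoji_char, codepoints in table.items():
    print(emoji_char, codepoints)
```

Reading the code-point column keeps multi-code-point sequences (such as a base character followed by U+FE0F) intact, which string slicing on the comment part can easily split.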

2.3 Emoji encoding

Character encoding is still quite a headache.
To print the Unicode value of a string:

u = "好"
u.encode('unicode-escape').decode('utf-8')
# '\\u597d'
u.encode('utf-8')
# b'\xe5\xa5\xbd'
u.encode('unicode-escape')
# b'\\u597d'
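
Emoji lie outside the Basic Multilingual Plane, so the same escape yields the eight-digit `\U` form instead of `\u`. The small sketch below shows this alongside plain `ord()`, and confirms that decoding the escape restores the character. The helper name `code_points` is hypothetical.

```python
def code_points(text):
    """Return the Unicode code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

escaped = "😁".encode("unicode-escape").decode("ascii")
print(escaped)               # \U0001f601
print(code_points("好😁"))   # ['U+597D', 'U+1F601']

# Decoding the escape sequence restores the original character.
restored = escaped.encode("ascii").decode("unicode-escape")
print(restored == "😁")      # True
```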

Identifying emoji

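import re

# Matches the escape sequences produced by .encode('unicode-escape'):
# \uXXXX (BMP) or \UXXXXXXXX (supplementary planes)
co = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def identifyEmoji(desstr):
    '''
    Identify emoji in a unicode-escaped string.
    Note: any escaped non-ASCII character (e.g. Chinese text) also matches.
    '''
    matches = co.findall(desstr)
    print(matches)
    return bool(matches)

print(u'\U00010000')
a = '😁'.encode('unicode-escape').decode('utf-8')
print(a)
print(identifyEmoji(a))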


Output:

𐀀
\U0001f601
['\\U0001f601']
True
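
Because every escaped non-ASCII character matches the pattern above, mixed Chinese and emoji text may be better served by a character-class regex applied to the raw string. This is a sketch under the assumption that the two ranges from section 2.1 are an acceptable approximation. `EMOJI_RE`, `tag_emoji`, and the `[EMOJI]` placeholder are hypothetical names.

```python
import re

# Character class built from the two ranges used by isEmoji (an approximation).
EMOJI_RE = re.compile("[\uE000-\uE900\U0001F000-\U0001FA99]")

def tag_emoji(text, placeholder="[EMOJI]"):
    """Replace each matching character with a placeholder instead of deleting it."""
    return EMOJI_RE.sub(placeholder, text)

print(EMOJI_RE.findall("好 python 👍😁"))  # ['👍', '😁']
print(tag_emoji("python is 👍"))           # python is [EMOJI]
```

Replacing emoji with a placeholder rather than dropping them keeps the signal available to downstream models, which matters for the ultra-short texts discussed at the start.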
