Ever since Google introduced BERT in 2018, I have been using BERT models as my training base.
This regularly requires annotated corpus data.
With a very large dataset, you go cross-eyed after tens of thousands of annotations, and finding mistakes quickly becomes a problem of its own.
So I set up three rules as basic checks; additions are welcome.
Prerequisites:
We have a file containing the full label set,
and a corpus that has already been annotated.
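As a concrete illustration of the two input files, here is a minimal sketch; the labels and characters are made up for this example, not taken from the original corpus:

```python
# Hypothetical contents of the two files (illustrative only).
# label.txt: one label per line, no trailing newline on the last line
label_txt = "O\nB-ORG\nI-ORG\nB-PER\nI-PER"
# corpus_2.txt: one "character + space + label" per line,
# sentences separated by a single blank line
corpus_txt = "谷 B-ORG\n歌 I-ORG\n發 O\n布 O\n\n他 O\n說 O\n"

labels = label_txt.split("\n")
print(labels)  # ['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']
```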
1. Label correctness:
First, we make sure the label after each character is not wrong because of a typo or a copy-paste mistake.
# Rule 1: verify correctness. Each line of the corpus must be
# "one character + one space + its label", and sentences are separated
# from each other by a single blank line (\n).
# No missing labels, no invalid labels.
# label.txt holds the label inventory; its last line has no trailing \n.
with open("label.txt", 'r', encoding='utf-8') as fd:
    content = fd.read()
label = content.split("\n")
print(label)

with open("corpus_2.txt", 'r', encoding='utf-8') as f4:
    content = f4.read()
word_vector = content.split("\n")
# word_vector = [i for i in word_vector if len(str(i)) != 0]
print(word_vector)

num = 0
for word in word_vector:
    num = num + 1
    if word != "":
        if word[0] == " ":  # a leading space means the character is missing
            print(num)
            print("wrong")
        chinese = word.split(" ")
        if chinese[1] not in label:  # label not in the inventory
            print(num)
            print(chinese[1])
            print("found a mislabelled tag")
            break
# print("label correctness check passed")
2. Rule 2: the BIO tagging principle (B-X, I-X, ...): no tagged entity may begin with an I-X tag, and within one entity span the type suffix must stay consistent.
# Rule 2: test the BIO tagging principle.
# My corpus has about 43k annotated lines.
num = 43042
logo = []
# Extract the label column (keep empty strings for blank lines)
for word in word_vector:
    if word != "":
        chinese = word.split(" ")
        logo.append(chinese[1])
    else:
        logo.append(word)

ok = True
for i in reversed(range(len(logo))):  # walk the labels from the end backwards
    num = num - 1
    temp = i
    lab = logo[temp]
    if len(lab) > 0 and lab[0] == "I":
        label_word = lab[2:]
        str_test = 'B-' + label_word    # the B- tag this span must start with
        str_test_2 = 'I-' + label_word  # the only other tag allowed inside it
        while logo[temp] != str_test:
            if logo[temp] != str_test_2:
                print(num)
                print("wrong")
                ok = False
                break
            temp = temp - 1
        if not ok:
            break
if ok:
    print("BIO rule check passed")
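The same check can also be written as a small forward pass that is easier to reuse. This is my own sketch, not the author's code, and `validate_bio` is a name I made up:

```python
def validate_bio(tags):
    """Return the indices of tags that violate the BIO scheme:
    an I-X tag must be preceded by B-X or I-X of the same type."""
    errors = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in ("B-" + entity, "I-" + entity):
                errors.append(i)
        prev = tag if tag else "O"  # a blank line resets the context
    return errors

print(validate_bio(["B-ORG", "I-ORG", "O", "B-PER"]))  # []
print(validate_bio(["I-ORG", "O", "B-PER", "I-LOC"]))  # [0, 3]
```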
3. Rule 3: after the checks pass, when generating train.txt, dev.txt, test.txt and so on, shuffle the corpus with whole sentences as the unit (sentences scraped from the web tend to cluster by domain).
# Rule 3: shuffle the corpus sentence by sentence, then split it into files.
import random

with open("corpus_2.txt", 'r', encoding='utf-8') as fa:
    content = fa.read()
sentence_vector = content.split("\n\n")
print(sentence_vector)
random.shuffle(sentence_vector)
print(sentence_vector)

with open("corpus_1.txt", 'w', encoding='utf-8') as fw:
    for sentence in sentence_vector:
        fw.write(sentence)
        fw.write("\n\n")

# After splitting by ratio, dev.txt, train.txt, etc. must each end with
# two \n, as the model's training input expects.
# roughly 43000 lines; about 8/10 of the sentences go to train
print(len(sentence_vector))
with open("train.txt", 'w', encoding='utf-8') as ftr:
    for ss in range(0, 1250):
        ftr.write(sentence_vector[ss])
        ftr.write("\n\n")
with open("dev.txt", 'w', encoding='utf-8') as fde:
    for ss in range(1250, 1850):
        fde.write(sentence_vector[ss])
        fde.write("\n\n")
with open("test.txt", 'w', encoding='utf-8') as fte:
    for ss in range(1850, len(sentence_vector)):
        fte.write(sentence_vector[ss])
        fte.write("\n\n")
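Instead of hard-coding the sentence indices (1250 and 1850 above), the split points can be computed from the sentence count. A minimal sketch, assuming an 8:1:1 train/dev/test ratio and using stand-in data in place of the real shuffled sentence list:

```python
# Compute an 8:1:1 train/dev/test split from the sentence count.
# sentence_vector here is stand-in data, not the real corpus.
sentence_vector = ["字 O"] * 2000

n = len(sentence_vector)
train_end = int(n * 0.8)
dev_end = int(n * 0.9)

splits = {
    "train.txt": sentence_vector[:train_end],
    "dev.txt": sentence_vector[train_end:dev_end],
    "test.txt": sentence_vector[dev_end:],
}
for name, sentences in splits.items():
    with open(name, 'w', encoding='utf-8') as fw:
        for sentence in sentences:
            fw.write(sentence)
            fw.write("\n\n")  # each file ends with two \n, as required
```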
Here I ran into a problem: the shuffled corpus_1.txt differed in length from the original
corpus_2.txt. I checked the labels and the entity character counts one after another and found nothing, until I finally
discovered the cause: I was splitting on "\n\n", but at the time the last sentence of corpus_2.txt ended with a single \n. That stray \n stays attached to one sentence, so when the new file is generated it gains an extra line. Before splitting, always consider how the last sentence ends.
Take the example below:
a = "a\n\nb\n\nO"
b = a.split("\n\n")
random.shuffle(b)
print(b)
a = "a\n\nb\n\nO\n"
b = a.split("\n\n")
random.shuffle(b)
print(b)
a = "a\n\nb\n\nO\n\n"
b = a.split("\n\n")
random.shuffle(b)
print(b)
Result:
['a', 'b', 'O']
['a', 'O\n', 'b']
['O', 'a', '', 'b']
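Given this pitfall, a simple guard (my own sketch) is to strip trailing newlines from the raw text before splitting, so the last sentence never carries a stray \n or produces an empty element:

```python
raw = "a\n\nb\n\nO\n"  # last sentence ends with a single stray \n
sentences = raw.strip("\n").split("\n\n")
print(sentences)  # ['a', 'b', 'O'] — clean, regardless of the trailing newlines
```

The same one-liner also handles the "\n\n"-terminated case, since strip("\n") removes any run of trailing newlines.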
These are all the quick checks I have thought of so far; if I have missed anything, feel free to point it out and share.