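As the title says, this feature may look superfluous at first. By default, letters and digits that appear next to each other are not split apart: they stay joined as a single token during segmentation, which suits ordinary search needs. For example, given the input:
String txt = "IBM12二次修改123";
the segmentation result is: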
i bm |123 | 二 | 次 | 修 | 改
Now there is a new requirement: letters and digits must each be split into single-character tokens, so that the result becomes:
i | b | m | 1 | 2 | 3 | 二 | 次 | 修 | 改
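This is similar to a two-sided wildcard query in a database (LIKE with a percent sign on each side). The original, unmodified mmseg4j cannot do this. After reading part of the mmseg4j source code, I found that a very small change is enough to meet the requirement (efficiency is not considered for now).
- Before the change, searching by a whole word returns results, but searching by individual letters returns nothing, as the screenshots below show:
Exact whole-word search:
Fuzzy search by letters:
- Modify the next() method in MMSeg.java in the mmseg4j source code. The change really just disables a few pieces of code and is very simple: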
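To make the requirement concrete, here is a minimal standalone sketch of the two behaviours. It does not use mmseg4j: the class `SingleCharTokens` and its `tokenize` method are hypothetical names introduced only for illustration, and every CJK character is emitted as its own token, whereas mmseg4j would segment Chinese text with its dictionary. The sample string `VH049PA` is borrowed from a comment in the mmseg4j source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified stand-in, not the mmseg4j implementation.
class SingleCharTokens {

    /**
     * Tokenizes text. When splitRuns is false, a run of ASCII letters and digits
     * becomes one lowercased token; when true, every letter or digit becomes its
     * own token. Other letters (e.g. CJK) are always emitted one by one, and
     * anything that is not a letter or digit is dropped.
     */
    static List<String> tokenize(String text, boolean splitRuns) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean asciiAlnum = cp < 128 && Character.isLetterOrDigit(cp);
            if (asciiAlnum && !splitRuns) {
                // Grouped mode: keep extending the current run.
                run.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // Any pending run ends here.
            if (run.length() > 0) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            if (Character.isLetterOrDigit(cp)) {
                tokens.add(new String(Character.toChars(Character.toLowerCase(cp))));
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("VH049PA", false)); // prints [vh049pa]
        System.out.println(tokenize("VH049PA", true));  // prints [v, h, 0, 4, 9, p, a]
    }
}
```

With the grouped tokens, a query for `h04` has nothing to match against; with single-character tokens, it can match the consecutive tokens `h`, `0`, `4`.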
public Word next() throws IOException {
    // Take a word from the buffer first
    Word word = bufWord.poll();
    if(word == null) {
        bufSentence.setLength(0);
        int data = -1;
        boolean read = true;
        // while(read && (data=readNext()) != -1) {
        while((data=readNext()) != -1) {
            read = false; // by default one pass reads a run of same-type characters, which is enough to segment
            int type = Character.getType(data);
            String wordType = Word.TYPE_WORD;
            switch(type) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
                /*
                 * 1. 0x410-0x44f -> А-я // Russian
                 * 2. 0x391-0x3a9 -> Α-Ω // Greek uppercase
                 * 3. 0x3b1-0x3c9 -> α-ω // Greek lowercase
                 */
                data = toAscii(data);
                NationLetter nl = getNation(data);
                if(nl == NationLetter.UNKNOW) {
                    read = true;
                    break;
                }
                wordType = Word.TYPE_LETTER;
                bufSentence.appendCodePoint(data);
                switch(nl) {
                case EN:
                    // Disabled so that each English letter becomes its own token.
                    // digits following letters, e.g. VH049PA
                    // ReadCharByAsciiOrDigit rcad = new ReadCharByAsciiOrDigit();
                    // readChars(bufSentence, rcad);
                    // if(rcad.hasDigit()) {
                    //     wordType = Word.TYPE_LETTER_OR_DIGIT;
                    // }
                    //only english
                    //readChars(bufSentence, new ReadCharByAscii());
                    break;
                case RA:
                    readChars(bufSentence, new ReadCharByRussia());
                    break;
                case GE:
                    readChars(bufSentence, new ReadCharByGreece());
                    break;
                }
                bufWord.add(createWord(bufSentence, wordType));
                bufSentence.setLength(0);
                break;
            case Character.OTHER_LETTER:
                /*
                 * 1. 0x3041-0x30f6 -> ぁ-ヶ // Japanese hiragana and katakana
                 * 2. 0x3105-0x3129 -> ㄅ-ㄩ // Bopomofo (Zhuyin) symbols
                 */
                bufSentence.appendCodePoint(data);
                readChars(bufSentence, new ReadCharByType(Character.OTHER_LETTER));
                currentSentence = createSentence(bufSentence);
                bufSentence.setLength(0);
                break;
            case Character.DECIMAL_DIGIT_NUMBER:
                bufSentence.appendCodePoint(toAscii(data));
                // Disabled so that each digit becomes its own token.
                // readChars(bufSentence, new ReadCharDigit()); // read the digits that follow, AsciiLetterOr
                wordType = Word.TYPE_DIGIT;
                int d = readNext();
                if(d > -1) {
                    if(seg.isUnit(d)) { // a unit, e.g. of time
                        bufWord.add(createWord(bufSentence, startIdx(bufSentence)-1, Word.TYPE_DIGIT)); // add the number first, on its own
                        bufSentence.setLength(0);
                        bufSentence.appendCodePoint(d);
                        wordType = Word.TYPE_WORD; // a unit is a word
                    } else { // may be followed by letters and digits
                        pushBack(d);
                        // if(readChars(bufSentence, new ReadCharByAsciiOrDigit()) > 0) { // any letters or digits that follow are joined together
                        //     wordType = Word.TYPE_DIGIT_OR_LETTER;
                        // }
                    }
                }
                bufWord.add(createWord(bufSentence, wordType));
                bufSentence.setLength(0); // clear the buffered characters
                break;
            case Character.LETTER_NUMBER:
                // ⅠⅡⅢ are split one by one
                bufSentence.appendCodePoint(data);
                readChars(bufSentence, new ReadCharByType(Character.LETTER_NUMBER));
                int startIdx = startIdx(bufSentence);
                for(int i=0; i<bufSentence.length(); i++) {
                    bufWord.add(new Word(new char[] {bufSentence.charAt(i)}, startIdx++, Word.TYPE_LETTER_NUMBER));
                }
                bufSentence.setLength(0); // clear the buffered characters
                break;
            case Character.OTHER_NUMBER:
                // ①⑩㈠㈩⒈⒑⒒⒛⑴⑽⑾⒇ are kept joined together
                bufSentence.appendCodePoint(data);
                readChars(bufSentence, new ReadCharByType(Character.OTHER_NUMBER));
                bufWord.add(createWord(bufSentence, Word.TYPE_OTHER_NUMBER));
                bufSentence.setLength(0);
                break;
            default :
                // anything else is treated as an invalid character
                read = true;
            }//switch
        }
        // Chinese word segmentation
        if(currentSentence != null) {
            do {
                Chunk chunk = seg.seg(currentSentence);
                for(int i=0; i<chunk.getCount(); i++) {
                    bufWord.add(chunk.getWords()[i]);
                }
            } while (!currentSentence.isFinish());
            currentSentence = null;
        }
        word = bufWord.poll();
    }
    return word;
}
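The change mostly comments out code, so that consecutive letters and digits are no longer read as a single run.
- Searching by letters again now gives the following result:
In short, this small change provides the equivalent of a database LIKE query with a percent sign on both sides.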
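The reason single-character tokens behave like a two-sided LIKE can be sketched without any search library. The sketch below assumes the query is analyzed the same way as the document and matched as a run of consecutive tokens (roughly what a phrase query does); the article does not state how the query is actually executed, so treat that as an assumption. `SubstringByTokens`, `chars` and `containsPhrase` are hypothetical names, and the tokenizer is a simplified stand-in that drops everything except letters and digits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: emulates LIKE '%query%' with single-character tokens.
class SubstringByTokens {

    /** Splits text into lowercased single-character tokens, keeping only letters and digits. */
    static List<String> chars(String text) {
        List<String> out = new ArrayList<>();
        text.toLowerCase(Locale.ROOT).codePoints()
            .filter(Character::isLetterOrDigit)
            .forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    /** True when the query's tokens occur consecutively, in order, among the document's tokens. */
    static boolean containsPhrase(String doc, String query) {
        List<String> q = chars(query);
        return !q.isEmpty() && Collections.indexOfSubList(chars(doc), q) >= 0;
    }

    public static void main(String[] args) {
        String doc = "IBM12二次修改123";
        System.out.println(containsPhrase(doc, "bm"));   // prints true
        System.out.println(containsPhrase(doc, "bm1"));  // prints true
        System.out.println(containsPhrase(doc, "mb"));   // prints false
        System.out.println(containsPhrase(doc, "二次")); // prints true
    }
}
```

Because punctuation and whitespace are dropped before matching, this is only an approximation of LIKE: a query for `ab` would also match a document containing `a-b`.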