Supporting single-letter, single-digit, and mixed searches with mmseg4j in Solr

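As the title says. At first glance this feature may look redundant: when letters and digits are adjacent, mmseg4j does not split them apart but keeps them together as one token, which is fine for ordinary search needs. For example, given the input:

String txt = "IBM12二次修改123";

the segmentation result is: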

i bm |123 | 二 | 次 | 修 | 改

Now there is a new requirement: every letter and every digit must become its own token, so that the segmentation result is:

i | b | m | 1 | 2 | 3 | 二 | 次 | 修 | 改

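The effect is similar to a database query that uses LIKE with a percent sign on both sides. The stock version of mmseg4j cannot do this. After reading part of the mmseg4j source code, I found that a very small change is enough to meet the requirement (ignoring efficiency for now).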
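To make the target behaviour concrete, here is a minimal standalone sketch of the desired tokenization. It is not mmseg4j code: the class `SingleCharTokenizerSketch` and its `tokenize` helper are hypothetical names, and it simply emits every letter or digit as its own lowercased token, in input order, without any dictionary lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleCharTokenizerSketch {

    // Hypothetical helper (not part of mmseg4j): one token per letter or digit code point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Skip punctuation, whitespace and other non-alphanumeric characters.
            if (Character.isLetterOrDigit(cp)) {
                int lower = Character.toLowerCase(cp);
                tokens.add(new String(Character.toChars(lower)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: i | b | m | 1 | 2 | 二 | 次 | 修 | 改 | 1 | 2 | 3
        System.out.println(String.join(" | ", tokenize("IBM12二次修改123")));
    }
}
```

The real change described in this post works inside mmseg4j instead, so that Chinese text still goes through its dictionary-based segmentation.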

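Before the change, searching by a whole word returns results, but searching by a single letter returns nothing, as the screenshots below show:

Exact whole-word search:

Fuzzy search by letter: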

Modify part of the next() method in MMSeg.java in the mmseg4j source. The change only disables (comments out) some code, so it is very simple:

public Word next() throws IOException {
		// take a word from the buffer first
		Word word = bufWord.poll();
		if(word == null) {
			bufSentence.setLength(0);

			int data = -1;
			boolean read = true;
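//			while(read && (data=readNext()) != -1) {	// original loop condition
			while((data=readNext()) != -1) {	// modified: keep reading until the input is exhausted
				read = false;	// original assumption: one pass reads a run of same-type characters, which can then be segmented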
				int type = Character.getType(data);
				String wordType = Word.TYPE_WORD;
				switch(type) {
				case Character.UPPERCASE_LETTER:
				case Character.LOWERCASE_LETTER:
				case Character.TITLECASE_LETTER:
				case Character.MODIFIER_LETTER:
					/*
					 * 1. 0x410-0x44f -> А-я	// Russian
					 * 2. 0x391-0x3a9 -> Α-Ω	// Greek uppercase
					 * 3. 0x3b1-0x3c9 -> α-ω	// Greek lowercase
					 */
					data = toAscii(data);
					NationLetter nl = getNation(data);
					if(nl == NationLetter.UNKNOW) {
						read = true;
						break;
					}
					wordType = Word.TYPE_LETTER;
					bufSentence.appendCodePoint(data);
					switch(nl) {
					case EN:
						// digits following letters, e.g. VH049PA (disabled so that letters and digits are not joined)
//						ReadCharByAsciiOrDigit rcad = new ReadCharByAsciiOrDigit();
//						readChars(bufSentence, rcad);
//						if(rcad.hasDigit()) {
//							wordType = Word.TYPE_LETTER_OR_DIGIT;
//						}
						//only english
						//readChars(bufSentence, new ReadCharByAscii());
						break;
					case RA:
						readChars(bufSentence, new ReadCharByRussia());
						break;
					case GE:
						readChars(bufSentence, new ReadCharByGreece());
						break;
					}
					bufWord.add(createWord(bufSentence, wordType));

					bufSentence.setLength(0);

					break;
				case Character.OTHER_LETTER:
					/*
					 * 1. 0x3041-0x30f6 -> ぁ-ヶ	// Japanese hiragana and katakana
					 * 2. 0x3105-0x3129 -> ㄅ-ㄩ	// Bopomofo (Zhuyin) symbols
					 */
					bufSentence.appendCodePoint(data);
					readChars(bufSentence, new ReadCharByType(Character.OTHER_LETTER));

					currentSentence = createSentence(bufSentence);

					bufSentence.setLength(0);

					break;
				case Character.DECIMAL_DIGIT_NUMBER:
					bufSentence.appendCodePoint(toAscii(data));
//					readChars(bufSentence, new ReadCharDigit());	// read the following digits, AsciiLetterOr
					wordType = Word.TYPE_DIGIT;
					int d = readNext();
					if(d > -1) {
						if(seg.isUnit(d)) {	// a unit, e.g. a unit of time
							bufWord.add(createWord(bufSentence, startIdx(bufSentence)-1, Word.TYPE_DIGIT));	// add the number first (as a standalone word)

							bufSentence.setLength(0);

							bufSentence.appendCodePoint(d);
							wordType = Word.TYPE_WORD;	// the unit is a word
						} else {	// may be followed by letters and digits
							pushBack(d);
//							if(readChars(bufSentence, new ReadCharByAsciiOrDigit()) > 0) {	// any following letters or digits would be joined together
//								wordType = Word.TYPE_DIGIT_OR_LETTER;
//							}
						}
					}

					bufWord.add(createWord(bufSentence, wordType));


					bufSentence.setLength(0);	// clear the buffered characters

					break;
				case Character.LETTER_NUMBER:
					// ⅠⅡⅢ: split into single characters
					bufSentence.appendCodePoint(data);
					readChars(bufSentence, new ReadCharByType(Character.LETTER_NUMBER));

					int startIdx = startIdx(bufSentence);
					for(int i=0; i<bufSentence.length(); i++) {
						bufWord.add(new Word(new char[] {bufSentence.charAt(i)}, startIdx++, Word.TYPE_LETTER_NUMBER));
					}

					bufSentence.setLength(0);	// clear the buffered characters

					break;
				case Character.OTHER_NUMBER:
					// ①⑩㈠㈩⒈⒑⒒⒛⑴⑽⑾⒇: kept together
					bufSentence.appendCodePoint(data);
					readChars(bufSentence, new ReadCharByType(Character.OTHER_NUMBER));

					bufWord.add(createWord(bufSentence, Word.TYPE_OTHER_NUMBER));
					bufSentence.setLength(0);
					break;
				default :
					// anything else is treated as an invalid character
					read = true;
				}//switch
			}
				
			// Chinese word segmentation
			if(currentSentence != null) {
				do {
					Chunk chunk = seg.seg(currentSentence);
					for(int i=0; i<chunk.getCount(); i++) {
						bufWord.add(chunk.getWords()[i]);
					}
				} while (!currentSentence.isFinish());
				
				currentSentence = null;
			}
			
			word = bufWord.poll();
		}
		
		return word;
	}

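The change mainly comments out some code, so that runs of letters and digits are no longer consumed together as a single token.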
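The switch in next() dispatches on `Character.getType`, so it may help to see which branch a given character falls into. The following standalone sketch mirrors only that classification step; the class `CharTypeSketch` and the branch labels it returns are made up for illustration and do not exist in mmseg4j.

```java
public class CharTypeSketch {

    // Hypothetical labels for the branches of the switch in next().
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
                return "letter";        // English, Russian, Greek letters
            case Character.OTHER_LETTER:
                return "other-letter";  // CJK ideographs, kana, Bopomofo
            case Character.DECIMAL_DIGIT_NUMBER:
                return "digit";         // 0-9 and full-width digits
            case Character.LETTER_NUMBER:
                return "letter-number"; // Roman numerals such as Ⅱ
            case Character.OTHER_NUMBER:
                return "other-number";  // circled numbers such as ①
            default:
                return "ignored";       // punctuation, whitespace, and so on
        }
    }

    public static void main(String[] args) {
        // Print the branch chosen for a few sample characters.
        "Ib1二Ⅱ①,".codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " -> " + category(cp)));
    }
}
```

After the modification, the "letter" and "digit" branches each emit a word right after a single character instead of reading ahead for more characters of the same kind.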

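Searching by a letter again now gives the following result:

In summary, this small change achieves matching similar to a database LIKE with percent signs on both sides.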
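As a rough illustration of why single-character tokens behave like a two-sided LIKE, the sketch below checks whether the query tokens occur consecutively in the indexed tokens. This is only an analogy under stated assumptions: `LikeMatchSketch` and `containsPhrase` are hypothetical names, the "coarse" token list is an invented example of a tokenization that keeps letters and digits together, and whether Solr actually treats a multi-token query as a phrase depends on the field type and query parser configuration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LikeMatchSketch {

    // Hypothetical helper: true if the query tokens appear consecutively in the indexed tokens.
    static boolean containsPhrase(List<String> indexed, List<String> query) {
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        // One token per letter or digit, as produced after the modification.
        List<String> perChar = Arrays.asList(
            "i", "b", "m", "1", "2", "二", "次", "修", "改", "1", "2", "3");
        // Invented coarse tokenization that keeps letters and digits together.
        List<String> coarse = Arrays.asList("ibm12", "二", "次", "修", "改", "123");

        List<String> query = Arrays.asList("b", "m", "1"); // user typed "bm1"

        System.out.println(containsPhrase(perChar, query));                  // true
        System.out.println(containsPhrase(coarse, query));                   // false
        System.out.println(containsPhrase(perChar, Arrays.asList("b", "2"))); // false
    }
}
```

The trade-off is a larger index and more tokens per query, which is the efficiency cost set aside earlier in this post.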

alen1985
