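1. DFA algorithm
For the theory behind the DFA algorithm, see here. In short, a Map is used to build a tree of sensitive words, in which every path from the root node to a leaf node spells out one sensitive word. A word that is a prefix of a longer word ends at an inner node, which is why every node carries an "isEnd" marker. For example, see the figure below:
A simple implementation looks like this: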
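import java.io.*;
import java.util.*;
// SLF4J is assumed here, based on the LoggerFactory.getLogger call
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Sensitive-word tree (root node); keys are Characters plus the "isEnd" marker
    private static HashMap sensitiveWordMap = null;
    // Encoding of the word file
    private static final String ENCODING = "gbk";
    // Word file, loaded from the classpath (null if the resource is missing)
    private static final InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream("sensitive/keyWords.txt");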
    /**
     * Initializes the sensitive-word tree.
     */
    private static void init() {
        // Read the word file
        Set<String> keyWords = readSensitiveWords();
        // Build the tree, one word at a time
        sensitiveWordMap = new HashMap<>(keyWords.size());
        for (String keyWord : keyWords) {
            createKeyWord(keyWord);
        }
    }
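    /**
     * Inserts one word into the sensitive-word tree.
     *
     * @param keyWord the sensitive word to insert
     */
    private static void createKeyWord(String keyWord) {
        if (sensitiveWordMap == null) {
            LOG.error("sensitiveWordMap has not been initialized!");
            return;
        }
        // Start at the root and move down one level per character
        Map nowMap = sensitiveWordMap;
        for (Character c : keyWord.toCharArray()) {
            Object obj = nowMap.get(c);
            if (obj == null) {
                // No child for this character yet: create one, not (yet) a word end
                Map<Object, Object> childMap = new HashMap<>();
                childMap.put("isEnd", "false");
                nowMap.put(c, childMap);
                nowMap = childMap;
            } else {
                nowMap = (Map) obj;
            }
        }
        // The node reached by the last character marks the end of a word
        nowMap.put("isEnd", "true");
    }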
    /**
     * Reads the sensitive-word file, one word per line.
     *
     * @return the set of sensitive words (empty if the file cannot be read)
     */
    private static Set<String> readSensitiveWords() {
        Set<String> keyWords = new HashSet<>();
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing
            LOG.error("Sensitive-word file not found!");
            return keyWords;
        }
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                // Skip blank lines so the root is never marked as a word end
                if (!word.isEmpty()) {
                    keyWords.add(word);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Unsupported encoding for the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
        return keyWords;
    }
    /**
     * Finds the sensitive words contained in a text.
     *
     * @param text the text to scan
     * @return every match, ordered by starting position
     */
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordMap == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // Try to start a match at position i, walking down from the root
            Map nowMap = sensitiveWordMap;
            for (int j = i; j < text.length(); j++) {
                Object obj = nowMap.get(text.charAt(j));
                if (obj == null) {
                    break;
                }
                nowMap = (Map) obj;
                // Check after consuming the character, so that a word ending
                // on the last character of the text is reported as well
                if ("true".equals(nowMap.get("isEnd"))) {
                    sensitiveWords.add(text.substring(i, j + 1));
                }
            }
        }
        return sensitiveWords;
    }
}
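To make the tree structure described at the start of this section more concrete, here is a minimal, self-contained sketch that swaps the raw Map and its "isEnd" string entry for a small node class. The names TrieSketch, Node, add and findAll are illustrative assumptions and not part of the utility above, and the word list is hard-coded instead of being read from a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a typed version of the Map-based sensitive-word tree.
class TrieSketch {
    // One node of the tree: children keyed by character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    // Inserts one word; each character moves one level down the tree.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    // Returns every stored word that occurs in the text, ordered by start position.
    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no word continues with this character
                }
                if (node.isEnd) {
                    found.add(text.substring(i, j + 1));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.add("bad");
        trie.add("badge");
        trie.add("ugly");
        // prints [bad, badge, ugly]
        System.out.println(trie.findAll("a badge, ugly"));
    }
}
```

Because "bad" is a prefix of "badge", its end flag sits on an inner node, and both words are reported for the same starting position.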
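2. TTMP algorithm
The TTMP algorithm was devised by a member of the online community; for its origin, see here. The idea is to break the sensitive words down into a set of "dirty characters". A substring is looked up in the sensitive-word set only when it consists entirely of dirty characters, which reduces the number of comparisons. A simple implementation looks like this: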
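import java.io.*;
import java.util.*;
// SLF4J is assumed here, based on the LoggerFactory.getLogger call
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Encoding of the word file
    private static final String ENCODING = "gbk";
    // Word file, loaded from the classpath (null if the resource is missing)
    private static final InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream("sensitive/keyWords.txt");
    // Dirty characters: every character that occurs in some sensitive word
    private static Set<Character> sensitiveCharSet = null;
    // Sensitive words
    private static Set<String> sensitiveWordSet = null;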
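    /**
     * Initializes the sensitive-word set and the dirty-character set.
     */
    private static void init() {
        // Create the containers
        sensitiveCharSet = new HashSet<>();
        sensitiveWordSet = new HashSet<>();
        // Read the word file and fill both sets
        readSensitiveWords();
    }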
    /**
     * Reads the local sensitive-word file (one word per line) and fills
     * both the word set and the dirty-character set.
     */
    private static void readSensitiveWords() {
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing
            LOG.error("Sensitive-word file not found!");
            return;
        }
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty()) {
                    continue; // skip blank lines
                }
                sensitiveWordSet.add(word);
                for (Character c : word.toCharArray()) {
                    sensitiveCharSet.add(c);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Unsupported encoding for the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
    }
    /**
     * Finds the sensitive words contained in a text.
     *
     * @param text the text to scan
     * @return every match, ordered by starting position
     */
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordSet == null || sensitiveCharSet == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // Extend the candidate from i only while every character is dirty
            for (int j = i; j < text.length(); j++) {
                // Test the character at j (not the one at i) before extending
                if (!sensitiveCharSet.contains(text.charAt(j))) {
                    break;
                }
                String key = text.substring(i, j + 1);
                if (sensitiveWordSet.contains(key)) {
                    sensitiveWords.add(key);
                }
            }
        }
        return sensitiveWords;
    }
}
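The dirty-character pre-check can also be tried out without a word file. The following is a minimal sketch under stated assumptions: the class name DirtyCharSketch, its constructor and findAll are hypothetical, and the word list is passed in memory rather than loaded from the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the dirty-character pre-check with an in-memory word list.
class DirtyCharSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<Character> dirtyChars = new HashSet<>();

    // Stores each word and records all of its characters as dirty characters.
    DirtyCharSketch(List<String> wordList) {
        for (String word : wordList) {
            words.add(word);
            for (char c : word.toCharArray()) {
                dirtyChars.add(c);
            }
        }
    }

    // Looks up a substring only while it consists entirely of dirty characters.
    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i; j < text.length(); j++) {
                if (!dirtyChars.contains(text.charAt(j))) {
                    break; // a clean character ends the candidate run
                }
                String key = text.substring(i, j + 1);
                if (words.contains(key)) {
                    found.add(key);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        DirtyCharSketch filter = new DirtyCharSketch(Arrays.asList("bad", "ugly"));
        // prints [bad, ugly]
        System.out.println(filter.findAll("a bad, ugly day"));
    }
}
```

Note that "day" in the sample text consists only of dirty characters (d, a, y), so it is still looked up in the word set and rejected there. The pre-check only saves work for runs that contain at least one clean character.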
Note: the code above is only meant to illustrate the idea; for real-world use there is still plenty of room for optimization.
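One such optimization concerns initialization. Both listings call init() lazily from checkSensitiveWord without any synchronization, so concurrent callers could build the word library more than once or observe it half-built. A possible remedy, sketched here with the initialization-on-demand holder idiom, lets the JVM perform the loading exactly once. The class LazyWordList and its hard-coded words are assumptions for illustration only; load() stands in for reading the real word file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: thread-safe lazy loading via the holder idiom.
class LazyWordList {
    private LazyWordList() {
    }

    // The JVM initializes Holder once, on first access, under its class-initialization lock.
    private static final class Holder {
        static final Set<String> WORDS = load();
    }

    // Stand-in for reading the word file; the words here are made up.
    private static Set<String> load() {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList("bad", "ugly")));
    }

    // Triggers loading on the first call; later calls return the same set.
    static Set<String> words() {
        return Holder.WORDS;
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(words().contains("bad"));
    }
}
```

Exposing the set as unmodifiable also keeps callers from changing the word library after it has been published.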