[Java Web] Sensitive-Word Filtering Algorithms

1. The DFA algorithm

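For the principle behind the DFA algorithm, see here. In short, it uses nested Maps to build a tree of sensitive words: each node maps a character to a child node, and each path from the root to a node flagged as a word end spells one sensitive word. A word usually ends at a leaf, but it can also end at an inner node when it is a prefix of a longer word, which is why every node carries an "isEnd" flag. For example, see the figure below: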

A simple implementation follows:

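import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.slf4j.Logger; // assumption: SLF4J; the original does not name its logging library
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
  // Logger
  private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
  // Key under which each node stores its end-of-word flag
  private static final String IS_END = "isEnd";
  // Root of the sensitive-word tree: Character keys lead to child maps, IS_END holds a Boolean
  private static volatile Map<Object, Object> sensitiveWordMap = null;
  // Encoding of the word-list file
  private static final String ENCODING = "gbk";
  // Classpath location of the word-list file (one word per line)
  private static final String WORD_FILE = "sensitive/keyWords.txt";

  /**
   * Initializes the sensitive-word tree. The tree is built in a local variable and
   * published only when complete, so concurrent callers never see a half-built tree.
   */
  private static synchronized void init() {
    if (sensitiveWordMap != null) {
      return;
    }
    // Read the file
    Set<String> keyWords = readSensitiveWords();
    // Build the tree
    Map<Object, Object> root = new HashMap<>();
    for (String keyWord : keyWords) {
      createKeyWord(root, keyWord);
    }
    sensitiveWordMap = root;
  }

  /**
   * Adds one sensitive word to the tree.
   *
   * @param root    root node of the tree
   * @param keyWord the word to add
   */
  @SuppressWarnings("unchecked")
  private static void createKeyWord(Map<Object, Object> root, String keyWord) {
    Map<Object, Object> nowMap = root;
    for (char c : keyWord.toCharArray()) {
      Object obj = nowMap.get(c);
      if (obj == null) {
        // No child for this character yet: create one, not (yet) a word end
        Map<Object, Object> childMap = new HashMap<>();
        childMap.put(IS_END, Boolean.FALSE);
        nowMap.put(c, childMap);
        nowMap = childMap;
      } else {
        nowMap = (Map<Object, Object>) obj;
      }
    }
    // The node reached by the last character ends a word
    nowMap.put(IS_END, Boolean.TRUE);
  }

  /**
   * Reads the sensitive-word file.
   *
   * @return the set of words; empty if the file is missing or cannot be read
   */
  private static Set<String> readSensitiveWords() {
    Set<String> keyWords = new HashSet<>();
    InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(WORD_FILE);
    if (in == null) {
      // getResourceAsStream returns null for a missing resource instead of throwing
      LOG.error("Sensitive-word file not found: {}", WORD_FILE);
      return keyWords;
    }
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty()) { // skip blank lines
          keyWords.add(word);
        }
      }
    } catch (IOException e) {
      // Also covers UnsupportedEncodingException, which is a subclass of IOException
      LOG.error("Failed to read the sensitive-word file", e);
    }
    return keyWords;
  }

  /**
   * Finds the sensitive words contained in a text.
   *
   * @param text the text to scan
   * @return every match, ordered by start position; overlapping matches are included
   */
  @SuppressWarnings("unchecked")
  public static List<String> checkSensitiveWord(String text) {
    Map<Object, Object> root = sensitiveWordMap;
    if (root == null) {
      init();
      root = sensitiveWordMap;
    }
    List<String> sensitiveWords = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      Map<Object, Object> nowMap = root;
      // Walk down the tree from position i for as long as the characters match
      for (int j = i; j < text.length(); j++) {
        Object obj = nowMap.get(text.charAt(j));
        if (obj == null) {
          break;
        }
        nowMap = (Map<Object, Object>) obj;
        // Test the flag right after stepping into the node; testing it before the next
        // lookup, as the earlier loop did, misses a word that ends at the end of the text
        if (Boolean.TRUE.equals(nowMap.get(IS_END))) {
          sensitiveWords.add(text.substring(i, j + 1));
        }
      }
    }
    return sensitiveWords;
  }
}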
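To try the tree-matching idea without a word file or a logging library, the following is a minimal sketch that swaps the nested Maps for a small typed node class. The names TrieFilterSketch, Node, build and findAll are hypothetical, and the word list is hard-coded purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: all names here are hypothetical, not part of the code above.
final class TrieFilterSketch {
  /** One tree node: children keyed by character, plus an end-of-word flag. */
  static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean isEnd;
  }

  /** Builds the tree from the given words. */
  static Node build(List<String> words) {
    Node root = new Node();
    for (String word : words) {
      Node node = root;
      for (char c : word.toCharArray()) {
        // Reuse the child for this character, or create it on first sight
        node = node.children.computeIfAbsent(c, k -> new Node());
      }
      node.isEnd = true;
    }
    return root;
  }

  /** Returns every word in the tree that occurs in text, trying each start position. */
  static List<String> findAll(Node root, String text) {
    List<String> found = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      Node node = root;
      for (int j = i; j < text.length(); j++) {
        node = node.children.get(text.charAt(j));
        if (node == null) {
          break; // no word continues with this character
        }
        if (node.isEnd) {
          found.add(text.substring(i, j + 1));
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    Node root = build(Arrays.asList("bad", "badge", "ad"));
    System.out.println(findAll(root, "a badge")); // prints [bad, badge, ad]
  }
}
```

The example shows why the flag lives on every node: "bad" ends at an inner node on the path to "badge", and both are reported. Typed nodes also remove the unchecked casts that the Map-of-Maps layout needs.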

2. The TTMP algorithm

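The TTMP algorithm was devised by a member of the online community; for its origin, see here. Its idea is to break the sensitive words down into a set of "dirty characters", and to test whether a substring is a sensitive word only when that substring consists entirely of dirty characters, which reduces the number of comparisons. A simple implementation follows: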

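import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.slf4j.Logger; // assumption: SLF4J; the original does not name its logging library
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
  // Logger
  private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
  // Encoding of the word-list file
  private static final String ENCODING = "gbk";
  // Classpath location of the word-list file (one word per line)
  private static final String WORD_FILE = "sensitive/keyWords.txt";
  // Dirty characters: every character that occurs in some sensitive word
  private static volatile Set<Character> sensitiveCharSet = null;
  // Sensitive words
  private static volatile Set<String> sensitiveWordSet = null;

  /**
   * Initializes the word set and the dirty-character set. Both are built locally
   * and published only when complete.
   */
  private static synchronized void init() {
    if (sensitiveWordSet != null && sensitiveCharSet != null) {
      return;
    }
    // Read the file, then derive the dirty characters from the words
    Set<String> words = readSensitiveWords();
    Set<Character> chars = new HashSet<>();
    for (String word : words) {
      for (char c : word.toCharArray()) {
        chars.add(c);
      }
    }
    sensitiveCharSet = chars;
    sensitiveWordSet = words;
  }

  /**
   * Reads the local sensitive-word file.
   *
   * @return the set of words; empty if the file is missing or cannot be read
   */
  private static Set<String> readSensitiveWords() {
    Set<String> words = new HashSet<>();
    InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(WORD_FILE);
    if (in == null) {
      // getResourceAsStream returns null for a missing resource instead of throwing
      LOG.error("Sensitive-word file not found: {}", WORD_FILE);
      return words;
    }
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty()) { // skip blank lines
          words.add(word);
        }
      }
    } catch (IOException e) {
      // Also covers UnsupportedEncodingException, which is a subclass of IOException
      LOG.error("Failed to read the sensitive-word file", e);
    }
    return words;
  }

  /**
   * Finds the sensitive words contained in a text.
   *
   * @param text the text to scan
   * @return every match, ordered by start position; overlapping matches are included
   */
  public static List<String> checkSensitiveWord(String text) {
    if (sensitiveWordSet == null || sensitiveCharSet == null) {
      init();
    }
    Set<Character> chars = sensitiveCharSet;
    Set<String> words = sensitiveWordSet;
    List<String> sensitiveWords = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      // Extend the candidate only while the character at j is a dirty character.
      // The earlier loop kept re-testing the character at i, so it never stopped early.
      for (int j = i; j < text.length() && chars.contains(text.charAt(j)); j++) {
        String key = text.substring(i, j + 1);
        if (words.contains(key)) {
          sensitiveWords.add(key);
        }
      }
    }
    return sensitiveWords;
  }
}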
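The dirty-character idea can also be tried in isolation. The sketch below is a minimal, self-contained version that takes its word list in memory instead of reading a file; the class name DirtyCharFilterSketch and its methods are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical.
final class DirtyCharFilterSketch {
  private final Set<String> words = new HashSet<>();
  private final Set<Character> dirtyChars = new HashSet<>();

  /** Stores the words and collects every character that occurs in any of them. */
  DirtyCharFilterSketch(List<String> wordList) {
    for (String word : wordList) {
      if (word.isEmpty()) {
        continue;
      }
      words.add(word);
      for (char c : word.toCharArray()) {
        dirtyChars.add(c);
      }
    }
  }

  /** Returns every stored word that occurs in text. */
  List<String> findAll(String text) {
    List<String> found = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      // A candidate is extended only while it consists purely of dirty characters
      for (int j = i; j < text.length() && dirtyChars.contains(text.charAt(j)); j++) {
        String candidate = text.substring(i, j + 1);
        if (words.contains(candidate)) {
          found.add(candidate);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    DirtyCharFilterSketch filter = new DirtyCharFilterSketch(Arrays.asList("bad", "ad"));
    System.out.println(filter.findAll("a bad day")); // prints [bad, ad]
  }
}
```

In "a bad day" the run "da" at the start of "day" is made of dirty characters, so it is looked up and rejected, while 'y' and the spaces end a candidate immediately without any set lookup on the word list.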

Note: the code above only illustrates the ideas; in practice there is plenty of room for optimization.
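As one possible example of such an optimization (a suggestion, not something the original code does): in the character-set scan of section 2, a long run of dirty characters makes the inner loop build ever longer substrings that cannot match. Capping the candidate length at the length of the longest word avoids that. The sketch below assumes an in-memory word set, and the names BoundedScanSketch and findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical.
final class BoundedScanSketch {
  /** Finds all words in text, never testing a candidate longer than the longest word. */
  static List<String> findAll(Set<String> words, String text) {
    Set<Character> dirtyChars = new HashSet<>();
    int maxLen = 0;
    for (String word : words) {
      maxLen = Math.max(maxLen, word.length());
      for (char c : word.toCharArray()) {
        dirtyChars.add(c);
      }
    }
    List<String> found = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      // Exclusive end of the window: at most maxLen characters starting at i
      int limit = Math.min(text.length(), i + maxLen);
      for (int j = i; j < limit && dirtyChars.contains(text.charAt(j)); j++) {
        String candidate = text.substring(i, j + 1);
        if (words.contains(candidate)) {
          found.add(candidate);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    Set<String> words = new HashSet<>(Arrays.asList("ab", "ba"));
    System.out.println(findAll(words, "ababab")); // prints [ab, ba, ab, ba, ab]
  }
}
```

Here every character of "ababab" is dirty, so an unbounded scan would test substrings of up to six characters from each start position; with the cap, no candidate is longer than two.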
