Using Python to clean a passage so that only English and digits remain

Original

Threeyearsago

2020-03-10 08:13

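A while ago I ran a natural language processing program whose main job was to clean a passage of text so that only English words and digits remain.

I wrote a class of my own that does the following: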

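(1) Remove URLs from the text.

(2) Remove every non-English phrase or word from the text.

(3) Remove all symbols from the text, such as !, @, #, (, &, $ and *.

(4) Remove every \n, \t and \r from the text.

(5) Convert the whole text to lowercase.

(6) Remove special sequences such as x00 and x0z. Only text that starts with x followed by a digit is removed.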
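To make the limit in item (6) concrete, here is a minimal sketch of how a pattern of that kind behaves. The helper name `strip_x_sequences` and the sample strings are made up for illustration; only the regular expression is the one this post uses.

```python
import re

# Same pattern the post applies for step (6); the helper name is hypothetical.
X_SEQUENCE = re.compile(r'x[0-9][a-zA-Z.\d]*')


def strip_x_sequences(text: str) -> str:
    """Remove every run that starts with 'x' followed by a digit."""
    return X_SEQUENCE.sub('', text)


print(repr(strip_x_sequences("abc x00 def")))   # 'abc  def'
print(repr(strip_x_sequences("abc xff def")))   # 'abc xff def' (no digit after x)
print(repr(strip_x_sequences("box2d engine")))  # 'bo engine' (also matches inside a word)
```

As the last line suggests, the pattern is not anchored to the start of a word, so an ordinary word that happens to contain x followed by a digit loses its tail as well.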

import re
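# This file is dedicated to processing values of type str.
# Its main purpose is to remove non-English content, URLs, and special sequences such as \n, \t, \r and x00 from a passage,
# strip every symbol from the text,
# and convert the text to lowercase.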
class process_str:
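    def get_english(self,dd):
        # Keep only the whitespace-separated tokens that consist solely of
        # English letters, digits and dots; every other token is dropped.
        st = ""
        for k in dd.split():
            # Raw string so that \d is read as a regex class, not a string escape.
            if len(re.findall(r"[^a-zA-Z\d.]", k)) == 0:
                st = st + " " + k
        return st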

    def process_data(self,data) -> str:
        # Remove URLs.
        data_first = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|-)*\b', '', data, flags=re.MULTILINE)
        # Replace symbols with spaces and convert uppercase to lowercase.
        # The literal two-character sequences \n, \r and \t are handled first:
        # once the backslash itself has been replaced they can no longer match.
        data_second = data_first.replace(r"\n", " ").replace(r"\r", ' ') \
            .replace(r"\t", ' ').replace("?", ' ').replace("/", ' ') \
            .replace(",", ' ').replace("\\", ' ').replace("~", ' ') \
            .replace("+", ' ').replace("=", ' ').replace("!", ' ').lower() \
            .replace("#", ' ').replace("@", ' ').replace('"', '') \
            .replace("$", ' ').replace("%", ' ').replace("(", ' ') \
            .replace(")", ' ').replace("-", ' ').replace("_", '').replace(":", ' ') \
            .replace(";", ' ').replace("'", ' ').replace("{", ' ').replace("}", ' ') \
            .replace("[", ' ').replace("]", ' ').replace("|", ' ').replace("*", ' ') \
            .replace(">", ' ').replace("<", ' ').replace("^", ' ')
        # Remove sequences such as x0z (an "x" followed by a digit).
        data_three = re.sub(r'x[0-9][a-zA-Z.\d]*', '', data_second, flags=re.MULTILINE)
        # Drop everything that is not English letters or digits.
        data_four = self.get_english(data_three).replace(".", " ")
        return data_four
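For comparison, the same six steps can be sketched as one function built on `str.translate`. This is not the class from this post, just a compact variant under a few assumptions: the names `clean_text`, `SYMBOL_TABLE` and the sample sentence are made up, the double quote is assumed to be deleted the same way the underscore is, and the result is trimmed so it has no leading or repeated spaces.

```python
import re

# The URL and x-sequence patterns come from the post; all names are hypothetical.
URL_RE = re.compile(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|-)*\b')
X_RE = re.compile(r'x[0-9][a-zA-Z.\d]*')
WORD_RE = re.compile(r'[a-z0-9.]+')

# Characters turned into a space, plus two that are deleted outright.
# Assumption: the double quote is meant to be deleted, like the underscore.
_mapping = {ch: " " for ch in "?/,\\~+=!#@$%()-:;'{}[]|*><^"}
_mapping.update({"_": None, '"': None})
SYMBOL_TABLE = str.maketrans(_mapping)


def clean_text(text: str) -> str:
    """Return text reduced to lowercase English words and digits."""
    text = URL_RE.sub("", text)
    # Literal two-character escapes must go before the backslash does.
    for escape in (r"\n", r"\r", r"\t"):
        text = text.replace(escape, " ")
    text = text.translate(SYMBOL_TABLE).lower()
    text = X_RE.sub("", text)
    # Keep only tokens made of ASCII letters, digits and dots.
    kept = " ".join(w for w in text.split() if WORD_RE.fullmatch(w))
    # Dots become spaces; split/join collapses the resulting double spaces.
    return " ".join(kept.replace(".", " ").split())


sample = "Hello, World! 你好 visit https://example.com/a?b=1 now\\x00 v2.0"
print(clean_text(sample))  # hello world visit now v2 0
```

A translation table walks the string once instead of once per symbol, which matters mostly for long inputs. For short passages the chained `replace` calls are just as serviceable.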

I put the .py file containing the process_str class on my GitHub, in process_str.py.
