Python File Processing Exercise: Separating Chinese and English Text

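Problem:
Separate the Chinese and English parts of a Chinese-English parallel text from 51voa.
For example:
華盛頓總統將感恩節定爲全國性節日
Problem analysis:
The source file is plain text. Lines are separated by line breaks, and each line is one block of either Chinese or English. Chinese text contains only Chinese punctuation and English text contains only English punctuation, so checking which kind of punctuation a line contains tells us whether it is a Chinese or an English block, and the two languages can be separated on that basis.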
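
The punctuation check described above can be sketched as a small helper. This is only an illustration: the name `classify_by_punctuation`, the two constants, and the first two sample sentences are assumptions made for the sketch and do not come from the 51voa file.

```python
# Punctuation marks used to tell the two languages apart (full-width vs ASCII).
CHINESE_PUNCT = '，。'
ENGLISH_PUNCT = ',.'

def classify_by_punctuation(line):
    """Return 'zh', 'en', or None when the line has neither kind of punctuation."""
    if any(ch in line for ch in CHINESE_PUNCT):
        return 'zh'
    if any(ch in line for ch in ENGLISH_PUNCT):
        return 'en'
    return None  # undecided: the line carries no comma or full stop

print(classify_by_punctuation('今天天氣很好，適合出門。'))
print(classify_by_punctuation('The weather is fine today, so go outside.'))
print(classify_by_punctuation('華盛頓總統將感恩節定爲全國性節日'))
```

The last call returns `None`, which is why punctuation alone is not enough and the scripts below also look for spaces.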

Basic version

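# deal with voa

def dealVoa(orgtxts, Eng, Chn):
    # Read the bilingual source file and write each non-empty line
    # to either the English or the Chinese output file.
    with open(orgtxts, 'r', encoding='utf-8') as fi, \
         open(Eng, 'w', encoding='utf-8') as foe, \
         open(Chn, 'w', encoding='utf-8') as foc:
        for line in fi:
            if len(line) > 1:  # skip empty lines
                # Chinese punctuation, or no space at all: treat as Chinese
                if '，' in line or '。' in line or ' ' not in line:
                    foc.write(line)
                # English punctuation or a space: treat as English
                elif ',' in line or '.' in line or ' ' in line:
                    foe.write(line)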

orgtxts = 'weather'
Eng = orgtxts + '_e' + '.txt'
Chn = orgtxts + '_c' + '.txt'
orgtxts = 'weather' + '.txt'
dealVoa(orgtxts,Eng,Chn)
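
One detail of the space test is worth spelling out. `str.find` returns an index, or -1 when nothing is found, so it should not be used directly as a condition: -1 is truthy and index 0 is falsy. The short sketch below shows the difference; `has_space` is a hypothetical helper name used only for this illustration.

```python
def has_space(line):
    """Return True when the line contains at least one ASCII space."""
    return ' ' in line

english = 'Better forecasts around worldwide'
chinese = '華盛頓總統將感恩節定爲全國性節日'

print(english.find(' '))        # 6: index of the first space
print(chinese.find(' '))        # -1: no space, yet -1 is truthy
print(bool(' leading'.find(' ')))  # False: the space sits at index 0

print(has_space(english))       # True
print(has_space(chinese))       # False
```

Using `in` (or comparing the result of `find` with -1) gives a real boolean in every case.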

Regular expression version

# deal with voa
# Chinese and English lines are told apart by their punctuation;
# lines without punctuation are told apart by whether they contain a space.
import re
re_cwords = re.compile(r'[，。]')
re_ewords = re.compile(r'[,. ]')

def dealVoa(orgtxts, Eng, Chn):
    with open(orgtxts, 'r', encoding='utf-8') as fi, \
         open(Eng, 'w', encoding='utf-8') as foe, \
         open(Chn, 'w', encoding='utf-8') as foc:
        for line in fi:
            if len(line) > 1:  # skip empty lines
                if re_cwords.search(line):
                    foc.write(line)
                elif re_ewords.search(line):
                    foe.write(line)
                else:
                    foc.write(line)  # Chinese line without punctuation

orgtxts = 'weather'
Eng = orgtxts + '_e' + '.txt'
Chn = orgtxts + '_c' + '.txt'
orgtxts = 'weather' + '.txt'
dealVoa(orgtxts,Eng,Chn)

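Notes:

1. The source file is currently copied by hand from the web page; the next step is to fetch it with a script.
2. Sentences that contain both Chinese and English punctuation, such as the following, are not handled correctly.
   “用戶將能夠自行決定。該系統現在在Weather.com網站以及Weather Channel智能手機應用程序上運行。”
3. The 12.3 update handles item 2, but lines without punctuation are still not handled well, for example “Better forecasts around worldwide”.
4. The 12.4 update handles item 3 by using find to check for a space, but the code is somewhat redundant; a regular expression would be a better fit.
5. The 12.5 regular expression version resolves the code redundancy from item 4 fairly cleanly.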
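
The punctuation and space rules can still misfile a Chinese line that has no Chinese punctuation but embeds an English phrase containing a space. One possible refinement, which is not part of the scripts above, is to look for Chinese characters directly. The sketch below rests on several assumptions: `CJK_CHAR` and `contains_cjk` are hypothetical names, the range U+4E00 to U+9FFF covers only the basic CJK Unified Ideographs block, and the first sample line is made up for the demonstration.

```python
import re

# Matches any character in the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r'[\u4e00-\u9fff]')

def contains_cjk(line):
    """Return True when the line contains at least one Chinese character."""
    return CJK_CHAR.search(line) is not None

# A Chinese line with an embedded English name and no Chinese punctuation.
print(contains_cjk('該系統在Weather Channel上運行'))
# An English line without any punctuation.
print(contains_cjk('Better forecasts around worldwide'))
```

A line for which `contains_cjk` is true could be sent to the Chinese file before any punctuation test runs, at the cost of treating every line that quotes a single Chinese character as Chinese.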
