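I recently wrote a web-scraping program in Java and noticed that a small amount of the Chinese text occasionally came out garbled. I later learned that Chinese text needs to be handled with a character stream.
Here is the code that uses a byte stream: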
import java.io.InputStream;
import java.net.URL;

/**
 * Get the target URL's source.
 * Use byte stream
 *
 * @param url target URL
 * @return String: source
 * @throws Exception
 */
public static String getUrlStr1(String url) throws Exception {
    InputStream is = null;
    String strData = "";
    try {
        URL u = new URL(url); // Create URL
        is = u.openStream(); // Open the URL stream
        // Load the bytes into strData, one buffer at a time
        byte[] myByte = new byte[1024 * 4];
        int len = 0;
        while ((len = is.read(myByte)) > 0) {
            // Problem: each chunk is decoded on its own (platform default charset),
            // so a multi-byte character split across two reads is corrupted
            String st = new String(myByte, 0, len);
            strData += st;
        }
    } finally {
        if (is != null) {
            is.close(); // is is still null if openStream() threw
        }
    }
    return strData;
}
Below is the improved character-stream code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

/**
 * Get the target URL's source.
 * Use character stream: all bytes are buffered and decoded once at the end
 *
 * @param url target URL
 * @return String: source
 * @throws Exception
 */
public static String getUrlStr(String url) throws Exception {
    InputStream is = null;
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    try {
        URL u = new URL(url); // Create URL
        is = u.openStream(); // Open the URL stream
        // Copy the raw bytes into the buffer without decoding them yet
        byte[] myByte = new byte[1024 * 4];
        int len = 0;
        while ((len = is.read(myByte)) > 0) {
            os.write(myByte, 0, len);
        }
    } finally {
        if (is != null) {
            is.close(); // is is still null if openStream() threw
        }
        os.close();
    }
    // Decode the complete byte sequence in one go (platform default charset)
    return os.toString();
}
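Comparing the two versions shows the cause: the byte-stream version converts each buffer to a string too early. If the byte array happens to end with only the first half of a Chinese character, that character is effectively torn in two. Once the halves are converted to String separately the result is garbled, and the damage cannot be undone...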
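The effect can be reproduced without a network connection. The following is only a minimal sketch, not the code from the scraper: the class name `SplitDecodeDemo`, both method names, the 4-byte chunk size, and the choice of UTF-8 are assumptions made for illustration. It decodes the same bytes once chunk by chunk, as `getUrlStr1` does, and once through `InputStreamReader`, the JDK's character stream, which carries an incomplete byte sequence over to the next read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class SplitDecodeDemo {

    /** Decodes every chunk on its own, like getUrlStr1 (UTF-8 assumed here). */
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int len;
        while ((len = in.read(buf)) != -1) {
            // A character whose bytes straddle two chunks is decoded in two broken halves
            sb.append(new String(buf, 0, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Lets InputStreamReader do the decoding, so split characters are reassembled. */
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int len;
            while ((len = reader.read(buf)) != -1) {
                sb.append(buf, 0, len);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u6587"; // two Chinese characters, 3 bytes each in UTF-8
        byte[] data = text.getBytes(StandardCharsets.UTF_8);

        // With 4-byte chunks the second character is split across two reads
        String perChunk = decodePerChunk(new ByteArrayInputStream(data), 4);
        String viaReader = decodeWithReader(new ByteArrayInputStream(data), 4);

        System.out.println("per-chunk decoding intact: " + perChunk.equals(text));
        System.out.println("reader decoding intact:    " + viaReader.equals(text));
    }
}
```

In a real scraper the charset should come from the page itself (for example the `Content-Type` header) rather than being hard-coded. The same splitting problem applies to GBK and other multi-byte encodings.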