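I like using Java for all sorts of small jobs because it makes my work more efficient.
A customer left a .sql file of a little over 2 GB on the cluster. Editors such as vim could not open it, so I could only page through part of it with less, and the Chinese text came out garbled because I did not know the file's encoding. (The garbling on its own can be fixed by changing vim's character-set settings.)
I downloaded the file to my machine. Notepad++ refused it with "File is too big to be opened by Notepad++", and MySQL Workbench froze when opening it.
So I cut it up with a file splitter: 15 equal pieces of 150 MB each.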
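What a size-based splitter does can be sketched in a few lines of Java. This is only an illustration of the idea, not the tool used here: the class name `ByteSplitter`, the method `split`, and the piece names 1.zg, 2.zg, ... (chosen to mirror the piece names that appear later in this post) are assumptions.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not the splitter tool mentioned in the post.
final class ByteSplitter {
    /** Cuts source into pieces of at most pieceBytes bytes; returns the number of pieces. */
    static int split(Path source, Path outDir, long pieceBytes) throws IOException {
        Files.createDirectories(outDir);
        int pieces = 0;
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            for (long pos = 0; pos < size; pos += pieceBytes) {
                Path piece = outDir.resolve((pieces + 1) + ".zg");
                try (FileChannel out = FileChannel.open(piece, StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
                    long remaining = Math.min(pieceBytes, size - pos);
                    long done = 0;
                    // transferTo may copy fewer bytes than asked, so loop until the piece is full
                    while (done < remaining) {
                        done += in.transferTo(pos + done, remaining - done, out);
                    }
                }
                pieces++;
            }
        }
        return pieces;
    }
}
```

Because the cut points are chosen by byte count alone, a piece can start or end in the middle of a statement, or even in the middle of a multi-byte character.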
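less showed the CREATE TABLE statement, which was written for Oracle, so I converted it to PostgreSQL syntax and created the table. I then opened 1.zg, deleted everything except the insert into statements, and ran the SQL in Navicat for PostgreSQL, which failed with a character-set error.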
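The character-set error is most likely what happens when GBK bytes are sent to a database that expects UTF-8 (the file turns out to be GBK and the database UTF-8). A minimal sketch of re-encoding a stream without loading it into memory follows; the class name `Transcode` and the method `copy` are hypothetical and not part of the original project.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original project.
final class Transcode {
    /** Copies text from in to out, decoding with "from" and encoding with "to"; returns chars copied. */
    static long copy(InputStream in, Charset from, OutputStream out, Charset to) throws IOException {
        long total = 0;
        char[] buf = new char[8192];
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in, from));
             BufferedWriter w = new BufferedWriter(new OutputStreamWriter(out, to))) {
            int n;
            while ((n = r.read(buf)) != -1) {
                w.write(buf, 0, n); // the writer encodes the decoded characters in the target charset
                total += n;
            }
        }
        return total;
    }
}
```

A call such as `Transcode.copy(new FileInputStream("1.zg"), Charset.forName("GBK"), new FileOutputStream("1.sql"), StandardCharsets.UTF_8)` would convert one piece; only an 8 KB buffer is held at a time, so file size does not matter.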
Here is a sample of the data:
insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172970051', '8986010712769238551', 2400, '123456', 0, 7, '皇崗集散中心(動)08.01.31-30', 2500, 2501, 57440, '1012', null, null, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(動)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
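Note that the file contains calls such as to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss').
I could not find this function in MySQL, but PostgreSQL has it, so there is no need to go to the trouble of stripping the calls out.
Along the way I tried to turn each file into multi-row inserts of the form insert into table values(...),values(...)....
In my tests that approach did not work when a function call was embedded in the values, so I am not posting that part of the code.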
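Also, as everyone knows, files exported with Oracle tools contain lines like these:
commit;
prompt 10000 records committed...
commit;
prompt 20000 records committed...
...
Remember to deal with these lines. The splitter cuts by size and knows nothing about statement boundaries, so a piece may begin or end in the middle of a statement. The damage is confined to the head and the tail of each piece. The cases are as follows: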
1.zg tail: insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM,
2.zg head: SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(動)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
1.zg tail: values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散
2.zg head: 中心(動)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
1.zg tail: insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
2.zg head: values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(動)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
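In the first two cases the cut line can be recovered by gluing the last line of one piece to the first line of the next. One detail is easy to miss: a size-based cut can fall between the two bytes of a GBK character, so the glue has to happen on raw bytes before decoding. The sketch below shows one way to do that; `Stitch` and its methods are hypothetical names, and it relies on the fact that the byte 0x0A never occurs inside a GBK double-byte character.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original project.
final class Stitch {
    /** Raw bytes after the last '\n' of a piece (empty if the piece ends on a line break). */
    static byte[] lastLine(Path piece) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(piece.toFile(), "r")) {
            long end = f.length();
            long pos = end;
            while (pos > 0) { // walk backwards to the byte just after the last '\n'
                f.seek(pos - 1);
                if (f.read() == '\n') {
                    break;
                }
                pos--;
            }
            byte[] buf = new byte[(int) (end - pos)];
            f.seek(pos);
            f.readFully(buf);
            return buf;
        }
    }

    /** Raw bytes before the first '\n' of a piece (a trailing '\r', if any, is kept). */
    static byte[] firstLine(Path piece) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(piece))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                buf.write(b);
            }
            return buf.toByteArray();
        }
    }

    /** Concatenates the raw bytes first and decodes once, so a split character survives. */
    static String join(byte[] tail, byte[] head, Charset charset) {
        byte[] all = new byte[tail.length + head.length];
        System.arraycopy(tail, 0, all, 0, tail.length);
        System.arraycopy(head, 0, all, tail.length, head.length);
        return new String(all, charset);
    }
}
```

For two neighbouring pieces, `Stitch.join(Stitch.lastLine(p1), Stitch.firstLine(p2), Charset.forName("GBK"))` would give back the line that the splitter cut. The third case needs no stitching, only pairing the intact header with the intact values line.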
The code is as follows:
package com.sibat.uhuibao;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Collections;
import java.util.List;

import com.sibat.uhuibao.util.FilesUtil;

/**
 * Filters the split pieces: complete lines go to one .sql file per piece,
 * lines damaged by the splitter go to error.sql
 *
 * @author nanphonfy
 */
public class BigSQLFinal {
    public static void main(String[] args) throws IOException {
        String readFile = "C:\\Users\\sibat\\Desktop\\1";// directory holding the 15 split pieces
        String writeFile = "C:\\Users\\sibat\\Desktop\\2\\";// directory for the processed files
        String errorFile = "C:\\Users\\sibat\\Desktop\\3\\";// directory for the rejected lines, i.e. error.sql
        List<String> readPath = new FilesUtil().getFiles(readFile);
        Collections.sort(readPath);
        for (String p : readPath) {
            System.out.println(p);
        }
        // try-with-resources flushes and closes every file, even if a piece fails
        try (BufferedWriter ebw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(errorFile + "error.sql"), "UTF-8"))) {
            for (String path : readPath) {
                String name = new File(path).getName().replace(".zg", "") + ".sql";
                long a = System.currentTimeMillis();
                try (BufferedReader br = new BufferedReader(
                             new InputStreamReader(new FileInputStream(path), "GBK"));// the customer's file is GBK
                     BufferedWriter bw = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(writeFile + name), "UTF-8"))) {// the database is UTF-8, so convert from GBK on write
                    filter(br, bw, ebw);
                }
                ebw.write("=====" + path + "=====\n\n");
                long b = System.currentTimeMillis();
                System.out.println(name + " took " + (b - a) + " ms\n");
            }
        }
    }

    /** Sends complete lines of one piece to bw and everything suspicious to ebw. */
    private static void filter(BufferedReader br, BufferedWriter bw, BufferedWriter ebw) throws IOException {
        int num = 0;// index of the current non-empty line; 0 is the first line
        boolean flag = false;// true when the first line is a fragment; a values line that follows must then go to the error file
        String line;
        while ((line = br.readLine()) != null) {
            if (line.isEmpty())
                continue;
            if (num == 1 && flag) {
                if (isInsertHeader(line)) {
                    writeLine(bw, line);
                } else {
                    writeLine(ebw, line);
                    flag = false;
                }
                num++;
                continue;
            }
            if (num == 0) {
                if (isValuesLine(line)) {
                    writeLine(ebw, line);// orphan values line: its insert header is in the previous piece
                } else if (isInsertHeader(line)) {
                    writeLine(bw, line);
                } else {// a fragment, so the second line may be an insert header
                    writeLine(ebw, line);
                    flag = true;
                }
                num++;
            } else if (isInsertHeader(line) || isValuesLine(line)) {
                writeLine(bw, line);
                num++;
            } else if (!line.contains("commit;") && !line.contains("records committed...")) {
                // commit/prompt lines from the Oracle export are dropped; anything else is a fragment
                writeLine(ebw, line);
                num++;
            }
        }
    }

    private static boolean isInsertHeader(String line) {
        return line.contains("insert into TB_IMSI_PARAM (") && line.contains("PRODUCT_ID");
    }

    private static boolean isValuesLine(String line) {
        return line.contains("values (") && line.contains(");");
    }

    private static void writeLine(BufferedWriter writer, String line) throws IOException {
        writer.write(line);
        writer.newLine();
    }
}
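One thing to watch in the listing above: Collections.sort orders the paths as plain strings. Assuming the pieces are named 1.zg through 15.zg, that puts 10.zg to 15.zg ahead of 2.zg. The per-piece output is unaffected, but the fragments land in error.sql in that order, which makes fixing them by hand harder. A small comparator that sorts by piece number is sketched below; `PieceOrder` is a hypothetical name and the numeric file names are an assumption.

```java
import java.util.Comparator;

// Hypothetical helper, not part of the original project.
final class PieceOrder {
    /** Orders paths by the number in file names such as "12.zg". */
    static final Comparator<String> BY_PIECE_NUMBER = Comparator.comparingInt(PieceOrder::pieceNumber);

    /** Extracts 12 from "C:\\dir\\12.zg" or "/dir/12.zg"; throws NumberFormatException on other names. */
    static int pieceNumber(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(slash + 1);
        int dot = name.indexOf('.');
        return Integer.parseInt(dot < 0 ? name : name.substring(0, dot));
    }
}
```

With this in place, `readPath.sort(PieceOrder.BY_PIECE_NUMBER)` could stand in for `Collections.sort(readPath)`.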
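Once the processing is done, check the head and tail of each file with gvim. When everything looks right, the SQL can be run. Navicat for PostgreSQL cannot run a batch of SQL files, so each file has to finish before the next one is started by hand, which is slow. The following code runs them all in one go: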
package com.sibat.uhuibao;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.Statement;
import java.util.LinkedList;
import java.util.List;

import com.sibat.uhuibao.util.DBUtil;
import com.sibat.uhuibao.util.FilesUtil;

/**
 * Reads SQL script files and executes them
 *
 * @author nanphonfy
 */
public class SqlFileExecutor {
    /**
     * Executes one SQL script file on the given connection
     *
     * @param conn
     *            database connection
     * @param sqlFile
     *            path of the SQL script file
     * @throws Exception
     */
    public void execute(Connection conn, String sqlFile) throws Exception {
        // every statement of the file is held in this list, which is why memory can run out
        List<String> sqlList = new LinkedList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(sqlFile), "UTF-8"))) {
            String line = null;
            String tmp = null;
            int num = 0;
            while ((line = br.readLine()) != null) {
                num++;
                if (num == 1) {
                    tmp = line;// the "insert into ... (columns)" line
                }
                if (num == 2) {
                    num = 0;
                    sqlList.add(tmp + " " + line);// the "values (...);" line completes the statement
                }
            }
        }
        try (Statement stmt = conn.createStatement()) {
            for (String sql : sqlList) {
                stmt.addBatch(sql);
            }
            stmt.executeBatch();
        }
        System.out.println(sqlFile + " executed successfully");
    }

    public static void main(String[] args) throws Exception {
        SqlFileExecutor executor = new SqlFileExecutor();
        String readFile = "C:\\Users\\sibat\\Desktop\\整理:final";// files already verified to run cleanly
        List<String> readPath = new FilesUtil().getFiles(readFile);
        Connection conn = DBUtil.getConnection();
        try {
            for (String path : readPath) {
                executor.execute(conn, path);
            }
        } finally {
            DBUtil.releaseConnection(conn);
        }
    }
}
FilesUtil.java
package com.sibat.uhuibao.util;

import java.io.File;
import java.util.LinkedList;
import java.util.List;

/**
 * Walks a directory and collects every file in it
 *
 * @author nanphonfy
 * @time 2016-08-23 15:34:18
 */
public class FilesUtil {
    private List<String> absolutePaths = new LinkedList<>();

    /*
     * Recursively collects the absolute paths of all files under the given path
     */
    public List<String> getFiles(String filePath) {
        File root = new File(filePath);
        File[] files = root.listFiles();
        if (files == null) {// not a directory, or it could not be read
            return absolutePaths;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                getFiles(file.getAbsolutePath());
            } else if (!file.getAbsolutePath().contains("_SUCCESS")) {// skip _SUCCESS marker files
                absolutePaths.add(file.getAbsolutePath());
            }
        }
        return absolutePaths;
    }
}
DBUtil.java
package com.sibat.uhuibao.util;

import java.sql.SQLException;

import javax.sql.DataSource;

import com.mchange.v2.c3p0.ComboPooledDataSource;

/**
 * Hands out connections from a c3p0 pool
 *
 * @author nanphonfy
 */
public class DBUtil {
    private static DataSource dataSource = null;// one data source is enough, hence static
    static {
        // the data source must be created only once
        dataSource = new ComboPooledDataSource("XXX");
    }

    /**
     * Returns a connection from the data source
     *
     * @return a pooled connection
     * @throws SQLException
     */
    public static java.sql.Connection getConnection() throws SQLException {// the return type has to be java.sql.Connection
        return dataSource.getConnection();
    }

    /**
     * Releases the connection
     *
     * @param connection
     */
    public static void releaseConnection(java.sql.Connection connection) {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
c3p0-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<c3p0-config>
    <named-config name="XXX">
        <property name="user">XXX</property>
        <property name="password">XXX</property>
        <property name="driverClass">org.postgresql.Driver</property>
        <property name="jdbcUrl">jdbc:postgresql://localhost:5432/DATABASE_NAME</property>
        <property name="acquireIncrement">5</property>
        <property name="initialPoolSize">10</property>
        <property name="minPoolSize">10</property>
        <property name="maxPoolSize">50</property>
        <property name="checkoutTimeout">0</property>
        <property name="maxStatements">20</property>
        <property name="maxStatementsPerConnection">5</property>
        <!-- Test all idle connections in the pool every 60 seconds. Default: 0 -->
        <property name="idleConnectionTestPeriod">60</property>
    </named-config>
</c3p0-config>
With this many statements held in memory at once, the JVM can run out of heap.
The error looks like this: Error java.lang.OutOfMemoryError: GC overhead limit exceeded
The fix is as follows:
Just increase the heap size a little by setting this option in
Run → Run Configurations → Arguments → VM arguments
-Xms3072M -Xmx4096M
-Xms sets the initial (minimum) heap size
-Xmx sets the maximum heap size
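Raising the heap works, but the pressure comes from holding every statement of a file in one list and one JDBC batch. A hedged alternative is to send the batch every few thousand statements so that memory use stays flat. The sketch below assumes the same two-lines-per-statement layout as the processed files; `ChunkedExecutor`, `ChunkSink` and the chunk size are hypothetical choices, not something from the original project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class ChunkedExecutor {
    /** Receives one chunk of complete statements at a time. */
    interface ChunkSink {
        void accept(List<String> chunk) throws SQLException;
    }

    /** Pairs each header line with its values line and hands chunks to the sink; returns the statement count. */
    static int forEachChunk(BufferedReader reader, int chunkSize, ChunkSink sink)
            throws IOException, SQLException {
        List<String> chunk = new ArrayList<>(chunkSize);
        int total = 0;
        String header;
        while ((header = reader.readLine()) != null) {
            String values = reader.readLine();
            if (values == null) {
                break; // dangling header at the end of a piece: nothing to execute
            }
            chunk.add(header + " " + values);
            total++;
            if (chunk.size() == chunkSize) {
                sink.accept(chunk);
                chunk = new ArrayList<>(chunkSize); // let the old chunk be garbage collected
            }
        }
        if (!chunk.isEmpty()) {
            sink.accept(chunk);
        }
        return total;
    }

    /** Runs the script on the connection with at most chunkSize statements per JDBC batch. */
    static int execute(Connection conn, BufferedReader reader, int chunkSize)
            throws IOException, SQLException {
        try (Statement stmt = conn.createStatement()) {
            return forEachChunk(reader, chunkSize, chunk -> {
                for (String sql : chunk) {
                    stmt.addBatch(sql);
                }
                stmt.executeBatch();
                stmt.clearBatch(); // make sure nothing from this chunk is kept
            });
        }
    }
}
```

A call like `ChunkedExecutor.execute(conn, reader, 5000)` would then keep at most 5000 statements in memory per file, whatever the file size.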
The 15 files can then be executed in 3 runs.
Finally, execute error.sql.
After that, exporting the table from PostgreSQL as a SQL file gives data in this format:
INSERT INTO "public"."tb_imsi_param" VALUES ('1', null, '460018172943802', '8986010512769124302', '1000', '123456', '0', '10', null, '2500', '2501', null, null, '0', '0', null, '30', '2008-06-27', '0', '2013', null, null, '27', 'AYaOTHERS', null, null, '32');
INSERT INTO "public"."tb_imsi_param" VALUES ('2', 'GD005', '460018172966280', '8986010712769234780', '2400', '123456', '0', '7', '集散中心(動)08.1.9-30', '2500', '2501', '57420', '0034', '0', '0', null, '30', '2008-06-27', null, '2013', null, null, '27', 'AYaGD005', null, null, '32');
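As you can see, the to_date calls are gone after the PostgreSQL export. One more small program keeps only the data, which is finally handed to the data analysts for analysis with Apache Pig. (It is simple, so I am not posting it.)
Sample: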
'8', 'GD005', '460018172969895', '8986010712769238395', '2400', '123456', '0', '7', '皇崗集散中心(動)08.01.31-30', '2500', '2501', '57440', '1036', null, null, null, '30', '2008-06-27', null, '2013', null, null, '27', 'AYaGD005', null, null, '32'
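The data-only step is not posted, so here is a minimal sketch of one way it could be done, assuming every exported line has exactly the INSERT INTO ... VALUES (...); shape shown above. `InsertStripper` and `dataOnly` are hypothetical names, not the program actually used.

```java
// Hypothetical helper, not the program actually used for this step.
final class InsertStripper {
    private static final String VALUES = " VALUES (";
    private static final String SUFFIX = ");";

    /** Returns the text between "VALUES (" and the closing ");", or null if the line has another shape. */
    static String dataOnly(String line) {
        String trimmed = line.trim();
        int start = trimmed.indexOf(VALUES);
        if (!trimmed.startsWith("INSERT INTO ") || start < 0 || !trimmed.endsWith(SUFFIX)) {
            return null; // not an exported insert statement: let the caller decide what to do
        }
        return trimmed.substring(start + VALUES.length(), trimmed.length() - SUFFIX.length());
    }
}
```

Applied line by line to the export, this yields rows in the format of the sample, and lines that return null can be logged for a manual look.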
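Summary:
On handling the large file imsi_param2016.sql:
The file is a little over 2 GB and holds the data of one table from the customer's Oracle database. Editors such as vim on Linux cannot open it; only a small part can be viewed with less.
①Download it to the local machine;
②Notepad++, MySQL Workbench and similar tools cannot open a file this large;
③Running the .sql against PostgreSQL fails to import because the character encodings do not match;
④Split the file into 15 equal pieces of 150 MB each with the 橘子分割 file splitter, then process the pieces with a Java program that filters out the complete SQL. Convert the Oracle table definition to PostgreSQL (because MySQL has no to_date function);
⑤Convert each of the 15 SQL files to UTF-8;
⑥Have the program pull the incomplete SQL out of every file and collect it in a single error.sql, then fix up its format by hand;
⑦Running 15 SQL files one by one is tedious, so write a program that runs them automatically in one go;
⑧After the steps above, the Oracle data is moved to PostgreSQL with a 0% error rate.
Author: @nanphonfy
Email: nanphonfy (Nfzone) gmail.com (replace (Nfzone) with @)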