Preface
Following up on the previous post, uploading files from the web to Hadoop HDFS for data collection: there are three requirements in total, and this post records how to import MySQL table data into HDFS from the web side, mainly using the Sqoop2 tool. I already wrote a post on importing data from MySQL to Hadoop HDFS with Sqoop2, but that one was done on the command line. This time we drive it through the Java API, and there are quite a few pitfalls along the way.
Environment
- OS Debian 8.7
- Hadoop 2.6.5
- SpringBoot 1.5.1.RELEASE
- MySQL 5.7.17 Community Server
- Sqoop 1.99.7
Project Dependencies
Without further ado, here is the pom.xml.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.infosys.sqoop</groupId>
    <artifactId>sqoop</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>sqoop</name>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.1.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <hadoop.version>2.6.5</hadoop.version>
        <sqoop.version>1.99.7</sqoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <!-- mysql.version is managed by the spring-boot-starter-parent POM -->
            <version>${mysql.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.sqoop</groupId>
            <artifactId>sqoop-client</artifactId>
            <version>${sqoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.derby</groupId>
            <artifactId>derby</artifactId>
            <version>10.10.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Test -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.mrunit</groupId>
            <artifactId>mrunit</artifactId>
            <version>1.1.0</version>
            <classifier>hadoop2</classifier>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-minicluster</artifactId>
            <version>${hadoop.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <finalName>${project.artifactId}</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-archetype-plugin</artifactId>
                <version>2.2</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
Spring Boot here is only used to scaffold the web layer quickly; the real work is done by the Hadoop client and the Sqoop JARs. Pay attention to the logging setup: several of these dependencies pull in the slf4j-log4j12 binding, so we exclude it to avoid having multiple SLF4J bindings on the classpath.
Core Sqoop Code
This is only a demo, not a real project, so I took a shortcut and put all the logic in one place. In a real project it must be factored out: more than one database needs to be supported, and some of the configuration options should be passed in from the web side (see the sketch after the listing below).
package com.infosys.sqoop.controller;

import com.infosys.sqoop.Driver;
import com.infosys.sqoop.ToFormat;
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.client.SubmissionCallback;
import org.apache.sqoop.model.*;
import org.apache.sqoop.validation.Status;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

import static com.infosys.sqoop.Constans.*;

/**
 * Description:
 * Company: Infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <[email protected]>
 * Version: 1.0
 * Created: 2017/3/12.
 */
@RestController
public class SqoopController {

    private static final Logger log = LoggerFactory.getLogger(SqoopController.class);

    @PostMapping(value = "/mysql2HDFS")
    public String mysqlToHDFS() throws Exception {
        Driver.mkdir(new String[]{HDFS_OUTPUT});
        SqoopClient client = new SqoopClient(SQOOP_URL);
        configSourceLink(client);
        configDestLink(client);
        configJob(client);
        startJob(client);
        return "SUCCESS!";
    }

    private void startJob(SqoopClient client) throws InterruptedException {
        // Start the job and register the callback; the last argument is the poll interval (ms)
        MSubmission submission = client.startJob(JOB_NAME, DEFAULT_SUBMISSION_CALLBACKS, 100);
        log.debug("Job Submission Status: " + submission.getStatus());
        log.debug("Hadoop job id: " + submission.getExternalJobId());
        log.debug("Job link: " + submission.getExternalLink());
        if (submission.getStatus().isFailure()) {
            log.error("Submission has failed: " + submission.getError().getErrorSummary());
            log.error("Corresponding error details: " + submission.getError().getErrorDetails());
        }
    }

    protected static final SubmissionCallback DEFAULT_SUBMISSION_CALLBACKS = new SubmissionCallback() {
        @Override
        public void submitted(MSubmission submission) {
            log.info("Submission submitted: " + submission);
        }

        @Override
        public void updated(MSubmission submission) {
            log.info("Submission updated: " + submission);
        }

        @Override
        public void finished(MSubmission submission) {
            log.info("Submission finished: " + submission);
        }
    };

    private void configJob(SqoopClient client) {
        MJob job = client.createJob(FROM_LINK_NAME, TO_LINK_NAME);
        job.setName(JOB_NAME);

        // "From" side of the job: which schema/table to read, and the partition column
        MFromConfig fromJobConfig = job.getFromJobConfig();
        fromJobConfig.getStringInput("fromJobConfig.schemaName").setValue(SOURCE_DB);
        fromJobConfig.getStringInput("fromJobConfig.tableName").setValue(SOURCE_TABLE);
        fromJobConfig.getStringInput("fromJobConfig.partitionColumn").setValue("id");

        // "To" side of the job: HDFS output directory and file format
        MToConfig toJobConfig = job.getToJobConfig();
        toJobConfig.getStringInput("toJobConfig.outputDirectory").setValue(HDFS_OUTPUT);
        toJobConfig.getEnumInput("toJobConfig.outputFormat").setValue(ToFormat.TEXT_FILE);

        client.saveJob(job);
    }

    /**
     * Configure the destination (HDFS) link.
     */
    private void configDestLink(SqoopClient client) {
        MLink link = client.createLink("hdfs-connector");
        link.setName(TO_LINK_NAME);
        MLinkConfig linkConfig = link.getConnectorLinkConfig();
        linkConfig.getStringInput("linkConfig.confDir").setValue("/home/hadoop/hadoop/etc/hadoop");

        Status status = client.saveLink(link);
        if (status.canProceed()) {
            log.debug("Created Link with Link Name : " + link.getName());
        } else {
            log.debug("Something went wrong creating the link");
        }
    }

    /**
     * Configure the source (MySQL) link.
     */
    private void configSourceLink(SqoopClient client) {
        MLink link = client.createLink("generic-jdbc-connector");
        link.setName(FROM_LINK_NAME);
        MLinkConfig linkConfig = link.getConnectorLinkConfig();

        // Connection string for the source database
        linkConfig.getStringInput("linkConfig.connectionString").setValue(DB_SCHEMA);
        // MySQL JDBC driver class
        linkConfig.getStringInput("linkConfig.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        // Database credentials
        linkConfig.getStringInput("linkConfig.username").setValue(DB_USERNAME);
        linkConfig.getStringInput("linkConfig.password").setValue(DB_PASSWD);

        // Watch out with MySQL: if this is left unset you get a SQL syntax error,
        // because identifiers are wrapped in double quotes by default,
        // which MySQL does not support.
        linkConfig.getStringInput("dialect.identifierEnclose").setValue("");

        log.debug("source link conf = " + linkConfig.toString());

        Status status = client.saveLink(link);
        if (status.canProceed()) {
            log.debug("Created Link with Link Name : " + link.getName());
        } else {
            log.debug("Something went wrong creating the link");
        }
    }
}
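As mentioned at the top of this section, in a real project the hardcoded options would come in from the web side. A minimal sketch of what that could look like; the class and field names here are hypothetical, not part of the demo above:

package com.infosys.sqoop.controller;

// Hypothetical request body carrying the options that are hardcoded in the demo.
public class ImportRequest {

    private String schemaName;       // e.g. "hadoopguide"
    private String tableName;        // e.g. "widgets"
    private String partitionColumn;  // e.g. "id"
    private String outputDirectory;  // e.g. "/sqoop"

    // getters/setters omitted
}

The endpoint signature would then become something like public String mysqlToHDFS(@RequestBody ImportRequest req), and those fields would be fed into the link and job configs instead of the constants.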
Hadoop Operations
The HDFS output directory of a Sqoop import must be empty, otherwise the job fails. So here we use the Hadoop client to create the directory up front; a sketch for clearing a stale directory follows the listing.
package com.infosys.sqoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

import static com.infosys.sqoop.Constans.FS_DEFAULT_FS;
import static com.infosys.sqoop.Constans.HDFS_HOST;

/**
 * Description:
 * Company: Infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <[email protected]>
 * Version: 1.0
 * Created: 2017/3/12.
 */
public class Driver {

    public static void mkdir(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set(FS_DEFAULT_FS, HDFS_HOST);

        GenericOptionsParser optionsParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionsParser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("need a path");
            System.exit(2);
        }

        // Create the output directory if it does not exist yet
        Path path = new Path(remainingArgs[0]);
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(path)) {
            fs.mkdirs(path);
        }
    }
}
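The listing only creates the directory when it is missing. If a previous import already wrote into it, the directory is no longer empty and the next job will fail again. One way to handle that, using the same FileSystem API, is to wipe the stale directory first; this is my own addition rather than part of the demo, and it is destructive:

// Sketch: recursively delete a stale output directory before re-creating it.
// Only do this if the old contents can safely be discarded.
if (fs.exists(path)) {
    fs.delete(path, true); // true = recursive delete
}
fs.mkdirs(path);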
Constants
Finally, the constants, posted here for completeness.
package com.infosys.sqoop;

/**
 * Description:
 * Company: Infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <[email protected]>
 * Version: 1.0
 * Created: 2017/3/12.
 */
public final class Constans {

    public static final String FS_DEFAULT_FS = "fs.defaultFS";
    public static final String HDFS_HOST = "hdfs://e5:9000";
    public static final String SQOOP_URL = "http://192.168.1.2:12000/sqoop/";

    // MySQL database name
    public static final String SOURCE_DB = "hadoopguide";
    // Table name
    public static final String SOURCE_TABLE = "widgets";

    public static final String HDFS_OUTPUT = "/sqoop";

    // Job name
    public static final String JOB_NAME = "web-job";
    // From-link name
    public static final String FROM_LINK_NAME = "web-link";
    // To-link name
    public static final String TO_LINK_NAME = "web-hdfs";

    // JDBC URL
    public static final String DB_SCHEMA = "jdbc:mysql://192.168.1.2:3306/hadoopguide?useSSL=false";
    public static final String DB_USERNAME = "root";
    public static final String DB_PASSWD = "lu123";
}
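In a real project, values like DB_PASSWD should not be compiled into the code. With Spring Boot it is a small step to move them into application.properties and inject them; a minimal sketch, with property keys I made up for illustration:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical: configuration injected from application.properties instead of Constans.
@Component
public class SqoopProperties {

    @Value("${sqoop.url}")
    private String sqoopUrl;

    @Value("${sqoop.db.password}")
    private String dbPassword;

    public String getSqoopUrl() { return sqoopUrl; }

    public String getDbPassword() { return dbPassword; }
}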
Afterword
This took the better part of a day to get working, and there really are quite a few pitfalls: some of the configuration keys have no constants exposed in the API, and a MapReduce job is hard to debug. Below are some of the errors I ran into while testing.
java.lang.Throwable:
The statement was aborted because it would have caused a duplicate key
value in a unique or primary key constraint or unique index identified by
'FK_SQ_LNK_NAME_UNIQUE' defined on 'SQ_LINK'
I suspected this was caused by not configuring how NULL database fields should be handled; in any case, after I gave a value to a field that had been NULL, the error stopped. Judging from the constraint name, though (a unique index on the link name in the SQ_LINK table), it can also simply mean that a link with the same name already exists in the Sqoop repository.
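On that reading, simply re-running the endpoint would also trigger it: the link and job names are fixed constants, so the second save collides with the rows from the first run. A defensive sketch, assuming the name-based deleteLink/deleteJob methods of the 1.99.7 client API:

// Sketch: remove objects left over from a previous run before re-creating them.
private void cleanup(SqoopClient client) {
    try {
        client.deleteJob(JOB_NAME);
        client.deleteLink(FROM_LINK_NAME);
        client.deleteLink(TO_LINK_NAME);
    } catch (Exception e) {
        // Nothing to delete on a first run; ignore.
    }
}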
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
You have an error in your SQL syntax; check the manual that corresponds
to your MySQL server version for the right syntax to use near
'"hadoopguide"."widgets"' at line 1
This is the identifier-quoting problem mentioned above: by default the generic JDBC connector wraps schema and table names in double quotes, which MySQL does not support, so the source link must be configured with:
linkConfig.getStringInput("dialect.identifierEnclose").setValue("");
I have not tested any other databases; if you have, feel free to share your experience.