Spark SQL supports reading from and writing to Hive. However, because Hive has a large number of dependencies, those dependencies are not bundled in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, since the workers need Hive's serialization and deserialization libraries (SerDes) in order to access data stored in Hive. Hive's configuration files hive-site.xml, core-site.xml (security configuration) and hdfs-site.xml (HDFS configuration) are placed in the conf/ directory.
When working with Hive, you must instantiate a SparkSession with Hive support; this works even if you do not have an existing Hive deployment. When hive-site.xml is not configured, Spark automatically creates metastore_db in the current application directory, along with the directory configured by spark.sql.warehouse.dir; if that option is not set, it defaults to the spark-warehouse directory under the current application directory.
Note: since Spark 2.0.0, the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated and replaced by spark.sql.warehouse.dir, which specifies the default location of warehouse data (the process must have write permission to it).
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
private int key;
private String value;
public int getKey() {
return key;
}
public void setKey(int key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = "/spark-warehouse";
// init spark session with hive support
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// only showing top 20 rows
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// |     500|
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you access each column by ordinal.
// The explicit MapFunction cast disambiguates the overloaded Dataset.map call in Java.
Dataset<String> stringsDS = sqlDF.map((MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1), Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<Record>();
for (int key = 1; key < 100; key++) {
Record record = new Record();
record.setKey(key);
record.setValue("val_" + key);
records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...
// only showing top 20 rows
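The example above only queries Hive tables. Since Spark SQL also supports writing to Hive, here is a minimal sketch, using the standard DataFrameWriter API rather than code from the original example, of saving the records DataFrame as a Hive managed table; the table name hive_records and the extra import are illustrative assumptions:
// Sketch only: persist the records DataFrame as a Hive managed table.
// Requires: import org.apache.spark.sql.SaveMode;
// The table name "hive_records" is chosen purely for illustration.
recordsDF.write().mode(SaveMode.Overwrite).saveAsTable("hive_records");
spark.sql("SELECT COUNT(*) FROM hive_records").show();
// Expected to report 99 rows, matching the records list built above.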
If you run the above code from Eclipse, you need to add the spark-hive jars to the project. The Maven configuration is shown below:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.0</version>
</dependency>
Otherwise you will run into the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:815)
at JavaSparkHiveExample.main(JavaSparkHiveExample.java:17)
Interacting with Different Versions of the Hive Metastore
One of the most important pieces of Spark SQL's Hive support is the interaction with the Hive metastore, which enables Spark SQL to access the metadata of Hive tables. Starting from Spark 1.4.0, a single build of Spark SQL can query different versions of the Hive metastore, using the configuration described below. Note that regardless of the metastore version being queried, Spark SQL internally compiles against Hive 1.2.1 and uses those classes for internal operations (SerDes, UDFs, UDAFs, etc.).
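As a minimal sketch of what such a configuration might look like, the snippet below points Spark SQL at an external Hive metastore of a different version; the version string "0.13.1" and the "maven" jar source are illustrative assumptions, not values taken from this article:
// Sketch only, assuming a Hive 0.13.1 metastore is deployed.
SparkSession sparkWithOldMetastore = SparkSession
  .builder()
  .appName("Java Spark Hive Metastore Example")
  // Version of the Hive metastore to communicate with (assumed here).
  .config("spark.sql.hive.metastore.version", "0.13.1")
  // Download Hive jars of that version from Maven instead of using the built-in classes.
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate();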