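I. Preface
Spark's development can be summarized in three stages: RDD, DataFrame, and DataSet. Before Spark 2.0, you had to create a SparkConf and a SparkContext before doing anything else. Since Spark 2.0, creating a single SparkSession is enough: SparkConf, SparkContext, and SQLContext are all wrapped inside SparkSession, which serves as the new unified entry point to Spark and makes it considerably easier to learn.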
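II. Creating a SparkSession
Creating a SparkSession is straightforward:
import org.apache.spark.sql.SparkSession
// Create the SparkSession
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataset")
  .enableHiveSupport() // Enables Hive support; omit this line if the code does not use Hive
  .getOrCreate()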
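Since the preface says that SparkSession wraps SparkConf, SparkContext, and SQLContext, a minimal sketch of reaching those objects from a session may be useful. The object name `SessionAccess` and the app name are made up for illustration, and the sketch assumes a local Spark installation on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object SessionAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("session-access") // hypothetical app name
      .getOrCreate()

    // The underlying SparkContext is exposed by the session
    val sc = spark.sparkContext
    println(sc.appName)

    // Runtime configuration is available through spark.conf
    println(spark.conf.get("spark.app.name"))

    // The legacy SQLContext is still reachable if older code needs it
    val sqlContext = spark.sqlContext
    println(sqlContext.sparkSession eq spark)

    spark.stop()
  }
}
```

Code written for the pre-2.0 API can usually keep working by taking `spark.sparkContext` from the session rather than constructing its own context.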
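III. Creating a DataSet/DataFrame
1. Creating a DataSet from a range
import org.apache.spark.sql.functions.desc
// 1. Generate a DataSet from a numeric range: 5 to 100 (exclusive) in steps of 5
val numDS = spark.range(5, 100, 5)
numDS.orderBy(desc("id")).show(5) // Sort in descending order and show the first 5 rows
numDS.describe().show() // Print summary statistics for numDS
The output is: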
+---+
| id|
+---+
| 95|
| 90|
| 85|
| 80|
| 75|
+---+
only showing top 5 rows
+-------+------------------+
|summary|                id|
+-------+------------------+
|  count|                19|
|   mean|              50.0|
| stddev|28.136571693556885|
|    min|                 5|
|    max|                95|
+-------+------------------+
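The summary figures above can be checked by hand. The sketch below recomputes them with plain Scala collections (no Spark involved), assuming the stddev row is the sample standard deviation, which divides by n - 1. The object name `RangeSummary` is made up for illustration.

```scala
object RangeSummary {
  // Same values as spark.range(5, 100, 5): start 5, end 100 (exclusive), step 5
  val ids: Seq[Long] = (5L until 100L by 5L)

  val count: Int = ids.size
  val mean: Double = ids.sum.toDouble / count

  // Sample standard deviation: squared deviations are divided by (n - 1)
  val stddev: Double =
    math.sqrt(ids.map(x => math.pow(x - mean, 2)).sum / (count - 1))

  def main(args: Array[String]): Unit = {
    println(s"count=$count mean=$mean stddev=$stddev min=${ids.min} max=${ids.max}")
  }
}
```

The squared deviations sum to 14250, and 14250 / 18 is about 791.67, whose square root is about 28.1366, in line with the stddev row of the table.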
2. Creating a DataSet from a collection
First, define a few case classes that will be used later:
// Case classes
case class Person(name: String, age: Int, height: Int)
case class People(age: Int, names: String)
case class Score(name: String, grade: Int)
Then import the implicit conversions:
import spark.implicits._
Finally, define a collection and create the DataSet from it:
// 2. Convert a collection to a DataSet
val seq1 = Seq(Person("xzw", 24, 183), Person("yxy", 24, 178), Person("lzq", 25, 168))
val ds1 = spark.createDataset(seq1)
ds1.show()
The output is:
+----+---+------+
|name|age|height|
+----+---+------+
| xzw| 24|   183|
| yxy| 24|   178|
| lzq| 25|   168|
+----+---+------+
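3. Converting an RDD to a DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// 3. Convert an RDD to a DataFrame
val array1 = Array((33, 24, 183), (33, 24, 178), (33, 25, 168))
// Keep only the first two tuple fields so that each Row matches the two-column schema below
val rdd1 = spark.sparkContext.parallelize(array1, 3).map(f => Row(f._1, f._2))
val schema = StructType(
  StructField("a", IntegerType, nullable = false) ::
  StructField("b", IntegerType, nullable = true) :: Nil
)
val rddToDataFrame = spark.createDataFrame(rdd1, schema)
rddToDataFrame.show(false) // false: do not truncate long cell values
The output is: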
+---+---+
|a  |b  |
+---+---+
|33 |24 |
|33 |24 |
|33 |25 |
+---+---+
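When the RDD already holds tuples, an explicit Row and StructType are not strictly needed. As a sketch of a shorter alternative (not the approach used in this article), `toDF` from `spark.implicits._` can name the columns directly and let Spark derive the types from the tuple. The object and method names here are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TupleRddToDF {
  // Builds a two-column DataFrame from an RDD of (Int, Int) tuples
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq((33, 24), (33, 24), (33, 25)))
    // Column names are supplied here; column types come from the tuple's element types
    rdd.toDF("a", "b")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tuple-rdd-to-df") // hypothetical app name
      .getOrCreate()
    val df = build(spark)
    df.printSchema()
    df.show(false)
    spark.stop()
  }
}
```

The explicit-schema route shown in the article remains the one to use when nullability or column types need to be controlled by hand.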
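4. Reading a file
// 4. Read a file, using a CSV file as the example
val ds2 = spark.read.csv("C://Users//Machenike//Desktop//xzw//test.csv")
ds2.show()
Because no header row is read, Spark assigns the default column names _c0, _c1, and _c2. The output is: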
+---+---+----+
|_c0|_c1| _c2|
+---+---+----+
|xzw| 24| 183|
|yxy| 24| 178|
|lzq| 25| 168|
+---+---+----+
5. Reading a file with explicit options
// 5. Read a file and set the reader options explicitly
val ds3 = spark.read.options(Map("delimiter" -> ",", "header" -> "false"))
  .csv("C://Users//Machenike//Desktop//xzw//test.csv")
ds3.show()
The output is:
+---+---+----+
|_c0|_c1| _c2|
+---+---+----+
|xzw| 24| 183|
|yxy| 24| 178|
|lzq| 25| 168|
+---+---+----+
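Both reads above produce string columns with generated names. One way to get named, typed columns, sketched here under the assumption that the file has the three fields name, age, and height in that order, is to pass an explicit schema and then convert to a typed DataSet. The object name `CsvWithSchema`, the `load` helper, and the temporary file are all made up so that the sketch does not depend on the author's path.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Mirrors the Person case class used earlier in the article
case class Person(name: String, age: Int, height: Int)

object CsvWithSchema {
  // Reads a header-less CSV file into a typed Dataset[Person]
  def load(spark: SparkSession, path: String): Dataset[Person] = {
    import spark.implicits._
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true),
      StructField("height", IntegerType, nullable = true)
    ))
    spark.read
      .option("header", "false")
      .schema(schema)
      .csv(path)
      .as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-with-schema") // hypothetical app name
      .getOrCreate()
    // Write a small throwaway file with the same three rows as the example
    val tmp = Files.createTempFile("people", ".csv")
    Files.write(tmp, "xzw,24,183\nyxy,24,178\nlzq,25,168\n".getBytes("UTF-8"))
    val people = load(spark, tmp.toUri.toString)
    people.printSchema()
    people.show()
    spark.stop()
  }
}
```

With the schema in place, later code can refer to `age` and `height` as integers instead of casting `_c1` and `_c2`.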
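IV. Basic DataSet functions
// 1. DataSet storage: checkpointing, caching, and persistence
val seq1 = Seq(Person("xzw", 24, 183), Person("yxy", 24, 178), Person("lzq", 25, 168))
val ds1 = spark.createDataset(seq1)
ds1.show()
// checkpoint() fails unless a checkpoint directory has been set; "./checkpoint" is only an example path
spark.sparkContext.setCheckpointDir("./checkpoint")
ds1.checkpoint()
ds1.cache() // Shorthand for persist() with the default storage level
ds1.persist()
ds1.count() // An action, which materializes the cached data
ds1.unpersist(true) // true: block until the cached data has been removed
// 2. DataSet structure
ds1.columns // Column names
ds1.dtypes // Column names paired with their data types
ds1.explain() // Print the physical plan
// 3. Converting between DataSet and RDD
val rdd1 = ds1.rdd
val ds2 = rdd1.toDS()
ds2.show()
val df2 = rdd1.toDF()
df2.show()
// 4. Save to a file
df2.select("name", "age", "height").write.format("csv").save("./save")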
V. DataSet actions
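As an illustration of what falls under this heading, the sketch below calls a few commonly used actions on a small DataSet. It is not taken from the original article: the object name `ActionsSketch` and the selection of actions are assumptions, chosen because each call triggers a computation and returns a result to the driver.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, height: Int)

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("actions-sketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("xzw", 24, 183), Person("yxy", 24, 178), Person("lzq", 25, 168)).toDS()

    ds.show()                                    // Print the rows to the console
    val n: Long = ds.count()                     // Number of rows
    val firstRow: Person = ds.first()            // First row as a Person
    val firstTwo: Array[Person] = ds.take(2)     // First two rows
    val all: Array[Person] = ds.collect()        // All rows, brought to the driver
    val totalAge: Int = ds.map(_.age).reduce((a, b) => a + b) // Sum of ages via reduce

    println(s"count=$n first=$firstRow taken=${firstTwo.length} collected=${all.length} totalAge=$totalAge")
    spark.stop()
  }
}
```

`collect()` pulls the whole DataSet into driver memory, so on large data `take(n)` or `show(n)` is usually the safer choice.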
VI. DataSet transformations
package sparkml
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// Case classes
case class Person(name: String, age: Int, height: Int)
case class People(age: Int, names: String)
case class Score(name: String, grade: Int)
object WordCount2 {
  def main(args: Array[String]): Unit = {
    // Set the log level to reduce console noise
    Logger.getLogger("org").setLevel(Level.WARN)
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset")
      .getOrCreate()
    import spark.implicits._
    // Create a DataSet from a Seq
    val seq1 = Seq(Person("leo", 29, 170), Person("jack", 21, 170), Person("xzw", 21, 183))
    val ds1 = spark.createDataset(seq1)
    // 1. map and flatMap
    ds1.map { x => (x.age + 1, x.name) }.show()
    // Pair the person's age with each character of the name
    ds1.flatMap { x =>
      val a = x.age
      x.name.split("").map { c => (a, c) }
    }.show()
    // 2. filter and where (where is an alias for filter)
    ds1.filter("age >= 25 and height >= 170").show()
    ds1.filter($"age" >= 25 && $"height" >= 170).show()
    ds1.filter { x => x.age >= 25 && x.height >= 170 }.show()
    ds1.where("age >= 25 and height >= 170").show()
    ds1.where($"age" >= 25 && $"height" >= 170).show()
    // 3. Deduplication
    ds1.distinct().show() // Drop rows that are identical in every column
    ds1.dropDuplicates("age").show() // Keep one row per distinct age
    ds1.dropDuplicates("age", "height").show()
    ds1.dropDuplicates(Seq("age", "height")).show()
    ds1.dropDuplicates(Array("age", "height")).show()
    // 4. Set operations: union, difference, and intersection
    val seq2 = Seq(Person("leo", 18, 183), Person("jack", 18, 175), Person("xzw", 22, 183), Person("lzq", 23, 175))
    val ds2 = spark.createDataset(seq2)
    val seq3 = Seq(Person("leo", 19, 183), Person("jack", 18, 175), Person("xzw", 22, 170), Person("lzq", 23, 175))
    val ds3 = spark.createDataset(seq3)
    ds3.union(ds2).show() // Union (duplicate rows are kept)
    ds3.except(ds2).show() // Difference: rows in ds3 that are not in ds2
    ds3.intersect(ds2).show() // Intersection: rows present in both
    // 5. select
    ds2.select("name", "age").show()
    ds2.select(expr("height + 1").as[Int].as("height")).show()
    // 6. Sorting
    ds2.sort("age").show() // Ascending by default
    ds2.sort($"age".desc, $"height".desc).show()
    ds2.orderBy("age").show() // Ascending by default
    ds2.orderBy($"age".desc, $"height".desc).show()
    // 7. Splitting and sampling
    val ds4 = ds3.union(ds2)
    val rands = ds4.randomSplit(Array(0.3, 0.7)) // Split randomly with weights 0.3 and 0.7
    println(rands(0).count())
    println(rands(1).count())
    rands(0).show()
    rands(1).show()
    val ds5 = ds4.sample(true, 0.5) // Sample with replacement, fraction 0.5
    println(ds5.count())
    ds5.show()
    // 8. Column operations
    val ds6 = ds4.drop("height")
    println(ds6.columns.mkString(", ")) // mkString prints the names rather than the array reference
    ds6.show()
    val ds7 = ds4.withColumn("add", $"age" + 2)
    println(ds7.columns.mkString(", "))
    ds7.show()
    val ds8 = ds7.withColumnRenamed("add", "age_new")
    println(ds8.columns.mkString(", "))
    ds8.show()
    ds4.withColumn("add_col", lit(1)).show() // Add a constant column
    // 9. Joins
    val seq4 = Seq(Score("leo", 85), Score("jack", 63), Score("wjl", 70), Score("zyn", 90))
    val ds9 = spark.createDataset(seq4)
    val ds10 = ds2.join(ds9, Seq("name"), "inner")
    ds10.show()
    val ds11 = ds2.join(ds9, Seq("name"), "left")
    ds11.show()
    // 10. Grouping and aggregation
    val ds12 = ds4.groupBy("height").agg(avg("age").as("avg_age"))
    ds12.show()
  }
}
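To make the results of the set operations in step 4 easier to predict, the following sketch mirrors them with plain Scala collections instead of Spark. It assumes the usual semantics: union keeps duplicates, while except and intersect return distinct rows. The object name `SetOpsByHand` is made up, and Spark does not guarantee the row order shown here.

```scala
case class Person(name: String, age: Int, height: Int)

object SetOpsByHand {
  val seq2 = Seq(Person("leo", 18, 183), Person("jack", 18, 175), Person("xzw", 22, 183), Person("lzq", 23, 175))
  val seq3 = Seq(Person("leo", 19, 183), Person("jack", 18, 175), Person("xzw", 22, 170), Person("lzq", 23, 175))

  // Counterpart of ds3.union(ds2): rows of both inputs, duplicates kept
  val unionRows: Seq[Person] = seq3 ++ seq2

  // Counterpart of ds3.except(ds2): distinct rows of seq3 that do not appear in seq2
  val exceptRows: Seq[Person] = seq3.filterNot(p => seq2.contains(p)).distinct

  // Counterpart of ds3.intersect(ds2): distinct rows present in both inputs
  val intersectRows: Seq[Person] = seq3.filter(p => seq2.contains(p)).distinct

  def main(args: Array[String]): Unit = {
    println(s"union: ${unionRows.size} rows")
    println(s"except: $exceptRows")
    println(s"intersect: $intersectRows")
  }
}
```

Rows are compared on all three fields, so leo and xzw land in the difference because their age or height differs between the two inputs, while jack and lzq are identical in both and form the intersection.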
VII. Built-in DataSet functions
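As an illustration of this heading, the sketch below applies a handful of functions from `org.apache.spark.sql.functions` to a small DataSet. The particular functions, the derived column names, and the object name `BuiltinFunctionsSketch` are assumptions picked for the example rather than content from the original article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min, upper, when}

case class Person(name: String, age: Int, height: Int)

object BuiltinFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("builtin-functions") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("xzw", 24, 183), Person("yxy", 24, 178), Person("lzq", 25, 168)).toDS()

    // Row-level functions: string conversion and a conditional expression
    ds.select(
      upper($"name").as("name_upper"),
      when($"height" >= 180, "tall").otherwise("other").as("height_group")
    ).show()

    // Aggregate functions over the whole DataSet
    ds.agg(max("height"), min("age"), avg("age")).show()

    spark.stop()
  }
}
```

These functions return Column expressions, so they compose with `select`, `withColumn`, `filter`, and `agg` in the same way as the `avg` call used in the grouping example.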
VIII. Example: WordCount
package sparkml
import org.apache.spark.sql.SparkSession
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .appName("Dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // Read the text, split lines into words, lowercase them, and drop the tokens ",", "." and "not"
    val data = spark.read.textFile("C://xzw//wordcount")
      .flatMap(_.split(" "))
      .map(_.toLowerCase())
      .filter($"value" =!= "," && $"value" =!= "." && $"value" =!= "not")
    // Count the occurrences of each word and show the 50 most frequent
    data.groupBy($"value").count().sort($"count".desc).show(50)
  }
}
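The same pipeline can be traced on a couple of in-memory lines with plain Scala collections, which makes each step easy to inspect without a Spark cluster or an input file. The object name `WordCountByHand` and the sample lines are made up for illustration.

```scala
object WordCountByHand {
  // Tokens dropped by the filter step in the Spark version
  private val excluded = Set(",", ".", "not")

  // Returns (word, occurrences) pairs, most frequent first
  def count(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))                     // split each line into tokens
      .map(_.toLowerCase)                        // normalize case
      .filterNot(w => excluded.contains(w))      // drop the excluded tokens
      .groupBy(identity)                         // group identical words
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (_, n) => -n }              // descending by count

  def main(args: Array[String]): Unit = {
    // Hypothetical input lines, standing in for the text file
    val lines = Seq("to be or not to be", "To Be , that is the question .")
    count(lines).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

For these two lines, "to" and "be" each occur three times once case is normalized, and "not" and the punctuation tokens are removed before counting.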