When writing a Spark program, querying specific fields of a CSV file is usually done like this:
**Method (1):** query directly with a DataFrame
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.schema(customSchema)
.load("cars.csv")
val selectedData = df.select("year", "model")
Reference: https://github.com/databricks/spark-csv
The code above reads the CSV file the Spark 1.x way. In Spark 2.x it is written slightly differently:
val df = sparkSession.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("people.csv")
.cache()
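Spark 2.x also ships a built-in CSV data source, so the external `com.databricks.spark.csv` package is not strictly required. A minimal sketch, assuming a local `SparkSession` and a throwaway temp file (the file contents and the names `spark`, `csvPath`, and `builtinDF` are illustrative only):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; a real job would configure its own master
val spark = SparkSession.builder().master("local[*]").appName("csv-demo").getOrCreate()

// Write a tiny CSV file so the example is self-contained (hypothetical data)
val csvPath = Files.createTempFile("people", ".csv")
Files.write(csvPath, "name,age\nMichael,29\nAndy,30\n".getBytes("UTF-8"))

// Built-in CSV reader, available since Spark 2.0
val builtinDF = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv(csvPath.toString)

builtinDF.select("name").show()
```

Without `inferSchema` or an explicit schema, every column is read as a string.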
**Method (2):** define a case class.
case class Person(name: String, age: Long)
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
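This example comes from the Spark 2.2.0 documentation.
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html
Both approaches are fine if you are only testing a small file whose header has few fields (a few dozen at most), for example when I only query a user's Name, Age, and Sex fields.
In practice, however, you run into problems like these:
- I don't know in advance which fields I need to query;
- I don't know in advance how many fields I need to query.
The examples above are then no longer enough. Fortunately, there is a third method.
**Method (3):**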
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
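The example above also comes from the Spark documentation. It still uses a DataFrame, but the structure of the queried fields is described with StructField and StructType, and each field of a result row is accessed by its numeric index rather than by a concrete field name such as Name or Age. As written, though, example (3) behaves much like examples (1) and (2) and does not solve the problems raised above, so it needs some improvement.
**Example (4):**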
val df = sparkSession.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("people.csv").cache()
// Comma-separated list of the fields to query; it can be built at run time
val schemaString = "name,age"
// Register a temporary view
df.createOrReplaceTempView("people")
// Query with SQL
val dataDF = sparkSession.sql("select " + schemaString + " from people")
// Convert to an RDD
val dfrdd = dataDF.rdd
// Build the schema from the same field list (requires import org.apache.spark.sql.types._)
val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert the RDD back into a DataFrame
val newDF = sparkSession.createDataFrame(dfrdd, schema)
This solves the problems raised above: neither the field names nor their number has to be hard-coded.
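The same effect can be sketched more directly by turning the comma-separated field string into `Column` objects and passing them to `select`. This is an alternative, not the approach used above. The in-memory data and the names `demoDF`, `wanted`, and `picked` are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("dynamic-select").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame loaded from people.csv (all columns as strings)
val demoDF = Seq(("Michael", "29", "M"), ("Andy", "30", "F")).toDF("name", "age", "sex")

// Field list known only at run time
val schemaString = "name,age"
val wanted = schemaString.split(",").map(_.trim)

// select accepts a variable number of Column arguments
val picked = demoDF.select(wanted.map(col): _*)
picked.show()
```

Because the column list is an ordinary array, both the field names and their count can change without touching the query code.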
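DataFrames are fast, especially in newer versions. In production, however, we may still use RDDs to transform the data we need. In that case, it can be written as follows.
The goal is to take the array holding one whole CSV row, extract the required fields, and turn them into a new array.
For example, the CSV file has 15 columns and we want two of them: "NAME" and "AGE".
This can be done by matching arrays. The main idea is:
- When reading a CSV file, split its first (header) line so that the position of each field, for example column n, is known. This gives array 1;
- Put the fields to be queried into array 2;
- Match array 2 against array 1 and record the position in array 1 of each field in array 2. This produces a new array 3;
- Array 3 records the positions, within the full field array, of the fields to read or write. With array 3, neither the concrete field names nor the number of fields is needed, which makes reading and writing the data convenient.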
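The steps above can be sketched in plain Scala, without Spark. The header line, the sample row, and the names `headerArr`, `indexArr`, and `projected` are made up for illustration:

```scala
// Array 1: every field name in the header, in file order
val headerArr: Array[String] = "ID,NAME,CITY,AGE".split(",")

// Array 2: the fields we want to query
val queryArr: Array[String] = Array("NAME", "AGE")

// Array 3: position of each queried field inside array 1 (unmatched names are skipped)
val indexArr: Array[Int] = queryArr.map(name => headerArr.indexOf(name)).filter(_ >= 0)

// Use array 3 to pull the wanted values out of one data row
val row: Array[String] = "7,Justin,Paris,19".split(",")
val projected: Array[String] = indexArr.map(i => row(i))

println(projected.mkString(","))  // prints Justin,19
```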
// Requires: import org.apache.spark.sql.Row
val queryArr = Array("NAME", "AGE")
// rowRDD is assumed here to be an RDD[Array[String]]: each CSV line already split into fields
val rowRDD2 = rowRDD.map(attributes => {
  val myattributes: Array[String] = attributes
  // Array holding the positions (column numbers) of the fields to query
  val mycolumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  // Pick out only the values at those positions
  val mycolumnsNameDataArr: Array[String] = mycolumnsNameIndexArr.map(i => myattributes(i))
  mycolumnsNameDataArr
}).map(x => Row(x)).cache() // Row(x) keeps the whole array as one field; Row.fromSeq(x) would give one field per column
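Here attributes corresponds to array 1, and mycolumnsNameIndexArr is array 3.
Each row of the resulting RDD is therefore an array. By iterating over the rows, the rows can then be converted into columns.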
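A minimal sketch of that row-to-column step, assuming the projected rows have already been collected to the driver as arrays of equal length (the data and the names `rows` and `columns` are hypothetical):

```scala
// Each inner array is one projected row: (NAME, AGE)
val rows: Array[Array[String]] = Array(
  Array("Michael", "29"),
  Array("Andy", "30"),
  Array("Justin", "19")
)

// transpose turns the row-major layout into one array per column
val columns: Array[Array[String]] = rows.transpose

println(columns(0).mkString(","))  // the NAME column
println(columns(1).mkString(","))  // the AGE column
```

This only suits result sets small enough to fit in driver memory.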
Array 3 is generated with the following method:
import scala.collection.mutable.ArrayBuffer

/**
 * Description: find the positions (column numbers) of the given fields in the first (header) line of a CSV file
 * Author: zhouyang
 * Date 2017/11/14 11:13
 * @param header the header line of the CSV file
 * @param columnsNameArr the names of the fields to look up
 * @return the position of each matched field; Array(-2) if no name matches, Array(-1) if the header has no fields
 */
// logger is assumed to be defined by the enclosing class
def getColumnsNameIndexArr(header: String, columnsNameArr: Array[String]): Array[Int] = {
  val tempHeaderArr: Array[String] = header.split(",")
  val indexArrb = new ArrayBuffer[Int]()
  if (tempHeaderArr.length > 0) {
    for (columnsNameStrTemp <- columnsNameArr) {
      // Position of the first header field equal to this name, or -1 if absent
      val i = tempHeaderArr.indexOf(columnsNameStrTemp)
      if (i >= 0) indexArrb += i
    }
    if (indexArrb.isEmpty) {
      // No matching column name
      logger.info("getColumnsNameIndex: no matching column name")
      indexArrb += (-2)
    }
  } else {
    // The header line contains no fields
    logger.info("getColumnsNameIndex:tempHeaderArr.length==0")
    indexArrb += (-1)
  }
  indexArrb.toArray
}
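The snippets above use `colsNameIndexArrBroadcast` without showing where it comes from. One possible way to wire the pieces together is sketched below. It is an assumption about the surrounding code, not the original implementation. The in-memory lines stand in for the CSV file, and `IndexProjectionSketch`, `run`, and `columnIndices` are hypothetical names, the last being a compact stand-in for `getColumnsNameIndexArr`:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object IndexProjectionSketch {
  // Hypothetical compact equivalent of getColumnsNameIndexArr (no sentinel values)
  def columnIndices(header: String, names: Array[String]): Array[Int] = {
    val headerArr = header.split(",")
    names.map(n => headerArr.indexOf(n)).filter(_ >= 0)
  }

  def run(spark: SparkSession, csvLines: Seq[String], queryArr: Array[String]): Array[Row] = {
    // Stand-in for spark.sparkContext.textFile(...) on a CSV file with a header line
    val lines = spark.sparkContext.parallelize(csvLines)
    val header = lines.first()

    // Ship the small index array to the executors once
    val colsNameIndexArrBroadcast = spark.sparkContext.broadcast(columnIndices(header, queryArr))

    val rowRDD = lines
      .filter(_ != header) // drop the header line
      .map(_.split(","))
      .map(attrs => Row.fromSeq(colsNameIndexArrBroadcast.value.map(i => attrs(i)).toSeq))

    // One nullable string column per queried field
    val schema = StructType(queryArr.map(n => StructField(n, StringType, nullable = true)))
    spark.createDataFrame(rowRDD, schema).collect()
  }
}

val spark = SparkSession.builder().master("local[*]").appName("index-projection").getOrCreate()
val result = IndexProjectionSketch.run(
  spark,
  Seq("ID,NAME,CITY,AGE", "1,Michael,Rome,29", "2,Andy,Oslo,30"),
  Array("NAME", "AGE")
)
result.foreach(println)
```

The sketch uses `Row.fromSeq` so that each queried field becomes its own column, which is what the schema built from `queryArr` expects.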
For details, see the article: https://blog.csdn.net/cafebar123/article/details/79509456