Querying arbitrary fields with Spark and returning the result as a DataFrame

When writing a Spark program, querying particular fields of a CSV file is usually written like this:
**Method (1):** query directly with a DataFrame

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")
val selectedData = df.select("year", "model")

Reference: https://github.com/databricks/spark-csv

The code above reads the CSV file the Spark 1.x way; the Spark 2.x way is slightly different:
val df = sparkSession.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("people.csv").cache()

**Method (2):** build a case class.

case class Person(name: String, age: Long)
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

This is the example from the Spark 2.2.0 website.

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html

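The two approaches above are fine if you are only testing a small file whose header has few fields (no more than a few dozen), for example when I only query a user's Name, Age, and Sex fields.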

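In practice, however, you run into these problems:

I don't know in advance which fields to query;
I don't know in advance how many fields to query.

The examples above then fall short. There happens to be a third approach.
**Method (3):**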

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

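The example above also comes from the Spark website. It still uses a DataFrame, but the structure of the queried fields is described with StructField and StructType, and each field of a row is accessed by its numeric index rather than by a concrete field name such as Name or Age. Still, example (3) works much like examples (1) and (2): the field list is hard-coded, so it cannot solve the problem raised above and needs further improvement.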

Example (4):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sparkSession.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("people.csv").cache()
// The fields to query, as a comma-separated string that can be supplied at runtime
val schemaString = "name,age"
// Register a temporary view
df.createOrReplaceTempView("people")
// SQL query
val dataDF = sparkSession.sql("select "+schemaString+" from people")
// Convert to an RDD
val dfrdd = dataDF.rdd
val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert the RDD back into a DataFrame
val newDF = sparkSession.createDataFrame(dfrdd, schema)

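This solves the problem raised above: schemaString can be supplied at runtime, so neither the field names nor their number has to be known in advance.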
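In example (4) the query is assembled by plain string concatenation. As a minimal sketch, assuming the field names arrive as a runtime list, the statement could be built by a small helper like the one below. The object name SelectBuilder and the function selectStatement are hypothetical, not part of Spark; the backtick quoting follows Spark SQL's identifier syntax and is only there so that names containing spaces or other special characters still parse.

```scala
object SelectBuilder {
  // Hypothetical helper: builds "select `a`,`b` from view" from a runtime list of field names.
  def selectStatement(fields: Seq[String], view: String): String = {
    require(fields.nonEmpty, "at least one field is required")
    // Quote each name with backticks; a backtick inside a name is escaped by doubling it.
    val quoted = fields.map(f => "`" + f.replace("`", "``") + "`")
    s"select ${quoted.mkString(",")} from $view"
  }

  def main(args: Array[String]): Unit = {
    // The field list could just as well come from a config file or a request parameter.
    println(selectStatement(Seq("name", "age"), "people"))
  }
}
```

The resulting string can be passed to sparkSession.sql in place of the concatenated one, and the same list of names can be used to build the StructField array.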

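DataFrames are fast, especially in newer versions. Of course, in a production environment we may still use RDDs to transform the data into the shape we want. In that case, it can be written like this: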

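// From the array holding one whole CSV row, extract the required fields into a new array
For example, the CSV file has 15 columns and we want 2 of them: "NAME" and "AGE".
This can be done by matching arrays. The main idea is: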

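When reading the CSV file, take the header (first line), in which every field has a position such as column n, and build array 1 from it;
Put the fields to query into array 2;
Match array 2 against array 1, record the position in array 1 of each field in array 2, and build a new array 3 from those positions;
Array 3 records where the fields to read or write sit among all the fields. With array 3 the code no longer needs the concrete field names or the number of fields, which makes reading and writing the data convenient.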
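The steps above can be sketched in plain Scala, without Spark, to show how the three arrays relate. This is only an illustration: the object ColumnProjection, its function names, and the sample header and row are hypothetical, and names missing from the header are simply skipped here.

```scala
object ColumnProjection {
  // Array 1: all field names from the CSV header line, in file order.
  def headerFields(headerLine: String): Array[String] =
    headerLine.split(",").map(_.trim)

  // Array 3: for each queried name (array 2), its position in array 1.
  // Names that are missing from the header are skipped in this sketch.
  def indexArray(header: Array[String], query: Array[String]): Array[Int] =
    query.map(name => header.indexOf(name)).filter(_ >= 0)

  // Project one split data row onto the positions stored in array 3.
  def project(row: Array[String], indexes: Array[Int]): Array[String] =
    indexes.map(i => row(i))

  def main(args: Array[String]): Unit = {
    val header = headerFields("ID,NAME,CITY,AGE")          // hypothetical header line
    val indexes = indexArray(header, Array("NAME", "AGE")) // array 2 -> array 3
    val row = "7,Justin,Paris,19".split(",")               // hypothetical data row
    println(project(row, indexes).mkString(","))           // prints "Justin,19"
  }
}
```

In a Spark job, array 3 would typically be computed once on the driver and broadcast, so that each task only performs the cheap positional lookup.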

val queryArr = Array("NAME","AGE")

    val rowRDD2 = rowRDD.map(attributes => {
      val myattributes : Array[String] = attributes
      // Array holding the column positions of the queried fields, e.g. column n
      val mycolumnsNameIndexArr : Array[Int] = colsNameIndexArrBroadcast.value
      // Pick the value at each queried position, in query order
      val mycolumnsNameDataArr : Array[String] = mycolumnsNameIndexArr.map(i => myattributes(i))
      mycolumnsNameDataArr
    }).map(x => Row(x)).cache() // Row(x) keeps the whole array as one column; Row.fromSeq(x) would give one column per field

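Here attributes plays the role of array 1 (one row split into all of its columns), and mycolumnsNameIndexArr is array 3.
This way, each row of the returned RDD is an array; by then iterating over the rows, the rows can be turned into columns.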
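Turning rows into columns can be sketched as a transpose. The example below is a minimal illustration, assuming the projected rows have already been collected to the driver and all have the same length; the object RowsToColumns and the sample values are hypothetical.

```scala
object RowsToColumns {
  // Column j of the result collects element j of every row.
  // transpose throws an IllegalArgumentException if the rows differ in length.
  def toColumns(rows: Seq[Array[String]]): Seq[Seq[String]] =
    rows.map(_.toSeq).transpose

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19"))
    // One output line per column, values joined with commas
    toColumns(rows).foreach(column => println(column.mkString(",")))
  }
}
```

Collecting to the driver is only reasonable for small results; for large data the transpose would have to be expressed as a distributed operation instead.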

Array 3 is generated as follows:

import scala.collection.mutable.ArrayBuffer

/**
    * Description: for each requested column name, find its position (e.g. column n) in the CSV header line
    * Author: zhouyang
    * Date 2017/11/14 11:13
    * @param header : String, the first (header) line of the CSV file
    * @param columnsNameArr : Array[String], the column names to look up
    * @return     Array[Int], the matched positions; Array(-1) if the header is empty, Array(-2) if nothing matched
    */
  def getColumnsNameIndexArr(header : String, columnsNameArr : Array[String]) : Array[Int] ={
    val tempHeaderArr : Array[String] = header.split(",")
    val indexArrb = new ArrayBuffer[Int]()
    if(tempHeaderArr.length>0){
      for(columnsNameStrTemp <- columnsNameArr){
        // indexOf returns the first matching position, or -1 if the name is absent
        val i = tempHeaderArr.indexOf(columnsNameStrTemp)
        if(i >= 0){
          indexArrb+=i
        }
      }
      if(indexArrb.length<1){
        // No matching column names
        logger.info("getColumnsNameIndex: no matching column names")
        indexArrb+=(-2)
      }
      indexArrb.toArray
    }
    else{
      logger.info("getColumnsNameIndex:tempHeaderArr.length==0")
      indexArrb+=(-1)
      indexArrb.toArray
    }
  }

For details, see this article: https://blog.csdn.net/cafebar123/article/details/79509456