Spark SQL和DataSet(六)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"寫在前面:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是強哥,一個熱愛分享的技術狂。目前已有 12 年大數據與AI相關項目經驗, 10 年推薦系統研究及實踐經驗。平時喜歡讀書、暴走和寫作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業餘時間專注於輸出大數據、AI等相關文章,目前已經輸出了40萬字的推薦系統系列精品文章,今年 6 月底會出版「構建企業級推薦系統:算法、工程實現與案例分析」一書。如果這些文章能夠幫助你快速入門,實現職場升職加薪,我將不勝歡喜。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想要獲得更多免費學習資料或內推信息,一定要看到文章最後喔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內推信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你正在看相關的招聘信息,請加我微信:liuq4360,我這裏有很多內推資源等着你,歡迎投遞簡歷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"免費學習資料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想獲得更多免費的學習資料,請關注同名公衆號【數據與智能】,輸入“資料”即可!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"學習交流羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想找到組織,和大家一起學習成長,交流經驗,也可以加入我們的學習成長羣。羣裏有老司機帶你飛,另有小哥哥、小姐姐等你來勾搭!加小姐姐微信:epsila,她會帶你入羣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在前面兩篇文章中,我們討論了Spark SQL和DataFrame API。我們研究瞭如何連接到內置和外部數據源,查看了Spark SQL引擎相關的內容,並探討了諸如SQL和DataFrames之間的相互操作性,創建和管理視圖和表,以及高級DataFrame和SQL轉換等主題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管我們在第3章中簡要介紹了DataSet API ,但還是概述瞭如何在Spark中創建,存儲,序列化和反序列化DataSet (強類型分佈式集合)這些比較重要的方面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本章中,我們將深入瞭解DataSet :我們將探索在Java和Scala中DataSet的有關用法 ,還有Spark如何管理內存以適應作爲高級API一部分的DataSet結構,以及與使用DataSet相關的成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Java和Scala的單一API","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如你從第三章中(圖3-1和表3-6)所記得的那樣,DataSet 爲強類型對象提供了統一且單一的API。在Spark支持的語言中,只有Scala和Java是強類型的。因此,Python和R只支持無類型的DataFrame API。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataSet 是特定領域的類型對象,可以使用函數式編程或從DataFrame API熟悉的DSL運算符並行操作DataSet 。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於這個單一的API,Java開發人員不再有落後的風險。例如,Scala未來的任何接口或行爲的變化,如groupBy(),flatMap(),map(),或filter() 這些方法,Java API也會是一樣的,因爲它是一個單一的接口,有統一的規範,這對這兩種實現方式是類似的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Scala案例類和JavaBeans用於DataSet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你還記得,從第3章(表3-2)可以知道,Spark 本身有內部的數據類型,如StringType,BinaryType,IntegerType,BooleanType和MapType,以便在Spark操作期間能夠無縫地映射到Scala和Java語言特定的數據類型。這種映射是通過編碼器完成的,我們將在本章後面討論。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了創建Dataset[T],其中T是Scala中類型化對象,也就是我們常說的泛型,因此你需要一個定義該對象的case類。爲了方便說明,在這裏使用第3章(表3-1)中的示例數據,我們有一個JSON文件,其中包含數以百萬計的博客作者關於Apache Spark的條目,這些條目採用以下格式:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"{id: 1, first: \"Jules\", last: \"Damji\", url: \"https://tinyurl.1\", date:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"1/4/2016\", hits: 4535, campaigns: {\"twitter\", \"LinkedIn\"}},","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"{id: 87, first: \"Brooke\", last: \"Wenig\", url: \"https://tinyurl.2\", date:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"5/5/2018\", hits: 8908, campaigns: {\"twitter\", \"LinkedIn\"}}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要創建分一個布式Dataset[Bloggers],我們必須首先定義一個Scala case類,該類定義包含Scala對象的每個單獨字段。這個case類作爲類型對象Bloggers的藍圖或數據結構:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"case class Bloggers(id:Int, first:String, last:String, url:String, date:String,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"hits: Int, campaigns:Array[String])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在,我們可以從數據源讀取文件:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val bloggers = \"../data/bloggers.json\"","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val bloggersDS = spark","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .read","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .format(\"json\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .option(\"path\", bloggers)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .load()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .as[Bloggers]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生成的分佈式DataSet 中的每一行都是Bloggers類型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同樣,你也可以在Java創建Bloggers類型的JavaBean類,然後使用編碼器創建一個Dataset,如下所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Java","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.sql.Encoders;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import java.io.Serializable;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"public class Bloggers implements Serializable {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private int id;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private String first;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private String last;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private String url;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private String date;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private int hits;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    private Array[String] campaigns;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// JavaBean getters and setters","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"int getID() { return id; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setID(int i) { id = i; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String getFirst() { return first; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setFirst(String f) { first = f; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String getLast() { return last; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setLast(String l) { last = l; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String getURL() { return url; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setURL (String u) { url = u; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String getDate() { return date; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Void setDate(String d) { date = d; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"int getHits() { return hits; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setHits(int h) { hits = h; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Array[String] getCampaigns() { return campaigns; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"void setCampaigns(Array[String] c) { campaigns = c; }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create Encoder","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Encoder BloggerEncoder = Encoders.bean(Bloggers.class);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String bloggers = \"../bloggers.json\"","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dataset bloggersDS = spark","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .read","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .format(\"json\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .option(\"path\", bloggers)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .load()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .as(BloggerEncoder);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如你所見,在Scala和Java中創建DataSet 需要一些深思熟慮,因爲你必須要知道正在讀取的行的所有字段的列名和類型。在這一點是與DataFrames不同,對於DataFrame你可以選擇讓Spark推斷數據結構,但是Dataset API要求你提前定義好數據結構,並且case類或JavaBean類要與該數據結構匹配,否則會出現異常。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scala案例類或Java類定義中的字段名稱必須與數據源中的順序匹配。數據中每一行的列名會自動映射到類中的相應名稱,並且類型會自動保留。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果字段名稱與輸入數據匹配,則可以使用現有的Scala case類或JavaBean類。使用Dataset API與使用DataFrames一樣容易,簡潔和聲明式。對於大多數DataSet 的轉換,你可以使用在上一章中瞭解到的相同關係運算符。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們使用樣本來研究DataSet 更多的一些方面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"處理DataSet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"創建樣本DataSet 的一種簡單而動態的方法是使用SparkSession實例。在這種情況下,爲了方便說明,我們動態創建了一個包含三個字段的Scala對象:uid(用戶的唯一ID),uname(隨機生成的用戶名字符串)和usage(服務器或服務使用情況的分鐘數)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"創建樣本數據","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,讓我們生成一些樣本數據:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import scala.util.Random._","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Our case class for the Dataset","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"case class Usage(uid:Int, uname:String, usage: Int)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val r = new scala.util.Random(42)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create 1000 instances of scala Usage class","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// This generates data on the fly","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val data = for (i usageEncoder = Encoders.bean(Usage.class);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Random rand = new Random();","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"rand.setSeed(42);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"List data = new ArrayList()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create 1000 instances of Java Usage class","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"for (int i = 0; i < 1000; i++) {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  data.add(new Usage(i, \"user\" +","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  RandomStringUtils.randomAlphanumeric(5),","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  rand.nextInt(1000));","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create a Dataset of Usage typed data","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dataset dsUsage = spark.createDataset(data, usageEncoder);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scala和Java之間生成的DataSet 將有所不同,因爲隨機種子算法可能有所不同。因此,你的Scala和Java的查詢結果將有所不同。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在我們有了生成的DataSet --- dsUsage,讓我們執行在上一章中完成的一些常見轉換操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"樣本數據轉換","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"回想一下,DataSet 是特定領域的對象的強類型集合。這些對象可以使用功能或關係操作進行並行轉換操作。這些轉換的例子包括map(),reduce(),filter(),select()和aggregate()。作爲高階函數的示例,這些方法可以通過lambda,閉包或函數作爲參數並返回結果。因此,它們非常適合函數式編程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scala是一種函數式編程語言,最近,lambda,函數自變量和閉包也已添加到Java中。讓我們嘗試一些Spark中的高階函數,並對之前創建的示例數據使用函數式編程結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"高階函數和函數式編程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"舉一個簡單的例子,讓我們使用它filter()來返回dsUsage DataSet 中所有使用時間超過900分鐘的用戶。一種實現方法是使用函數表達式作爲filter()方法的參數:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.sql.functions._","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter(d => d.usage > 900)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .orderBy(desc(\"usage\"))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .show(5, false)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種方法是定義一個函數並將該函數作爲參數提供給filter():","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"def filterWithUsage(u: Usage) = u.usage > 900","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.filter(filterWithUsage(_)).orderBy(desc(\"usage\")).show(5) ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|uid| uname|usage|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|561|user-5n2xY| 999|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|113|user-nnAXr| 999|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|605|user-NL6c4| 999|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|634|user-L0wci| 999|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|805|user-LX27o| 996|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在第一種情況下,我們使用lambda表達式{d.usage > 900}作爲filter()方法的參數,而在第二種情況下,我們定義了Scala函數def filterWithUsage(u: Usage) = u.usage > 900。在這兩種情況下,該filter()方法都會在分佈式Dataset中的對象的每一行上進行迭代Usage對象,並應用表達式或執行該函數,Usage爲表達式或函數值作爲一行,當布爾值爲true時,返回一個具有Usage 類型的新的Dataset。(有關方法簽名的詳細信息,請參見Scala文檔。)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Java中,用於filter()參數的類型爲FilterFunction。可以匿名內聯定義或使用命名函數定義。在此示例中,我們將通過名稱定義函數並將其分配給變量f。在filter()中應用此函數將返回一個新的Dataset,其中包含我們的過濾條件爲true的所有行:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Java","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Define a Java filter function","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FilterFunction f = new FilterFunction() {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   public boolean call(Usage u) {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       return (u.usage > 900);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"};","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Use filter with our function and order the results in descending order","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.filter(f).orderBy(col(\"usage\").desc()).show(5);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|uid|uname     |usage|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|67 |user-qCGvZ|997  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|878|user-J2HUU|994  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|668|user-pz2Lk|992  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|750|user-0zWqR|991  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|242|user-g0kF6|989  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"並非所有的lambda或函數參數都必須求Boolean值。他們也可以返回計算值。考慮使用高階函數的示例map(),我們的目標是找出usage中價值超過特定閾值的每個用戶的使用成本,以便我們爲每位用戶提供每分鐘的特殊價格。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Use an if-then-else lambda expression and compute a value","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.map(u => {if (u.usage > 750) u.usage * .15 else u.usage * .50 })","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .show(5, false)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Define a function to compute the usage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"def computeCostUsage(usage: Int): Double = {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  if (usage > 750) usage * 0.15 else usage * 0.50","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Use the function as an argument to map()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.map(u => {computeCostUsage(u.usage)}).show(5, false)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|value |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|262.5 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|251.0 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|85.0  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|136.95|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|123.0 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要在Java中使用map(),必須定義一個MapFunction。可以是匿名類,也可以是extends的已定義類MapFunction。在此示例中,我們將其內聯使用,即在方法中調用自己:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Java","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Define an inline MapFunction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.map((MapFunction) u -> {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   if (u.usage > 750)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       return u.usage * 0.15;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   else","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       return u.usage * 0.50;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}, Encoders.DOUBLE()).show(5); // We need to explicitly specify the Encoder","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|value |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|65.0  |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|114.45|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|124.0 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|132.6 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|145.5 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管我們已經計算了使用成本的值,但是我們不知道計算的值與哪些用戶相關聯。我們如何獲得這些信息?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"步驟很簡單:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 創建一個Scala case樣例類或JavaBean類UsageCost,並添加一個名爲cost的其他字段或列。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 定義一個函數來計算cost然後在map()方法中使用它。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是Scala中的樣子:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create a new case class with an additional field, cost","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"case class UsageCost(uid: Int, uname:String, usage: Int, cost: Double)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Compute the usage cost with Usage as a parameter","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Return a new object, UsageCost","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"def computeUserCostUsage(u: Usage): UsageCost = {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  val v = if (u.usage > 750) u.usage * 0.15 else u.usage * 0.50","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    UsageCost(u.uid, u.uname, u.usage, v)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Use map() on our original Dataset","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.map(u => {computeUserCostUsage(u)}).show(5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|uid|     uname|usage|  cost|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  0|user-Gpi2C|  525| 262.5|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  1|user-DgXDi|  502| 251.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  2|user-M66yO|  170|  85.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  3|user-xTOn6|  913|136.95|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  4|user-3xGSz|  246| 123.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+---+----------+-----+------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在,我們有了一個轉換後的DataSet ,其中有一個新的列cost ,由我們map()轉換中的函數以及所有其他列計算得出。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同樣,在Java中,如果我們想要與每個用戶相關的成本,則需要定義一個JavaBean類UsageCost和MapFunction。有關完整的JavaBean示例,請參見本書的GitHub repo;爲了簡潔起見,我們僅在此處顯示內聯MapFunction:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Java","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Get the Encoder for the JavaBean class","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Encoder usageCostEncoder = Encoders.bean(UsageCost.class);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Apply map() function to our data","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dsUsage.map( (MapFunction) u -> {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       double v = 0.0;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       if (u.usage > 750) v = u.usage * 0.15; else v = u.usage * 0.50;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       return new UsageCost(u.uid, u.uname,u.usage, v); },","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"         usageCostEncoder).show(5);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  cost|uid|     uname|usage|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|  65.0|  0|user-xSyzf|  130|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|114.45|  1|user-iOI72|  763|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 124.0|  2|user-QHRUk|  248|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 132.6|  3|user-8GTjo|  884|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 145.5|  4|user-U4cU1|  970|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------+---+----------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於使用高階函數和DataSet ,需要注意以下幾點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.我們使用類型化的JVM對象作爲函數的參數。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.我們使用點表示法(來自面向對象的編程)來訪問類型化的JVM對象內的各個字段,從而使其更易於閱讀。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.我們的某些函數和lambda簽名可以是類型安全的,從而確保編譯時錯誤檢測並指示Spark處理哪些數據類型,執行哪些操作等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.使用lambda表達式中的Java或Scala語言功能,我們的代碼可讀,表達和簡潔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5.Spark在Java和Scala中都提供了等效的map()和filter(),沒有高級函數的構造,因此你不必被迫對Datasets或DataFrames使用函數式編程。相反,你可以簡單地使用條件DSL運算符或SQL表達式:例如dsUsage.filter(\"usage > 900\")或dsUsage($\"usage\" > 900)。(有關此內容的更多信息,請參見“使用DataSet 的成本”。)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6.對於DataSet ,我們使用編碼器,這是一種在JVM和Spark內部二進制格式之間針對其數據類型有效地轉換數據的機制(有關更多信息,請參見“DataSet 編碼器”中的內容)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"函數和函數式編程並非Spark DataSet 獨有。你也可以將它們與DataFrames一起使用。回想一下,DataFrame是一個Dataset[Row],其中Row是一個通用的無類型JVM對象,可以容納不同類型的字段。方法簽名採用在其上進行操作的表達式或函數Row,這意味着每個Row數據類型都可以作爲表達式或函數的輸入值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"將DataFrame轉換爲DataSet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了對查詢和構造進行強類型檢查,可以將DataFrames轉換爲Datasets。要將現有的DataFrame df轉換爲SomeCaseClass類型的Dataset ,只需使用df.as[SomeCaseClass]表示即可。我們之前看到了一個示例:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val bloggersDS = spark","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .read","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .format(\"json\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .option(\"path\", \"/data/bloggers/bloggers.json\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .load()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .as[Bloggers]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.read.format(\"json\")返回一個DataFrame,在Scala中是的類型別名Dataset[Row]。Using.as[Bloggers]指示Spark使用編碼器(將在本章稍後討論)將對象從Spark的內部內存表示形式序列化/反序列化爲JVM Bloggers對象。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DataSet 和DataFrame 的內存管理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark是一個內存密集型的分佈式大數據引擎,因此其有效使用內存對其執行速度至關重要。在其整個發行歷史中,Spark的內存使用已發生了顯着的變化:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.Spark 1.0使用基於RDD的Java對象進行內存存儲,序列化和反序列化,這在資源和速度方面付出的代價都非常高。而且,存儲空間是","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"在Java堆上","attrs":{}},{"type":"text","text":"分配","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"的,","attrs":{}},{"type":"text","text":"因此對於大型數據集你只能","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"依靠","attrs":{}},{"type":"text","text":"JVM的垃圾回收(GC)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.Spark 1.x引入了Tungsten項目。它的突出特點之一是一種新的基於行的內部格式,該格式使用偏移量和指針在堆外存儲器中佈局DataSet 和DataFrame 。Spark使用一種稱爲","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"編碼器","attrs":{}},{"type":"text","text":"的高效機制在JVM及其內部Tungsten格式之間進行序列化和反序列化。堆外分配內存意味着減少了GC對Spark的影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.Spark 2.x引入了第二代Tungsten引擎,具有整個階段的代碼生成和基於列的矢量化內存佈局的功能。該新版本以現代編譯器的思想和技術爲基礎,還利用現代CPU和緩存體系結構,以“單條指令,多個數據”(SIMD)方法進行快速並行數據訪問。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DataSet 編碼器","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"編碼器將堆外內存中的數據從Spark的內部Tungsten格式轉換爲JVM Java對象。換句話說,它們將Dataset對象從Spark的內部格式序列化和反序列化爲JVM對象,包括原始數據類型。例如,Encoder[T]將會從Spark的內部Tungsten格式轉換爲Dataset[T]。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark內置支持爲原始類型(例如,字符串,整數,長整數)、Scala case類和JavaBeans自動生成生成編碼器。與Java和Kryo​​的序列化和反序列化相比,Spark編碼器明顯更快。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們先前的Java示例中,我們顯式創建了一個編碼器:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Encoder usageCostEncoder = Encoders.bean(UsageCost.class);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,對於Scala,Spark會自動爲這些高效的轉換器生成字節碼。讓我們看一下Spark內部基於Tungsten行的格式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Spark的內部格式與Java對象格式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Java對象的開銷很大,包括標頭信息,哈希碼,Unicode信息等。即使是簡單的Java字符串(如“ abcd”)也需要48字節的存儲空間,而不是你想象的4字節。可以想象一下創建一個MyClass(Int, String, String)對象的開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark不會爲DataSet 或DataFrame創建基於JVM的對象,而是分配","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"堆外","attrs":{}},{"type":"text","text":"Java內存來佈局其數據,並使用編碼器將數據從內存中表示形式轉換爲JVM對象。例如,圖6-1顯示瞭如何在MyClass(Int, String, String)內部存儲JVM對象。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3d/3d41f7e3e50f7cd9acefa12f0da14b52.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當數據以這種連續方式存儲並且可以通過指針算術和offets訪問時,編碼器可以快速序列化或反序列化該數據。這意味着什麼?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"序列化和反序列化(SerDe)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在分佈式計算中,這並不是一個新概念,在分佈式計算中,數據經常通過網絡在羣集中的計算機節點之間傳播,序列化和反序列化是將類型化對象編碼(序列化)爲發送方的二進制表示或格式,接收方從二進制格式轉換爲重新規範的數據類型對象的過程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,如果JVM對象MyClass在圖6-1有在Spark集羣節點之間共享,發送者將序列成字節數組,並且接收器將它反序列化爲MyClass類型的JVM對象。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JVM擁有自己的內置Java序列化器和反序列化器,但是效率低下,因爲(如上一節所述)JVM在堆內存中創建的Java對象會膨脹。因此,這個過程很緩慢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"出於以下幾個原因,這是Dataset編碼器可以解決的地方:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.Spark的內部Tungsten二進制格式(見圖6-1和6-2)將對象存儲在Java堆內存之外,並且結構緊湊,因此這些對象佔用的空間更少。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.編碼器可以使用帶有內存地址和偏移量的簡單指針算法遍歷整個存儲器,從而快速進行序列化(圖6-2)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.在接收端,編碼器可以將二進制表示形式快速反序列化爲Spark的內部表示形式。編碼器不受JVM的垃圾收集暫停的阻礙。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/40/40d213eacf06e7bb1b5d45d64b0cce0d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,正如我們接下來討論的那樣,生活中大多數美好的事物都是有代價的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"使用DataSet 的成本","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在第3章的“DataFrame與DataSet ”中,我們概述了使用DataSet 的一些好處——但是這些好處是有代價的。正如前面所指出的那樣,當DataSet 被傳遞到高階函數如,filter()、map()和flatMap(),或作爲參數傳遞給lambdas 方法時,存在與從Spark的內部Tungsten格式反序列化到JVM對象的成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與在Spark中引入編碼器之前使用的其他串行器相比,此開銷較小且可以接受。但是,在較大的DataSet 和密集查詢中,這些成本會累積並可能影響性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"降低成本的策略","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"減輕過度序列化和反序列化的一種策略是在查詢中使用DSL表達式,並避免過度使用lambda作爲匿名函數作爲高階函數的參數。因爲lambda在運行前一直是匿名的,並且對Catalyst優化器是不透明的,所以當你使用它們時,它不能有效地識別你在做什麼(你沒有告訴Spark","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"該做什麼","attrs":{}},{"type":"text","text":"),因此無法優化查詢(請參閱第三章中“ Catalyst Optimizer”這一節))。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二種策略是將查詢鏈接在一起,從而儘量減少序列化和反序列化。將查詢鏈接在一起是Spark中的一種常見做法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們用一個簡單的例子來說明。假設我們有一個類型爲Person的DataSet ,其中Person定義爲Scala案例類:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Person(id: Integer, firstName: String, middleName: String, lastName: String,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"gender: String, birthDate: String, ssn: String, salary: String)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們想使用函數式編程對此DataSet 發出一組查詢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們檢查一個我們無效地編寫查詢的情況,以便我們無意地承擔重複序列化和反序列化的代價:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import java.util.Calendar","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val earliestYear = Calendar.getInstance.get(Calendar.YEAR) - 40","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"personDS","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  // Everyone above 40: lambda-1","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter(x => x.birthDate.split(\"-\")(0).toInt > earliestYear)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  // Everyone earning more than 80K","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter($\"salary\" > 80000)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  // Last name starts with J: lambda-2","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter(x => x.lastName.startsWith(\"J\"))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  // First name starts with D","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter($\"firstName\".startsWith(\"D\"))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .count()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如你在圖6-3中所觀察到的,每次我們從lambda遷移到DSL(filter($\"salary\" > 8000))時,都會產生序列化和反序列化Person JVM對象的成本。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/33/3368d527f7fabbed17089bfb9ae030e9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":6,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"圖6-3。用lambda和DSL鏈接查詢的低效方式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,以下查詢僅使用DSL,不使用lambda。結果,它的效率更高,整個組合查詢和鏈接查詢都不需要序列化/反序列化:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"personDS","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter(year($\"birthDate\") > earliestYear) // Everyone above 40","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter($\"salary\" > 80000) // Everyone earning more than 80K","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter($\"lastName\".startsWith(\"J\")) // Last name starts with J","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .filter($\"firstName\".startsWith(\"D\")) // First name starts with D","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .count()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"出於好奇心裏,你可以在本書的GitHub倉庫中查看本章筆記中兩次運行之間的時間差異。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"總結","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本章中,我們詳細介紹瞭如何在Java和Scala中使用DataSet 。我們探索了Spark如何管理內存以適應DataSet 構造(作爲其統一和高級API的一部分),並且我們考慮了與使用DataSet 相關的一些成本以及如何減少這些成本。我們還向你展示瞭如何在Spark中使用Java和Scala的函數式編程構造。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,我們瞭解了編碼器如何從Spark的內部Tungsten二進制格式到JVM對象進行序列化和反序列化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在下一章中,我們將介紹如何通過檢查高效的I/O策略、優化和調整Spark配置以及在調試Spark應用程序時要查找的屬性和篩選值來優化Spark。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章