Spark Quick Start: Programming and Running

Apache Spark Quick Start

This guide gives a quick introduction to using Spark. We will first get a feel for the API through Spark's interactive Scala shell (don't worry if you don't know Scala – you won't need much of it for this), then show how to write standalone applications in Scala, Java, and Python. See the programming guide for a complete reference.

To get started, all we need is a build of Spark on our machine. Go into the Spark directory and run:

$ sbt/sbt assembly

Interactive Analysis with the Spark Shell

Basics

Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool for analyzing datasets interactively. Start it by running the following in the Spark directory:
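$ ./bin/spark-shell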

Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

RDDs have actions, which return values, and transformations, which return new RDDs. Let's start with a few actions:

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

Now let's try a transformation. We will use the filter transformation to return a new RDD containing a subset of the items in the file.

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

We can also chain transformations and actions together:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

More on RDD Operations

RDD actions and transformations can be used for more complex computations. Say we want to find the line with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Long = 16

This first maps a line to an integer value, creating a new RDD. reduce is then called on that RDD to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and they can use any language feature or Scala/Java library; for example, we can easily call functions declared elsewhere. Using the Math.max() function makes this code simpler and easier to understand:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 16

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:

scala> wordCounts.collect()
res6: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small "hot" dataset or when running an iterative algorithm like PageRank. As a simple example, let's mark our linesWithSpark dataset to be cached:

scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15

It may seem silly to use Spark to explore and cache a 30-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are distributed across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

A Standalone App in Scala

Now say we wanted to write a standalone application using the Spark API. We will walk through a simple application in Scala (with SBT), Java (with Maven), and Python. If you are using other build systems, consider using the Spark assembly JAR described in the developer guide.

We’ll create a very simple Spark application in Scala. So simple, in fact, that it’s named SimpleApp.scala:

/*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "$YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

This program just counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program. We pass the SparkContext constructor four arguments: the type of scheduler we want to use (in this case, a local scheduler), a name for the application, the directory where Spark is installed, and a name for the jar file containing the application's code. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.

This file depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt which explains that Spark is a dependency. This file also adds a repository that Spark depends on:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

If you also wish to read data from Hadoop’s HDFS, you will also need to add a dependency on hadoop-client for your version of HDFS:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"

Finally, for sbt to work correctly, we'll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use sbt/sbt run to execute our program.

$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

$ sbt/sbt package
$ sbt/sbt run
...
Lines with a: 46, Lines with b: 23


A Standalone App in Java

Now say we wanted to write a standalone application using the Java API. We will walk through doing this with Maven. If you are using other build systems, consider using the Spark assembly JAR described in the developer guide.

We'll create a very simple Spark application, SimpleApp.java:

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though here we use the Java-friendly JavaSparkContext class. We create RDDs (represented by JavaRDD) and run transformations on them. Finally, to pass functions to Spark, we create classes that extend spark.api.java.function.Function. The Java programming guide describes these differences in more detail.

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>
</project>

If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on hadoop-client for your version of HDFS:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>

We lay out these files according to the canonical Maven directory structure:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can execute the application using Maven:

$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleApp"
...
Lines with a: 46, Lines with b: 23

A Standalone App in Python

Now we will show how to write a standalone application using the Python API (PySpark).

As an example, we’ll create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "$YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print "Lines with a: %i, lines with b: %i" % (numAs, numBs)

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference. For applications that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the Python programming guide. SimpleApp is simple enough that we do not need to specify any code dependencies.

We can run this application using the bin/pyspark script:

$ cd $SPARK_HOME
$ ./bin/pyspark SimpleApp.py
...
Lines with a: 46, Lines with b: 23

Running on a Cluster

There are a few additional considerations when running an application on a Spark, YARN, or Mesos cluster.

Including Your Dependencies

If your code depends on other projects or libraries, you will need to make sure they are also present on the slave nodes. A popular approach is to create an assembly jar containing your code and its dependencies; both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency, since it is already present on the slaves and need not be bundled. Once you have an assembled jar, pass it to the SparkContext. You can also add your dependency jars one by one using the addJar method of SparkContext, as in the sketch below.
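A minimal Scala sketch of both approaches; the master URL and jar paths here are placeholders, and only the constructor form shown earlier and the addJar method come from this guide:

import org.apache.spark.SparkContext

// Pass the assembly jar when constructing the SparkContext
// (the master URL and jar paths below are hypothetical).
val sc = new SparkContext("spark://master:7077", "My App", "$YOUR_SPARK_HOME",
  List("target/scala-2.10/my-app-assembly-1.0.jar"))

// ...or register additional dependency jars one by one afterwards.
sc.addJar("/path/to/extra-dependency.jar")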

For Python, you can use the pyFiles argument of SparkContext or its addPyFile method to add .py, .zip, or .egg files to be distributed to the cluster.

Configuration Options

Spark includes several configuration options that influence the behavior of your application. These are set by building a SparkConf object and passing it to the SparkContext constructor. For example, in Java and Scala you can write:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

Or in Python:

from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("My application")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf = conf)

Accessing Data in Hadoop Filesystems

The examples here access a local file. To read data from a distributed filesystem such as HDFS, include the Hadoop version information in your build file, as shown above. By default, Spark is built against HDFS 1.0.4.
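Once the matching hadoop-client dependency is in place, an HDFS path can be read just like a local file. A minimal sketch, assuming a hypothetical namenode address and file path:

// The hdfs:// URI below is a placeholder; substitute your own namenode and path.
val logData = sc.textFile("hdfs://namenode:9000/path/to/README.md")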

