Spark Inverted Index

Original

仰望星空_

2018-09-03 23:25

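1. Example description

The input is a batch of files. Each line holds a document ID followed by the document text (the code below expects a tab between the two):

Id1 The Spark

……

Id2 The Hadoop

……

The output has one record per word, in the form (word, document IDs joined into one string):

The Id1 Id2

Hadoop Id2

……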

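2. Design approach

First read all the files into an RDD whose elements are (document ID, words in the document). Then map that data to an RDD of (word, document ID) pairs and remove duplicates. Finally, aggregate the document IDs of each word in the reduceByKey stage.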
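The steps above can be traced on the two sample lines without a cluster. The following is a minimal sketch that uses plain Scala collections as a stand-in for RDDs; the object name `InvertedIndexSketch` and the helper `build` are hypothetical, and `groupBy` followed by `reduceLeft` plays the role of `reduceByKey`.

```scala
object InvertedIndexSketch {
  // Builds word -> "id1 id2 ..." from lines of the form "<docId>\t<text>".
  def build(lines: Seq[String]): Map[String, String] = {
    val pairs: Seq[(String, String)] = lines
      .map(_.split("\t"))
      .filter(_.length >= 2) // skip lines without a tab-separated text field
      .flatMap { fields =>
        // One (word, docId) pair per word occurrence.
        fields(1).split(" ").filter(_.nonEmpty).map(word => (word, fields(0))).toSeq
      }
      .distinct // a word repeated inside one document is kept once

    // Stand-in for reduceByKey(_ + " " + _): join the IDs of each word.
    pairs.groupBy(_._1).map { case (word, ps) =>
      word -> ps.map(_._2).reduceLeft(_ + " " + _)
    }
  }

  def main(args: Array[String]): Unit = {
    val index = build(Seq("Id1\tThe Spark", "Id2\tThe Hadoop"))
    index.toSeq.sortBy(_._1).foreach { case (word, ids) => println(s"$word $ids") }
    // Hadoop Id2
    // Spark Id1
    // The Id1 Id2
  }
}
```

In this local sketch the IDs come out in input order. On a real cluster the order in which reduceByKey joins the IDs is not guaranteed.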

3. Code

import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits; only required on Spark versions before 1.3.
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InvertedIndex").setMaster("local[1]")
    val sc = new SparkContext(conf)

    // Each input line is "<docId>\t<text>".
    val textRdd = sc.textFile("hdfs://master:9000/wordIndex")

    // Split each line into (docId, text), skipping lines without a tab.
    val md = textRdd.map(line => line.split("\t")).filter(_.length >= 2)
    val md2 = md.map(item => (item(0), item(1)))

    // Emit one (word, docId) pair per word occurrence.
    val fd = md2.flatMap { case (docId, text) =>
      text.split(" ").map(word => (word, docId))
    }

    // Remove duplicate (word, docId) pairs, then join the IDs of each word.
    val result = fd.distinct()
    val resRdd = result.reduceByKey(_ + " " + _)

    resRdd.saveAsTextFile("hdfs://master:9000/InvertIndex")
    sc.stop()
  }
}

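4. Notes

A few points are worth noting.

The RDD flatMap method is defined as follows:

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
  new FlatMappedRDD(this, sc.clean(f))

The method takes a function as its parameter, and that function's return type is a collection (more precisely TraversableOnce, a supertype of the collection types). flatMap concatenates the collections returned for the individual elements into one new RDD. It does not remove duplicate elements, and it does not merge the partitions of the RDD.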
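The difference between map and flatMap, and the fact that duplicates survive, can be seen in a few lines. This is a minimal sketch that uses plain Scala collections in place of an RDD, on the assumption that the flattening behaves the same way element by element; the object name `FlatMapSketch` is hypothetical.

```scala
object FlatMapSketch {
  // Plain Scala collections stand in for an RDD here.
  val lines: Seq[String] = Seq("The Spark", "The Hadoop")

  // map keeps one output element per input element: a collection of collections.
  val nested: Seq[Seq[String]] = lines.map(_.split(" ").toSeq)

  // flatMap concatenates the per-element collections; the repeated "The" is kept.
  val flat: Seq[String] = lines.flatMap(_.split(" ").toSeq)

  def main(args: Array[String]): Unit = {
    println(nested.length)       // 2
    println(flat.mkString(", ")) // The, Spark, The, Hadoop
  }
}
```

Removing the repeated elements takes a separate step, which is why the inverted index code calls distinct() after flatMap.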

The reduce method is defined as follows:

/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}

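The reduce function is equivalent to applying reduceLeft to the elements of the RDD. reduceLeft first applies the reduce function to two elements <K,V>, then applies it to that result and the next element <K,V> taken from the iterator, and continues until the iterator has traversed all elements and a single final result remains.

In an RDD, reduceLeft is first run separately over the collection of elements <K,V> in each partition. The result of each partition is itself equivalent to one element <K,V>, and these per-partition results are then combined with the same function. Since the partition results are merged as the tasks complete, the function needs to be commutative and associative, as the doc comment in the definition states.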
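The two phases described above can be imitated locally. The following is a minimal sketch, not Spark's implementation: nested Scala sequences play the role of partitions, and the object name `PartitionedReduceSketch` and its `reduce` helper are hypothetical. Partition results are merged here in a fixed left-to-right order, which is a simplification.

```scala
object PartitionedReduceSketch {
  // Mirrors the shape of RDD.reduce: reduceLeft inside each partition, then merge.
  def reduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T = {
    // Phase 1: each non-empty partition collapses to a single value.
    val perPartition: Seq[Option[T]] =
      partitions.map(p => if (p.nonEmpty) Some(p.reduceLeft(f)) else None)

    // Phase 2: merge the partition results; empty partitions are skipped.
    perPartition.flatten
      .reduceLeftOption(f)
      .getOrElse(throw new UnsupportedOperationException("empty collection"))
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))
    println(reduce(partitions)(_ + _)) // 15
  }
}
```

With an operator that is not commutative, such as string concatenation, the result would depend on the merge order, which this sketch fixes but a cluster does not.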

For example, suppose the user-defined function is as follows.

f:(A,B)=>(A._1+"@"+B._1 , A._2+B._2)

As shown in the figure (image not preserved):
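The same function can be traced by hand. This is a minimal sketch that assumes the elements are (String, Int) pairs and that the data sits in two partitions; the sample values and the object name `PairReduceSketch` are hypothetical and are not taken from the original figure.

```scala
object PairReduceSketch {
  // The user-defined function from the text, written out with explicit types.
  val f: ((String, Int), (String, Int)) => (String, Int) =
    (a, b) => (a._1 + "@" + b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // Hypothetical data: two partitions of (key, value) pairs.
    val partition1 = Seq(("a", 1), ("b", 2))
    val partition2 = Seq(("c", 3), ("d", 4))

    val r1 = partition1.reduceLeft(f) // ("a@b", 3)
    val r2 = partition2.reduceLeft(f) // ("c@d", 7)

    // Merging the two partition results yields the final element.
    println(f(r1, r2)) // (a@b@c@d,10)
  }
}
```

The summed value is the same whichever partition finishes first, but the joined key string is not, because string concatenation is not commutative.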
