aggregate() is similar to reduce() but can return a result of a different type from the RDD's elements. It accepts three parameters:
zeroValue: the initial value for accumulating the result within each partition with the seqOp operator, and also the initial value for combining the results from different partitions with the combOp operator
seqOp: an operator used to accumulate results within a partition
combOp: an associative operator used to combine results from different partitions
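The roles of the three parameters can be sketched in plain Python, without a Spark cluster. The helper `local_aggregate` and the hand-made partition lists below are hypothetical; they only imitate how aggregate() folds each partition with seqOp and then merges the partial results with combOp, and are not Spark's actual implementation.

```python
from functools import reduce

def local_aggregate(partitions, zero_value, seq_op, comb_op):
    """Imitate RDD.aggregate() on a list of partitions (hypothetical helper)."""
    # Fold every partition separately, each starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Merge the per-partition results, again starting from zero_value.
    return reduce(comb_op, partials, zero_value)

# Assumed split of [12, 2, 6, 2, 12, 2] into two partitions.
partitions = [[12, 2, 6], [2, 12, 2]]

# The elements are ints, but the result is a (sum, count) tuple.
total, count = local_aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: fold in one element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: merge two partial results
)
print(total, count, total / count)  # 36 6 6.0
```

Here the neutral zeroValue (0, 0) leaves the totals unchanged no matter how the data is split.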
nums = sc.parallelize([12, 2, 6, 2, 12, 2])
nums.reduce(lambda x,y:x+y)
Result: 36
nums.aggregate(0, lambda x, y: x + y, lambda x, y: x + y)
Result: 36
Source documentation: http://spark.apache.org/docs/2.2.1/api/python/pyspark#pyspark.RDD
nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
nums.repartition(1).aggregate(100, lambda x, y: x + y, lambda x, y: x + y)
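Result: 236 (with a single partition, the zeroValue 100 is added once inside the partition and once more in the final combine: 36 + 100 + 100)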
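Because zeroValue seeds both the per-partition fold and the final merge, a non-neutral value such as 100 is counted once per partition plus once more. The sketch below imitates this in plain Python; `aggregate_sum` and its round-robin split are assumptions made for illustration, not how Spark actually places elements.

```python
from functools import reduce

def aggregate_sum(data, num_partitions, zero_value):
    """Sum `data` the way aggregate() would, using addition for both operators."""
    # Assumed round-robin split; for addition any split gives the same total.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    partials = [reduce(lambda acc, x: acc + x, part, zero_value) for part in partitions]
    return reduce(lambda a, b: a + b, partials, zero_value)

data = [12, 2, 6, 2, 12, 2]
for p in (1, 2, 3):
    # The total equals sum(data) + zero_value * (p + 1).
    print(p, aggregate_sum(data, p, 100))
```

With one partition this gives 36 + 2 * 100 = 236, and each extra partition adds another 100, which is why zeroValue should normally be the neutral element of the operators (0 for addition).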
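repartition() redistributes the data sensibly, so that the partitions are equal in size.
Why does data become unbalanced? Records are placed by a hash algorithm, and data cleaning also plays a part. (For example, with a remainder-of-3 rule, values whose remainder is 0, and in particular identical values, are all placed in the same partition.)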
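A minimal sketch of how hash placement can skew data, assuming 3 partitions and Python's built-in hash() (which returns the value itself for small non-negative integers). The function `hash_partition` is hypothetical and only mimics the idea of hash-based placement.

```python
def hash_partition(values, num_partitions):
    """Place each value in partition hash(value) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        partitions[hash(v) % num_partitions].append(v)
    return partitions

parts = hash_partition([12, 2, 6, 2, 12, 2], 3)
print(parts)  # [[12, 6, 12], [], [2, 2, 2]]
```

Identical values always share a partition, and in this data set one partition stays empty, which is the kind of imbalance described above.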
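RDD element retrieval operations
take(n): Return the first n elements of the RDD
top(num): Return the num largest elements of the RDD, in descending order
first(): Return the first element of the RDD
collect(): Return all elements of the RDD to the driver as a list
foreach(func): Apply the provided function to each element of the RDD
takeSample(withReplacement, num, [seed]): Return a fixed-size random sample of num elements from the RDD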
RDD retrieval example
nums = sc.parallelize([12, 2, 6, 2, 12, 2])
print(nums.top(3))
print(nums.take(2))
print(nums.first())
Output:
[12, 12, 6]
[12, 2]
12
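For a small data set, these results can be cross-checked with ordinary Python list operations. The mapping below is a rough local analogy rather than Spark code; the names `data`, `rng` and `sample` and the seed value 42 are illustrative assumptions.

```python
import heapq
import random

data = [12, 2, 6, 2, 12, 2]

print(heapq.nlargest(3, data))  # like top(3): [12, 12, 6]
print(data[:2])                 # like take(2): [12, 2]
print(data[0])                  # like first(): 12

# Like takeSample(False, 3, seed): three elements drawn without replacement.
rng = random.Random(42)
sample = rng.sample(data, 3)
print(sample)
```

Note that top() orders by value, while take() and first() simply follow the order in which the elements are stored.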
How much money did Tencent collect in total?
lines.map(lambda x: int(x.split(',')[2])).reduce(lambda x, y: x + y)
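The same computation can be tried locally on a plain list. The sample records below are invented for illustration and assume each line has the form region,name,amount, which is the layout the code in these notes implies.

```python
from functools import reduce

# Hypothetical input lines in the assumed "region,name,amount" layout.
lines = ["east,alice,30", "east,bob,20", "west,alice,50"]

amounts = map(lambda x: int(x.split(',')[2]), lines)  # parse the third field
total = reduce(lambda x, y: x + y, amounts)           # add up all amounts
print(total)  # 100
```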
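# How much did each person spend in each region in total? Sort by region descending, then name ascending.
rdd = lines.map(lambda x: ((x.split(',')[0], x.split(',')[1]), int(x.split(',')[2]))).reduceByKey(lambda x, y: x + y)
rdd = rdd.collect()
class Reversinator1(object):  # wraps a (region, name) key to change how it is compared
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0] == other.obj[0]:
            # same region: name ascending
            return self.obj[1] < other.obj[1]
        else:
            # different regions: region descending
            return self.obj[0] > other.obj[0]
rs = sorted(rdd, key=lambda x: Reversinator1(x[0]))
print(rs)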
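An alternative to a wrapper class is to rely on the stability of Python's sort: sort by the secondary key first, then by the primary key. The sketch below uses invented ((region, name), total) rows standing in for the collected result; it is one possible approach, not the method used in these notes.

```python
# Hypothetical result of reduceByKey(...).collect(): ((region, name), total).
rows = [(("east", "bob"), 20), (("west", "alice"), 50),
        (("east", "alice"), 30), (("west", "carol"), 10)]

# sorted() is stable, so the second sort keeps names in order within a region.
by_name = sorted(rows, key=lambda r: r[0][1])              # name ascending
rs = sorted(by_name, key=lambda r: r[0][0], reverse=True)  # region descending
print([key for key, _ in rs])
# [('west', 'alice'), ('west', 'carol'), ('east', 'alice'), ('east', 'bob')]
```

Two simple passes avoid defining a comparison class, at the cost of sorting the list twice.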
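# How much did each person spend in each region? Sort by region descending; within the same region, sort the amount spent ascending.
rdd = lines.map(lambda x: ((x.split(',')[0], x.split(',')[1]), int(x.split(',')[2]))).reduceByKey(lambda x, y: x + y)
rdd = rdd.collect()
class Reversinator1(object):  # wraps a ((region, name), total) record to change how it is compared
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0][0] == other.obj[0][0]:
            # same region: total ascending
            return self.obj[1] < other.obj[1]
        else:
            # different regions: region descending
            return self.obj[0][0] > other.obj[0][0]
rs = sorted(rdd, key=lambda x: Reversinator1(x))
print(rs)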
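The same ordering can also be written as an explicit comparison function passed through `functools.cmp_to_key`. This is a sketch on invented rows, offered as an alternative to the wrapper class; the function name `compare` is hypothetical.

```python
from functools import cmp_to_key

# Hypothetical result of reduceByKey(...).collect(): ((region, name), total).
rows = [(("east", "bob"), 20), (("west", "alice"), 50),
        (("east", "alice"), 30), (("west", "carol"), 10)]

def compare(a, b):
    """Region descending; within a region, total ascending."""
    if a[0][0] != b[0][0]:
        return -1 if a[0][0] > b[0][0] else 1
    return (a[1] > b[1]) - (a[1] < b[1])

rs = sorted(rows, key=cmp_to_key(compare))
print(rs)
# [(('west', 'carol'), 10), (('west', 'alice'), 50), (('east', 'bob'), 20), (('east', 'alice'), 30)]
```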
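# How much did each person spend in each region in total? Use a composite key.
rdd = lines.map(lambda x: ((x.split(',')[0], x.split(',')[1]), int(x.split(',')[2]))).reduceByKey(lambda x, y: x + y)
rdd = rdd.collect()
# Sort by region first, then by name within the same region, using Python's sorted()
rs = sorted(rdd)
print(rs)
# Descending order
rs = sorted(rdd, reverse=True)
print(rs)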
class Reversinator2(object):  # inverts the comparison of the wrapped value
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
# reverse=True sorts the region descending; the name is wrapped in Reversinator2, so it is reversed twice and ends up ascending
rs = sorted(rs, key=lambda x: (x[0][0], Reversinator2(x[0][1])), reverse=True)
print(rs)