More examples: https://github.com/datadevsh/pyspark-api
1. Python environment
This covers Jupyter, the Python shell, and PyCharm.
1.1. Getting the sc object
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[3]").setAppName("als").set("spark.executor.memory", "10g")
sc = SparkContext.getOrCreate(conf)
sc.setLogLevel("ERROR")
# optional extras: .set('spark.driver.host','txy').set('spark.local.ip','txy')
# spark = SparkSession.builder.getOrCreate()
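A quick sanity check that the context is alive (not part of the original, just a minimal smoke test):

print(sc.version)                       # version of the running Spark context
print(sc.parallelize(range(5)).sum())   # 0+1+2+3+4 = 10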
1.2. Reading a file
lines = sc.textFile("D:/ML/python-design/ml-10M100K/ratings.dat")
1.3. Splitting strings
parts = lines.map(lambda row: row.split("::"))
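For reference, each line of the MovieLens ratings.dat file follows the UserID::MovieID::Rating::Timestamp layout, so every element of parts is a list of four strings:

# "1::122::5::838985046"  ->  ['1', '122', '5', '838985046']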
1.4. Creating a DataFrame / splitting the dataset
Prerequisite:
from pyspark.sql import Row
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2])))
(training, test) = ratingsRDD.randomSplit([0.8, 0.2])
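randomSplit reshuffles on every invocation; passing the optional seed argument makes the 80/20 split reproducible:

(training, test) = ratingsRDD.randomSplit([0.8, 0.2], seed=42)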
1.5. Model initialization and training
The train() call with rank/iterations/lambda_ belongs to the RDD-based MLlib API, so the import must come from pyspark.mllib rather than pyspark.ml; note that ALS.train expects each record ordered as (user, product, rating).
from pyspark.mllib.recommendation import ALS
model = ALS.train(training, rank=50, iterations=10, lambda_=0.01)
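A minimal evaluation sketch for the trained MLlib model, following the pattern in the Spark MLlib docs (assumes the Rows carry userId/movieId/rating fields as built above):

testdata = test.map(lambda r: (r.userId, r.movieId))
predictions = model.predictAll(testdata).map(lambda p: ((p.user, p.product), p.rating))
rates_and_preds = test.map(lambda r: ((r.userId, r.movieId), r.rating)).join(predictions)
mse = rates_and_preds.map(lambda rp: (rp[1][0] - rp[1][1]) ** 2).mean()
print("MSE = %.4f" % mse)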
1.6. Reading all files under a directory
import os

flag = True
if os.path.isdir(inputFile):           # the path is a directory
    files = os.listdir(inputFile)      # list all file names in the directory
    for file in files:                 # walk the directory
        if flag:
            lines = sc.textFile(inputFile + file)
            flag = False
        else:
            lines = sc.textFile(inputFile + file).union(lines)  # union all files together
else:
    lines = sc.textFile(inputFile)     # the path is a single file
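For what it's worth, sc.textFile already accepts a directory path, a glob pattern, or a comma-separated list of paths, so the loop above can usually be replaced by a single call:

lines = sc.textFile(inputFile)             # a directory path reads every file inside it
lines = sc.textFile(inputFile + "*.dat")   # glob patterns are also supported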
1.7. Iterating over a pyspark DataFrame
1.7.1. Plain iteration
for i in arr:   # e.g. arr = df.collect(), a Python list of Rows on the driver
    print(i)
1.7.2. Iterating over a ResultIterable
1.7.2.1. When the parent collection is a list
user_item_hist   # a Python list
[('uhf34sdcfe3', <pyspark.resultiterable.ResultIterable at 0x2d40515a438>),
 ('dsfcds2332f', <pyspark.resultiterable.ResultIterable at 0x2d40515a470>)]
Iterate over the values inside each ResultIterable:
for x in user_item_hist:
    print(x[0], list(x[1]))
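For context, a list shaped like this usually comes from groupByKey() followed by collect(); a minimal sketch with made-up (user, item) pairs:

pairs = sc.parallelize([('uhf34sdcfe3', 'item1'), ('uhf34sdcfe3', 'item2'),
                        ('dsfcds2332f', 'item3')])
user_item_hist = pairs.groupByKey().collect()   # [(key, ResultIterable), ...]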
1.7.2.2. When the parent collection is a PythonRDD
1. Write a file user.py:
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
2. Import it:
from user import User

def create_user(line):
    user = User(line)
    return user
3. Map and iterate:
for user in user_item_pairs.map(lambda entry: create_user(entry)).collect():
    print(user.user_id, list(user.location))
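Since collect() materializes the whole RDD in driver memory, toLocalIterator() is a gentler alternative for large data (a sketch using the same hypothetical user_item_pairs as above):

for user in user_item_pairs.map(create_user).toLocalIterator():
    print(user.user_id, list(user.location))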
2. pyspark shell environment
As you can see, the spark object here comes from SparkSession, so its usage differs somewhat from the sc object that comes from SparkContext.
2.1. Getting the spark object
The pyspark shell already provides a ready-made "spark" session object, so you can work with it directly.
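If you also need the lower-level SparkContext, it is available on the session:

sc = spark.sparkContext   # the SparkContext behind the built-in session
sc.setLogLevel("ERROR")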
2.2. Reading a file
lines = spark.read.text("ratings.dat").rdd
2.3. Splitting strings
parts = lines.map(lambda row: row.value.split("\001"))  # "\001" is Ctrl-A (0x01), Hive's default field delimiter; ratings.dat from section 1 uses "::"
2.4. Creating a DataFrame / splitting the dataset
Prerequisite:
from pyspark.sql import Row
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2])))
ratings = spark.createDataFrame(ratingsRDD)
(training, test) = ratings.randomSplit([0.8, 0.2])
2.5. Model initialization and training
from pyspark.ml.recommendation import ALS
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)
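A minimal follow-up sketch, after the pattern in the Spark ML docs, to score the test split and produce top-10 recommendations (RegressionEvaluator and recommendForAllUsers are standard pyspark.ml APIs):

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE = %.4f" % evaluator.evaluate(predictions))

userRecs = model.recommendForAllUsers(10)   # top-10 movies per user
userRecs.show(5, truncate=False)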