Processing MySQL Data with PySpark SQL

Connecting PySpark to MySQL

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F


sc = SparkContext("local", appName="mysqltest")
sqlContext = SQLContext(sc)
# Requires the MySQL Connector/J JAR on Spark's classpath.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/mydata?user=root&password=mysql&"
        "useUnicode=true&characterEncoding=utf-8&useJDBCCompliantTimezoneShift=true&"
        "useLegacyDatetimeCode=false&serverTimezone=UTC", dbtable="detail_data").load()
df.show(n=5)
# Ends this standalone example; skip this call if you continue with the examples below.
sc.stop()
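
The connection string above packs every setting into one long literal, which makes typos such as stray spaces easy to miss. As a minimal sketch, the URL could instead be assembled from its parts. The helper names build_jdbc_url and read_table are hypothetical and not part of PySpark. The sketch also assumes the credentials are passed through the JDBC reader's user and password options rather than in the URL.

```python
from urllib.parse import urlencode


def build_jdbc_url(host, port, database, params):
    """Assemble a MySQL JDBC URL from its parts (hypothetical helper)."""
    query = urlencode(params)  # joins key=value pairs with "&"
    return f"jdbc:mysql://{host}:{port}/{database}?{query}"


def read_table(spark, url, table, user, password):
    """Load one table over JDBC.

    `spark` is any object exposing `.read`, such as the sqlContext above.
    """
    return spark.read.format("jdbc").options(
        url=url, dbtable=table, user=user, password=password).load()


url = build_jdbc_url("localhost", 3306, "mydata", {
    "useUnicode": "true",
    "characterEncoding": "utf-8",
    "serverTimezone": "UTC",
})
print(url)
```

With a live context, the table would then be loaded with read_table(sqlContext, url, "detail_data", "root", "mysql").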

Common PySpark SQL Syntax

Some PySpark SQL syntax closely resembles pandas.

Printing the schema

df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- 省份: string (nullable = true)

Previewing the table

df.show(n=5)
# +----+------+------+------+------+
# |  id|  省份|  城市|  區縣|  區域|
# +----+------+------+------+------+
# |2557|廣東省|深圳市|羅湖區|春風路|
# ...

Counting rows

print(df.count())
# 47104

Listing column names and data types

print(df.columns)
# ['id', '省份', '城市', '區縣', '區域', '小區', '源地址',...
print(df.dtypes)
# [('id', 'int'), ('省份', 'string'), ('城市', 'string'),...
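
Because df.dtypes is a plain list of (name, type) tuples, it can be filtered with ordinary Python. The sketch below uses a hard-coded sample of the list shown above instead of a live DataFrame. The variable names are illustrative only.

```python
# Sample of the (name, type) pairs that df.dtypes returns above.
dtypes = [('id', 'int'), ('省份', 'string'), ('城市', 'string')]

# Keep only the names of string-typed columns.
string_columns = [name for name, dtype in dtypes if dtype == 'string']
print(string_columns)
```

With a real DataFrame, replace the sample list with df.dtypes and pass the result to df.select(*string_columns).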

Selecting columns

df.select('城市', '區縣', '區域', '小區').show()
# +------+------+------+--------------+
# |  城市|  區縣|  區域|          小區|
# +------+------+------+--------------+
# |深圳市|羅湖區|春風路|      凱悅華庭|
# |深圳市|羅湖區|春風路|      置地逸軒|
# ...

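Renaming selected columns

There are two ways to refer to a specific column:

If the column name is plain English, access it as an attribute of df with dot notation.
If the column name is not English, index df with square brackets, which works for any column name.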

df.select(df.id.alias('id_value'), '小區').show()
# +--------+--------------+
# |id_value|          小區|
# +--------+--------------+
# |    2557|      凱悅華庭|
# |    2558|      置地逸軒|
# ...

df.select(df["城市"].alias('city'), '小區').show()
# +------+--------------+
# |  city|          小區|
# +------+--------------+
# |深圳市|      凱悅華庭|
# |深圳市|      置地逸軒|
# ...

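Filtering by condition

Define the condition inside filter. To combine several conditions, use the following operators and wrap each condition in parentheses, since these operators bind more tightly than comparisons such as ==:

&   means and
|   means or
~   means not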

df.select('城市', '區縣', '區域', '小區').filter(df["小區"] == '凱悅華庭').show()
# +------+------+------+--------+
# |  城市|  區縣|  區域|    小區|
# +------+------+------+--------+
# |深圳市|羅湖區|春風路|凱悅華庭|
# ...

df.select('城市', '區縣', '區域', '小區').filter((df["城市"] == '深圳市') & (df["區縣"] == '南山區')).show()
# +------+------+------+----------------+
# |  城市|  區縣|  區域|            小區|
# +------+------+------+----------------+
# |深圳市|南山區|白石洲|中海深圳灣畔花園|
# |深圳市|南山區|白石洲|    僑城豪苑二期|
# ...

# A SQL condition string can also be passed directly to filter
df.select('id', '城市', '區縣', '區域', '小區').filter("id = 5000").show()
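
The multi-condition filter above wraps each comparison in parentheses. The reason is Python operator precedence, which this small sketch illustrates with plain integers instead of Spark columns. With Column objects, the unparenthesized form typically raises an error rather than silently returning a wrong result.

```python
a, b = 5, 3

# With parentheses, each comparison is evaluated first, then combined.
with_parens = (a == 5) & (b == 3)

# Without parentheses, & binds tighter than ==, so Python reads this as
# the chained comparison a == (5 & b) == 3, which is a different test.
without_parens = a == 5 & b == 3

print(with_parens)
print(without_parens)
```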

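Creating new columns

New columns can be built in the two ways shown below. pyspark.sql.functions provides many SQL functions; here we use lit from that module to create a constant-valued column.
Note that adding 1 to a string column such as 城市 yields null, because the string cannot be cast to a number.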

import pyspark.sql.functions as F

df.select(df["城市"] + 1, '城市', '小區').show()
# +----------+------+--------------+
# |(城市 + 1)|  城市|          小區|
# +----------+------+--------------+
# |      null|深圳市|      凱悅華庭|

df.select(F.lit("test").alias('城市plus'), '城市', '小區').show()
# +--------+------+--------------+
# |城市plus|  城市|          小區|
# +--------+------+--------------+
# |    test|深圳市|      凱悅華庭|

Adding rows

# Take a single row
df2 = df.limit(1)

# Append it to the original DataFrame (union is the same operation in Spark 2.0+)
print(df.count())  # 47104
print(df.unionAll(df2).count())  # 47105

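Removing duplicate records

The union above added one duplicate row, bringing the count to 47105; after removing duplicates the count is 47104 again.
There is a second issue: in the raw data the id values are unique, but the other fields may repeat.
To ignore id when deduplicating, pass the columns to compare, as in the second call below.

# Drop fully duplicated rows
print(df.unionAll(df2).drop_duplicates().count())  # 47104

# Drop rows duplicated on the listed columns only
print(df.drop_duplicates(['省份', '城市', '區縣', '區域', '小區']).count())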
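
Typing out the comparison columns by hand gets tedious on wide tables. One possible alternative, sketched here, derives the list by excluding id from the full column list. The helper columns_except is hypothetical. Note also that comparing on every column except id is an assumption that differs from the five-column subset used above.

```python
def columns_except(columns, excluded):
    """Return the column names that are not in `excluded`, keeping order."""
    excluded = set(excluded)
    return [name for name in columns if name not in excluded]


# Sample of df.columns from the examples above.
columns = ['id', '省份', '城市', '區縣', '區域', '小區']
key_columns = columns_except(columns, ['id'])
print(key_columns)
```

With a real DataFrame, the call would be df.drop_duplicates(columns_except(df.columns, ['id'])).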

Dropping columns

# Drop a column
print(df.drop('id').columns)
# ['省份', '城市', '區縣', '區域', '小區', '源地址',...

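Dropping rows with missing values

# Drop rows that have a missing value in any column
print(df.dropna().count())  # 47092
# Drop rows that have a missing value in the listed columns
print(df.dropna(subset=['省份', '城市']).count())  # 47104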

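Filling missing values

Pass a dict that maps each column to its fill value. fillna returns a new DataFrame, so display it with show().

df.fillna({'省份': "廣東省", '城市': '深圳市'}).show(n=5)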

Grouped counts and aggregation

# Count rows per group
df_g1 = df.groupby("區縣").count()
df_g1.show()
# +--------+-----+
# |    區縣|count|
# +--------+-----+
# |  龍華區| 4217|
# ...

# Aggregate per group
df.groupby('區縣').agg(F.max(df['總價'])).show()
# +--------+----------+
# |    區縣| max(總價)|
# +--------+----------+
# |  龍華區|5200.00000|
# |  福田區|8300.00000|
# |  羅湖區|7000.00000|
# |  坪山區|1588.00000|
# |  南山區|9800.00000|
# |  龍崗區|4000.00000|
# |  鹽田區|5500.00000|
# |  光明區|5200.00000|
# |大鵬新區|3500.00000|
# |  寶安區|8800.00000|
# +--------+----------+

Aggregate functions

These functions are likewise provided by pyspark.sql.functions.

# Aggregate functions
df.select(F.max(df["總價"])).show()  # maximum
# +----------+
# | max(總價)|
# +----------+
# |9800.00000|
# +----------+

df.select(F.min(df["總價"])).show()  # minimum
# +---------+
# |min(總價)|
# +---------+
# |  1.10000|
# +---------+

df.select(F.avg(df["總價"])).show()  # average
# +-------------+
# |    avg(總價)|
# +-------------+
# |577.736916000|
# +-------------+

df.select(F.countDistinct(df["總價"])).show()  # count of distinct values
# +--------------------+
# |count(DISTINCT 總價)|
# +--------------------+
# |                1219|
# +--------------------+

df.select(F.count(df["總價"])).show()  # count of non-null values
# +-----------+
# |count(總價)|
# +-----------+
# |      47104|
# +-----------+

Other functions:

# 'lit': 'Creates a :class:`Column` of literal value.',
# 'col': 'Returns a :class:`Column` based on the given column name.',
# 'column': 'Returns a :class:`Column` based on the given column name.',
# 'asc': 'Returns a sort expression based on the ascending order of the given column name.',
# 'desc': 'Returns a sort expression based on the descending order of the given column name.',
#
# 'upper': 'Converts a string expression to upper case.',
# 'lower': 'Converts a string expression to lower case.',
# 'sqrt': 'Computes the square root of the specified float value.',
# 'abs': 'Computes the absolute value.',
#
# 'max': 'Aggregate function: returns the maximum value of the expression in a group.',
# 'min': 'Aggregate function: returns the minimum value of the expression in a group.',
# 'first': 'Aggregate function: returns the first value in a group.',
# 'last': 'Aggregate function: returns the last value in a group.',
# 'count': 'Aggregate function: returns the number of items in a group.',
# 'sum': 'Aggregate function: returns the sum of all values in the expression.',
# 'avg': 'Aggregate function: returns the average of the values in a group.',
# 'mean': 'Aggregate function: returns the average of the values in a group.',
# 'sumDistinct': 'Aggregate function: returns the sum of distinct values in the expression.',

Descriptive statistics

df.describe("總價").show()
# +-------+-----------------+
# |summary|             總價|
# +-------+-----------------+
# |  count|            47104|
# |   mean|    577.736916000|
# | stddev|544.7605196104298|
# |    min|          1.10000|
# |    max|       9800.00000|
# +-------+-----------------+
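
To make the summary rows concrete, the sketch below reproduces the same five statistics for a small hand-made list in plain Python. It assumes that the stddev row is the sample standard deviation (divisor n - 1). The function name summarize and the sample values are illustrative and are not taken from the dataset.

```python
import statistics


def summarize(values):
    """Compute the five statistics that describe() reports for one column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for name, value in summary.items():
    print(name, value)
```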

Finally, remember to stop the SparkContext.

sc.stop()
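
A forgotten stop() is most likely when a job raises an exception partway through. One way to guard against that, sketched here, is to wrap the work in try/finally. The helper run_and_stop is hypothetical, and FakeContext is only a stand-in so that the sketch runs without Spark.

```python
def run_and_stop(context, job):
    """Run job(context) and always call context.stop(), even on error."""
    try:
        return job(context)
    finally:
        context.stop()


class FakeContext:
    """Stand-in for a SparkContext, used only to demonstrate the helper."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


ctx = FakeContext()
result = run_and_stop(ctx, lambda c: "done")
print(result, ctx.stopped)
```

With a real context, the call would look like run_and_stop(sc, lambda c: df.count()).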
