Environment:
Ubuntu 19.10
anaconda3 (Python 3.6.10)
Scala 2.11.8
apache-hive-3.0.0-bin
hadoop-2.7.7
spark-2.3.1-bin-hadoop2.7
java version "1.8.0_131"
MySQL Server version: 8.0.19-0ubuntu0.19.10.3 (Ubuntu)
Driver: mysql-connector-java-8.0.20.jar
[Driver link|https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.20]
The code used is:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext

def map_extract(element):
    # wholeTextFiles yields (file_path, content); the year is encoded
    # in the file name, e.g. ".../yob1990.txt" -> "1990"
    file_path, content = element
    year = file_path[-8:-4]
    return [(year, i) for i in content.split("\n") if i]

spark = SparkSession\
    .builder\
    .appName("PythonTest")\
    .getOrCreate()

res = spark.sparkContext.wholeTextFiles('hdfs://Desktop:9000/user/mercury/names',
                                        minPartitions=40) \
    .map(map_extract) \
    .flatMap(lambda x: x) \
    .map(lambda x: (x[0], int(x[1].split(',')[2]))) \
    .reduceByKey(lambda x, y: x + y)

df = res.toDF(["key", "num"])  # rename the columns to match the target MySQL table
# print(dir(df))
df.printSchema()
print(df.show())  # show() prints the rows itself and returns None
df.printSchema()
df.write.format("jdbc").options(
    url="jdbc:mysql://127.0.0.1:3306/leaf",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="spark",
    user="appleyuchi",
    password="appleyuchi").mode('append').save()
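The per-record parsing and aggregation in the job above can be checked without a cluster. Below is a plain-Python sketch of the same logic (the file paths and CSV rows are made-up sample data, not from the actual HDFS dataset):

```python
from collections import defaultdict

def map_extract(element):
    # Same logic as in the Spark job: the year is encoded in the file
    # name, e.g. ".../yob1990.txt" -> "1990"; emit one (year, line)
    # pair per non-empty line.
    file_path, content = element
    year = file_path[-8:-4]
    return [(year, i) for i in content.split("\n") if i]

# Made-up sample input mimicking wholeTextFiles output: (path, content) pairs.
files = [
    ("hdfs://Desktop:9000/user/mercury/names/yob1990.txt", "Mary,F,100\nJohn,M,50"),
    ("hdfs://Desktop:9000/user/mercury/names/yob1991.txt", "Mary,F,70"),
]

# The flatMap + map + reduceByKey chain, written with a plain dict
# instead of an RDD: sum the third CSV column per year.
totals = defaultdict(int)
for pair in files:
    for year, line in map_extract(pair):
        totals[year] += int(line.split(",")[2])

print(dict(totals))  # {'1990': 150, '1991': 70}
```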
Submission methods (all of the following reproduce the bug):
① pyspark --master yarn (then enter the code interactively in the shell)
② spark-submit --master yarn --deploy-mode cluster 源碼.py
③ pyspark --master yarn --conf spark.executor.extraClassPath=/home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar
(this last one reports a similar error as well)
Currently, whether PySpark runs interactively or a Python file is submitted via spark-submit, setting spark.jars in spark-defaults.conf does not work: both report the same error.
The only approach that works is to set spark.driver.extraClassPath and spark.executor.extraClassPath and then submit with spark-submit in cluster mode; this runs without the bug.
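The working workaround can be written down as a spark-defaults.conf fragment. This is a sketch using the jar path from the environment listed at the top; adjust the path to your own installation:

```
spark.driver.extraClassPath   /home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar
spark.executor.extraClassPath /home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar
```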
This is a bug, and it has been reported on the official issue tracker:
https://issues.apache.org/jira/browse/SPARK-31629
A similar bug already appeared back in Spark 1.4.
Before it was fixed, the only workaround was to use HiveContext instead of SQLContext.
A common trait of these bugs is that they all show up in the Python API.
#------------------------------------------------------------------------------------------------------------------------------------
In [1], the DataFrame written to MySQL is: DataFrame[id: bigint, value: bigint]
In the error above, the DataFrame we tried to write to MySQL is: <class 'pyspark.sql.dataframe.DataFrame'>
Since writing fails, let's check whether we can at least read from MySQL, and what type the result is:
$ pyspark --master yarn
Python 3.6.10 |Anaconda, Inc.| (default, Jan 7 2020, 21:14:29)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Python version 3.6.10 (default, Jan 7 2020 21:14:29)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import SQLContext
>>> sc = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
>>> ctx = SQLContext(sc)
>>> jdbcDf=ctx.read.format("jdbc").options(url="jdbc:mysql://localhost:3306/leaf",driver="com.mysql.jdbc.Driver",dbtable="(SELECT * FROM spark) tmp",user="appleyuchi",password="appleyuchi").load()
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
>>> type(jdbcDf)
<class 'pyspark.sql.dataframe.DataFrame'>
>>>
So with this driver, MySQL is readable, and the type read back is <class 'pyspark.sql.dataframe.DataFrame'>.
>>> dir(ctx)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_inferSchema', '_instantiatedContext', '_jsc', '_jsqlContext', '_jvm', '_sc', '_ssql_ctx', 'cacheTable', 'clearCache', 'createDataFrame', 'createExternalTable', 'dropTempTable', 'getConf', 'getOrCreate', 'newSession', 'range', 'read', 'readStream', 'registerDataFrameAsTable', 'registerFunction', 'registerJavaFunction', 'setConf', 'sparkSession', 'sql', 'streams', 'table', 'tableNames', 'tables', 'udf', 'uncacheTable']
What does this tell us?
In Spark's read/write operations against MySQL:
what is read back is: <class 'pyspark.sql.dataframe.DataFrame'> (and this type cannot be written back)
what is successfully written in [1] is: DataFrame[id: bigint, value: bigint]
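One caveat on this comparison (my own note, not from [1]): DataFrame[id: bigint, value: bigint] is what a PySpark DataFrame's repr() looks like, while <class 'pyspark.sql.dataframe.DataFrame'> is what type() prints, so the two displays can describe the very same object. The difference can be sketched with a hypothetical toy class standing in for the real DataFrame:

```python
class DataFrameLike:
    """Stand-in for pyspark.sql.dataframe.DataFrame (hypothetical toy class)."""
    def __init__(self, schema):
        self.schema = schema
    def __repr__(self):
        # PySpark's DataFrame prints its schema in this style.
        return "DataFrame[%s]" % self.schema

df = DataFrameLike("id: bigint, value: bigint")
print(repr(df))  # DataFrame[id: bigint, value: bigint]
print(type(df))  # <class '__main__.DataFrameLike'>
```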
Reference:
[1] https://zhuanlan.zhihu.com/p/136777424