py4j.protocol.Py4JJavaError: An error occurred while calling o90.save (official bug, not yet fixed; this write-up is still a work in progress)

Environment:

Ubuntu19.10

anaconda3-python3.6.10

scala 2.11.8

apache-hive-3.0.0-bin

hadoop-2.7.7

spark-2.3.1-bin-hadoop2.7

java version "1.8.0_131"

MySQL Server version: 8.0.19-0ubuntu0.19.10.3 (Ubuntu)

Driver: mysql-connector-java-8.0.20.jar

Driver link: https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.20

The code used is:

from pyspark.sql import SparkSession

def map_extract(element):
    # element is a (file_path, content) pair produced by wholeTextFiles();
    # the year is taken from characters [-8:-4] of the file name
    file_path, content = element
    year = file_path[-8:-4]
    return [(year, line) for line in content.split("\n") if line]


spark = SparkSession\
    .builder\
    .appName("PythonTest")\
    .getOrCreate()

# Sum the third CSV column per year: emit (year, count) pairs, then add the counts per key
res = spark.sparkContext.wholeTextFiles('hdfs://Desktop:9000/user/mercury/names',
                                        minPartitions=40) \
        .map(map_extract) \
        .flatMap(lambda x: x) \
        .map(lambda x: (x[0], int(x[1].split(',')[2]))) \
        .reduceByKey(lambda x, y: x + y)



df = res.toDF(["key", "num"])  # rename the columns to match those of the target MySQL table
df.printSchema()
df.show()

df.write.format("jdbc").options(
    url="jdbc:mysql://127.0.0.1:3306/leaf",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="spark",
    user="appleyuchi",
    password="appleyuchi").mode('append').save()

 

 

Submission methods (the bug is reproduced with both ① and ② below):

① pyspark --master yarn (then type the code interactively in the shell)

② spark-submit --master yarn --deploy-mode cluster source.py

③pyspark --master yarn --conf spark.executor.extraClassPath=/home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar

Method ③ likewise reports a similar error.

At present, whether pyspark runs interactively or a Python file is submitted with spark-submit, setting spark.jars in spark-defaults.conf does not work; the same error is raised either way.

 

The only reliable workaround is to set spark.driver.extraClassPath and spark.executor.extraClassPath and then run spark-submit in cluster mode; this avoids the bug.
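For reference, the working setup looks roughly like this (a sketch; the jar path is the one from method ③ above). Add the two lines to conf/spark-defaults.conf, then submit in cluster mode:

spark.driver.extraClassPath    /home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar
spark.executor.extraClassPath  /home/appleyuchi/bigdata/apache-hive-3.0.0-bin/lib/mysql-connector-java-8.0.20.jar

spark-submit --master yarn --deploy-mode cluster source.py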
 

 

This is a bug and has been reported upstream:
https://issues.apache.org/jira/browse/SPARK-31629

A similar bug also appeared back in Spark 1.4.

Until it is fixed, the only workaround is to avoid SQLContext and use HiveContext instead.
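A minimal sketch of the HiveContext-based write follows. The two sample rows are made up and stand in for the aggregated `res` RDD; everything else reuses the connection options from the failing script above. HiveContext is deprecated in Spark 2.x but is still available from the Python API:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext.getOrCreate()
hive_ctx = HiveContext(sc)  # build the DataFrame through HiveContext instead of SQLContext/SparkSession

# made-up (year, count) rows for illustration only
df = hive_ctx.createDataFrame([("1880", 90993), ("1881", 91954)], ["key", "num"])

df.write.format("jdbc").options(
    url="jdbc:mysql://127.0.0.1:3306/leaf",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="spark",
    user="appleyuchi",
    password="appleyuchi").mode('append').save()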

A hallmark of this bug is that it only shows up in the Python API.

 

#------------------------------------------------------------------------------------------------------------------------------------
 

In [1], the DataFrame written to MySQL is: DataFrame[id: bigint, value: bigint]

In the error above, the type of the DataFrame we tried to write to MySQL is: <class 'pyspark.sql.dataframe.DataFrame'>

 

Since writing fails, let's check whether we can at least read from MySQL, and what type the variable read back has:

$ pyspark --master yarn
Python 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 21:14:29) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.10 (default, Jan  7 2020 21:14:29)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import SQLContext
>>> session = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
>>> ctx = SQLContext(session.sparkContext)
>>> jdbcDf=ctx.read.format("jdbc").options(url="jdbc:mysql://localhost:3306/leaf",driver="com.mysql.jdbc.Driver",dbtable="(SELECT * FROM spark) tmp",user="appleyuchi",password="appleyuchi").load()
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
>>> type(jdbcDf)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> 

So with this driver MySQL is readable, and what is read back has type <class 'pyspark.sql.dataframe.DataFrame'>.

>>> dir(ctx)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_inferSchema', '_instantiatedContext', '_jsc', '_jsqlContext', '_jvm', '_sc', '_ssql_ctx', 'cacheTable', 'clearCache', 'createDataFrame', 'createExternalTable', 'dropTempTable', 'getConf', 'getOrCreate', 'newSession', 'range', 'read', 'readStream', 'registerDataFrameAsTable', 'registerFunction', 'registerJavaFunction', 'setConf', 'sparkSession', 'sql', 'streams', 'table', 'tableNames', 'tables', 'udf', 'uncacheTable']

What does this tell us?

In Spark's read and write operations against MySQL,

what is read back is: <class 'pyspark.sql.dataframe.DataFrame'> (and this type cannot be written back again),

while what is successfully written in [1] is: DataFrame[id: bigint, value: bigint]
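For completeness, this is roughly what the failed write-back looks like (a sketch reusing jdbcDf and the connection options from the shell session above; per the behaviour described at the top, it raises the same Py4JJavaError on save):

# jdbcDf was just read from MySQL, yet writing it straight back fails at o90.save
jdbcDf.write.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/leaf",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="spark",
    user="appleyuchi",
    password="appleyuchi").mode('append').save()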

 

Reference:

[1] https://zhuanlan.zhihu.com/p/136777424

 

 

 
