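Predicting water quality trends from monitoring data is important for the effective prevention and treatment of water pollution. Current water quality prediction methods fall into two categories. The first consists of numerical models built on the physical and chemical processes that pollutants undergo in the water environment, such as WASP, QUAL, and MIKE. The second consists of data-driven machine learning and deep learning methods, such as LSTM, AdaBoost, and random forests. This article implements a random forest algorithm on the Spark distributed computing framework to predict water quality.
1. Prepare the data
Upload the data to the HDFS distributed file system, then create an external table in Hive with the following statement: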
CREATE EXTERNAL TABLE wayeal_forecast.water (
  `time` STRING COMMENT 'from deserializer',
  `id` STRING COMMENT 'from deserializer',
  `name` STRING COMMENT 'from deserializer',
  `basin` STRING COMMENT 'from deserializer',
  `section` STRING COMMENT 'from deserializer',
  `ph` STRING COMMENT 'from deserializer',
  `ph_type` STRING COMMENT 'from deserializer',
  `do` STRING COMMENT 'from deserializer',
  `do_type` STRING COMMENT 'from deserializer',
  `nh3_n` STRING COMMENT 'from deserializer',
  `nh3_n_type` STRING COMMENT 'from deserializer',
  `codmn` STRING COMMENT 'from deserializer',
  `codmn_type` STRING COMMENT 'from deserializer',
  `c` STRING COMMENT 'from deserializer',
  `c_type` STRING COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ","
)
A sample of the data is shown below:
2. Model development
First, read the data from Hive:
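# HIVE_SQL and WATER_FACTOR are constants defined elsewhere in the project.
data = self.spark.sql(HIVE_SQL).select(WATER_FACTOR)
# Keep only the records whose id is '78' so that the later steps work on a single series.
data = data.filter(data['id'] == '78')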
Next, because the raw data is a time series, it must be converted into a supervised learning dataset:
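from pyspark.sql import Window
from pyspark.sql.functions import lag, monotonically_increasing_id

# Row index used to order the window; it reflects time only if the rows are already in time order.
data = data.withColumn("id", monotonically_increasing_id())
w = Window.orderBy("id")
for colName in SELECT_WATER_FACTOR:
    for i in range(1, n_hours + 1):
        # Add the value of colName from i rows earlier as a new feature column.
        data = data.withColumn("{}(t-{})".format(colName, i), lag(colName, i).over(w))
# The first n_hours rows have no complete history, so drop rows containing nulls.
data = data.na.drop()
data = data.drop("id")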
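To make the effect of the lag transformation concrete, here is a minimal pure-Python sketch of the same idea on a tiny in-memory series. The function name `series_to_supervised` and the pH readings are hypothetical and only serve as illustration; the Spark code above remains the actual implementation.

```python
def series_to_supervised(rows, columns, n_hours):
    """Attach lagged copies of `columns` to each row.

    rows: list of dicts already ordered by time (oldest first).
    columns: names of the fields to lag.
    n_hours: number of past steps to attach to every row.
    """
    out = []
    # Rows before index n_hours lack a full history and are skipped,
    # which corresponds to the leading rows removed by na.drop().
    for t in range(n_hours, len(rows)):
        new_row = dict(rows[t])
        for name in columns:
            for i in range(1, n_hours + 1):
                new_row["{}(t-{})".format(name, i)] = rows[t - i][name]
        out.append(new_row)
    return out


# Hypothetical hourly pH readings.
readings = [{"ph": 7.1}, {"ph": 7.3}, {"ph": 7.2}, {"ph": 7.4}]
supervised = series_to_supervised(readings, ["ph"], n_hours=2)
for row in supervised:
    print(row)
```

Each output row keeps the current value (the label) and gains one column per lag step, so a series of 4 readings with `n_hours=2` yields 2 usable training rows.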
Finally, wrap the whole workflow in a Pipeline, and tune the model with a grid search built on ParamGridBuilder and TrainValidationSplit:
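from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Hold out 30% of the rows for the final test.
(train_data, test_data) = data.randomSplit([0.7, 0.3])
data_col = data.columns
data_col.remove('time')
# Exclude the current-time factor columns from the features, leaving the lagged columns.
input_cols = [col for col in data_col if col not in SELECT_WATER_FACTOR]
vector_assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")
# Random forest regressor that predicts the current pH value.
rf_regressor = RandomForestRegressor()\
    .setFeaturesCol("featureVector")\
    .setLabelCol("ph")\
    .setPredictionCol("prediction")
# Candidate numbers of trees for the grid search.
param_grid = ParamGridBuilder()\
    .addGrid(rf_regressor.numTrees, [10, 50, 100, 150, 200, 500])\
    .build()
pipeline = Pipeline(stages=[vector_assembler, rf_regressor])
# model = pipeline.fit(train_data)
# predictions = model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="ph", predictionCol="prediction", metricName="rmse")
# Within train_data, 90% is used for fitting and 10% for validating each candidate.
validator = TrainValidationSplit()\
    .setEstimator(pipeline)\
    .setEstimatorParamMaps(param_grid)\
    .setEvaluator(evaluator)\
    .setTrainRatio(0.9)
validator_model = validator.fit(train_data)
best_model = validator_model.bestModel
# Evaluate the best pipeline on the held-out test set.
predictions = best_model.transform(test_data)
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
# Stage 1 of the fitted pipeline is the random forest model.
rf_model = best_model.stages[1]
print(rf_model)
predictions.show(truncate=False)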
The best model's results are as follows:
Root Mean Squared Error (RMSE) on test data = 0.202249
RandomForestRegressionModel (uid=RandomForestRegressor_8bde32f77a3e) with 150 trees
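The tuning procedure used above can be sketched in plain Python to show what a train-validation split search does: split the training data once, score every candidate hyperparameter by RMSE on the validation part, keep the candidate with the lowest error, and refit it on all of the training data. This is only an illustrative sketch under stated assumptions: the nearest-neighbour regressor `fit_knn`, the helper `train_validation_search`, and the data are hypothetical stand-ins, not the random forest or the dataset from this article.

```python
import math
import random


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def fit_knn(train, k):
    """Toy estimator: predict the mean label of the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def train_validation_search(data, grid, train_ratio, seed=0):
    """Pick the k with the lowest validation RMSE, then refit on all of `data`."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, valid = shuffled[:cut], shuffled[cut:]
    scores = {}
    for k in grid:
        predict = fit_knn(train, k)
        scores[k] = rmse([y for _, y in valid], [predict(x) for x, _ in valid])
    # Lower RMSE is better, so the best candidate is the minimum.
    best_k = min(scores, key=scores.get)
    return best_k, scores, fit_knn(data, best_k)


# Hypothetical (previous pH, current pH) pairs.
data = [(7.0 + 0.01 * i, 7.0 + 0.01 * i + 0.005) for i in range(50)]
best_k, scores, model = train_validation_search(data, grid=[1, 3, 5], train_ratio=0.9)
print(best_k, scores)
```

Because each candidate is scored on a single validation split rather than on several folds, this approach is cheaper than cross-validation but its scores depend more heavily on how the split happens to fall.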