Spark Machine Learning: Classification and Regression

This page covers algorithms for classification and regression. It also includes sections discussing specific classes of algorithms, such as linear methods, trees, and ensembles.

Contents

Classification
    Logistic regression
        Binomial logistic regression
        Multinomial logistic regression
    Decision tree classifier
    Random forest classifier
    Gradient-boosted tree classifier
    Multilayer perceptron classifier
    One-vs-Rest classifier (a.k.a. One-vs-All)
    Naive Bayes
Regression
    Linear regression
    Generalized linear regression
        Available families
    Decision tree regression
    Random forest regression
    Gradient-boosted tree regression
    Survival regression
    Isotonic regression
        Examples
Linear methods
Decision trees
    Inputs and Outputs
        Input Columns
        Output Columns
Tree Ensembles
    Random Forests
        Inputs and Outputs
            Input Columns
            Output Columns (Predictions)
    Gradient-Boosted Trees (GBTs)
        Inputs and Outputs
            Input Columns
            Output Columns (Predictions)

1 Classification

1.1 Logistic regression

Logistic regression is a popular method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of the outcomes. In spark.ml, logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression. Use the family parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.

Multinomial logistic regression can be used for binary classification by setting the family parameter to "multinomial". It will produce two sets of coefficients and two intercepts.

Note: when fitting a LogisticRegressionModel without intercept on a dataset with constant nonzero columns, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.

1.1.1 Binomial logistic regression

For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib.


The following example shows how to train binomial and multinomial logistic regression models for binary classification with elastic net regularization. elasticNetParam corresponds to α and regParam corresponds to λ.

from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

# Fit the model
mlrModel = mlr.fit(training)

# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))
Find full example code at "examples/src/main/python/ml/logistic_regression_with_elastic_net.py" in the Spark repo.

The spark.ml implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics which are stored as a DataFrame in BinaryLogisticRegressionSummary are annotated @transient and hence only available on the driver.

LogisticRegressionTrainingSummary provides a summary for a LogisticRegressionModel. Currently, only binary classification is supported; support for multiclass model summaries will be added in the future.
Continuing the earlier example:

from pyspark.ml.classification import LogisticRegression

# Extract the summary from the returned LogisticRegressionModel instance trained
# in the earlier example
trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
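
Besides the training summary, a fitted model can be used for prediction by calling transform(). Below is a minimal sketch, reusing the training DataFrame from the earlier example purely for illustration; transform() appends rawPrediction, probability and prediction columns by default.

# Apply the fitted binomial model; the probability and prediction columns
# are added by the transform() call.
lrModel.transform(training).select("probability", "prediction").show(5, truncate=False)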
1.1.2 Multinomial logistic regression

Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J, where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term, a length-K vector of intercepts is available.

The multinomial coefficients are available as coefficientMatrix and the intercepts are available as interceptVector.

The coefficients and intercept methods on a logistic regression model trained with the multinomial family are not supported. Use coefficientMatrix and interceptVector instead.

The conditional probabilities of the outcome classes k ∈ {1, 2, ..., K} are modeled using the softmax function:


$$ P(Y=k \mid \mathbf{X}, \boldsymbol{\beta}_k, \beta_{0k}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{X} + \beta_{0k'}}} $$


We minimize the weighted negative log-likelihood, using a multinomial response model, with an elastic-net penalty to control for overfitting:


$$ \min_{\beta, \beta_0} -\left[\sum_{i=1}^{L} w_i \cdot \log P(Y = y_i \mid \mathbf{x}_i)\right] + \lambda\left[\frac{1}{2}(1 - \alpha)\|\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1\right] $$


The following example shows how to train a multiclass logistic regression model with elastic net regularization.

from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for multinomial logistic regression
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

1.2 Decision tree classifier

Decision trees are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on decision trees below.


The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and for the categorical features, adding metadata to the DataFrame which the decision tree algorithm can recognize.

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)
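
The one-line summary above only reports the tree's size and depth. In Spark 2.0+ the fitted tree model also exposes a toDebugString property with the full set of learned split rules; a minimal sketch, assuming the treeModel obtained above:

# Print every split and leaf prediction of the learned tree.
print(treeModel.toDebugString)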

1.3 Random forest classifier

Random forests are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on random forests below.

Example

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only
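
Beyond the one-line summary, the ensemble can be inspected through its feature importances (available on pyspark.ml random forest models in Spark 2.0+); a minimal sketch, assuming the rfModel fitted above:

# Estimated importance of each feature, averaged over all trees in the
# ensemble and returned as a vector with one entry per feature.
print(rfModel.featureImportances)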
1.4 Gradient-boosted tree classifier

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. More information about the spark.ml implementation can be found in the section on GBTs below.

Example

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only
1.5 Multilayer perceptron classifier

The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by performing a linear combination of the inputs with the node's weights w and bias b and applying an activation function. This can be written in matrix form for an MLPC with K+1 layers as follows:

$$ \mathrm{y}(\mathbf{x}) = \mathrm{f}_K(\ldots \mathrm{f}_2(\mathbf{w}_2^T \mathrm{f}_1(\mathbf{w}_1^T \mathbf{x} + b_1) + b_2) \ldots + b_K) $$

Nodes in the intermediate layers use the sigmoid (logistic) function:

$$ \mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}} $$

Nodes in the output layer use the softmax function:

$$ \mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}} $$

The number of nodes N in the output layer corresponds to the number of classes.
MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.

Example

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_multiclass_classification_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]

# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# train the model
model = trainer.fit(train)

# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
1.6 One-vs-Rest classifier (a.k.a. One-vs-All)

OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All".
OneVsRest is implemented as an Estimator. For the base classifier it takes instances of Classifier and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes. Predictions are done by evaluating each binary classifier, and the index of the most confident classifier is output as the label.

The example below demonstrates how to load the Iris dataset, parse it as a DataFrame, and perform multiclass classification using OneVsRest. The test error is computed to measure the algorithm's accuracy.

from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# load data file.
inputData = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

# generate the train/test split.
(train, test) = inputData.randomSplit([0.8, 0.2])

# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)

# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)

# train the multiclass model.
ovrModel = ovr.fit(train)

# score the model on test data.
predictions = ovrModel.transform(test)

# obtain evaluator.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# compute the classification error on test data.
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
1.7 Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes. For more information, see the section on Naive Bayes in MLlib.

Example

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_libsvm_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

# select example rows to display.
predictions = model.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))
2 Regression

2.1 Linear regression

The interface for working with linear regression models and model summaries is similar to the logistic regression case.

Note: when fitting a LinearRegressionModel without intercept on a dataset with constant nonzero columns by the "l-bfgs" solver, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.

The following example demonstrates training an elastic net regularized linear regression model and extracting model summary statistics.

from pyspark.ml.regression import LinearRegression

# Load training data
training = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
2.2 Generalized linear regression

Contrasted with linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models where the response variable Yi follows some distribution from the exponential family of distributions. Spark's GeneralizedLinearRegression interface allows for flexible specification of GLMs which can be used for various types of prediction problems, including linear regression, Poisson regression, logistic regression, and others. Currently in spark.ml, only a subset of the exponential family distributions is supported.

Note: Spark currently only supports up to 4096 features through its GeneralizedLinearRegression interface, and will throw an exception if this constraint is exceeded. See the advanced section for more details. Still, for linear and logistic regression, models with an increased number of features can be trained using the LinearRegression and LogisticRegression estimators.

from pyspark.ml.regression import GeneralizedLinearRegression

# Load training data
dataset = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")

glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)

# Fit the model
model = glr.fit(dataset)

# Print the coefficients and intercept for generalized linear regression model
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

# Summarize the model over the training set and print out some metrics
summary = model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()
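
The example above uses the gaussian family with the identity link. GeneralizedLinearRegression also accepts other exponential-family distributions; in the Spark releases this guide is based on, these include binomial, poisson and gamma, each with its own set of link functions. Below is a minimal sketch of a Poisson regression; the small count dataset is purely illustrative (the Poisson family requires non-negative labels):

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

# Illustrative count data: (label, features), with non-negative labels.
counts = spark.createDataFrame([
    (1.0, Vectors.dense(0.0, 1.0)),
    (3.0, Vectors.dense(1.0, 2.0)),
    (5.0, Vectors.dense(2.0, 1.0)),
    (7.0, Vectors.dense(3.0, 3.0))], ["label", "features"])

# Poisson regression with the canonical log link.
poissonGlr = GeneralizedLinearRegression(family="poisson", link="log", maxIter=10, regParam=0.3)
poissonModel = poissonGlr.fit(counts)

print("Poisson coefficients: " + str(poissonModel.coefficients))
print("Poisson intercept: " + str(poissonModel.intercept))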
2.3 Decision tree regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
# summary only
print(treeModel)
2.4 Random forest regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only
2.5 Gradient-boosted tree regression

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# Chain indexer and GBT in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, gbt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

gbtModel = model.stages[1]
print(gbtModel)  # summary only
2.6 Survival regression

In spark.ml, we implement the Accelerated Failure Time (AFT) model, which is a parametric survival regression model for censored data. It describes a model for the log of the survival time, so it is often referred to as a log-linear model for survival analysis. Different from a Proportional Hazards model designed for the same purpose, the AFT model is easier to parallelize because each instance contributes to the objective function independently.


from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.3, 0.6]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")

model = aft.fit(training)

# Print the coefficients, intercept and scale parameter for AFT survival regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
print("Scale: " + str(model.scale))
model.transform(training).show(truncate=False)

2.7 Isotonic regression

from pyspark.ml.regression import IsotonicRegression

# Loads data.
dataset = spark.read.format("libsvm")\
    .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")

# Trains an isotonic regression model.
model = IsotonicRegression().fit(dataset)
print("Boundaries in increasing order: %s\n" % str(model.boundaries))
print("Predictions associated with the boundaries: %s\n" % str(model.predictions))

# Makes predictions.
model.transform(dataset).show()

