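Problem:
This exercise asks you to generate a dataset, train three different machine learning algorithms on it, and use the results to evaluate the performance of the three algorithms (according to the given criteria). The implementation follows the suggested steps one by one.
1. Generate a synthetic dataset
Generate the dataset according to the suggested parameters.
Reference: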
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
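# 1. Create a classification dataset (n_samples >= 1000, n_features >= 10)
from sklearn import datasets
# make_classification returns a tuple (X, y): dataset[0] is the feature matrix, dataset[1] the labels
dataset = datasets.make_classification(n_samples=1000, n_features=10)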
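
As a minimal sketch of what make_classification returns, the snippet below unpacks the tuple and checks the shapes. The random_state=0 seed and the names X and y are assumptions added here for reproducibility and readability; the call above does not set a seed.

```python
import numpy as np
from sklearn import datasets

# random_state=0 is an assumption for reproducibility, not part of the original call
X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)

print(X.shape)       # feature matrix: (1000, 10)
print(y.shape)       # label vector: (1000,)
print(np.unique(y))  # two classes by default: [0 1]
```

Without a fixed seed every run produces a different dataset, so the scores reported later will vary from run to run.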
PS: loading a built-in dataset:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
# load a default dataset
from sklearn import datasets
iris = datasets.load_iris()
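
For comparison with the generated data, here is a short sketch of what the loaded iris object contains. The variable names X_iris and y_iris are chosen here for illustration only.

```python
from sklearn import datasets

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target  # features and integer class labels

print(X_iris.shape)                 # (150, 4)
print(iris.target_names.tolist())   # ['setosa', 'versicolor', 'virginica']
```

Note that iris has three classes, so binary metrics such as the default f1_score would need an averaging option if this dataset were used instead of the generated one.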
2. Split the dataset with 10-fold cross validation
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
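# 2. Split the dataset using 10-fold cross validation
# KFold lives in sklearn.model_selection (scikit-learn >= 0.18)
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
# kf.split yields one (train indices, test indices) pair per fold
for train_index, test_index in kf.split(dataset[0]):
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]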
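
To make the split concrete, here is a small sketch using the KFold class from sklearn.model_selection, the module documented at the reference link above. The stand-in array X and the random_state=0 seed are assumptions for illustration; the point is only to show the fold sizes and that each sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # stand-in data: 1000 samples, 1 feature

# random_state=0 is an assumption so the shuffle is repeatable
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
seen_test = []
for train_index, test_index in kf.split(X):
    fold_sizes.append((len(train_index), len(test_index)))
    seen_test.extend(test_index.tolist())

print(fold_sizes[0])                           # (900, 100)
print(sorted(seen_test) == list(range(1000)))  # True: every sample is tested once
```

With 1000 samples and 10 folds, each iteration trains on 900 samples and tests on the remaining 100.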
3. Train with three algorithms
a. GaussianNB
PS: Naive Bayes:
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
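# 3. Train the algorithms and evaluate the cross-validated performance
# GaussianNB
from sklearn.naive_bayes import GaussianNB
GaussianNB_clf = GaussianNB()
GaussianNB_clf.fit(X_train, y_train)  # fit on the training part of the fold
GaussianNB_pred = GaussianNB_clf.predict(X_test)  # predicted labels for the test part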
b. SVM:
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
# SVM
from sklearn.svm import SVC
# RBF-kernel SVM with penalty C=0.1 and kernel coefficient gamma=0.1
SVC_clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
SVC_clf.fit(X_train, y_train)
SVC_pred = SVC_clf.predict(X_test)
c. Random Forest Classifier:
PS: Random Forest:
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# Random Forest
from sklearn.ensemble import RandomForestClassifier
# a small forest of 6 trees
Random_Forest_clf = RandomForestClassifier(n_estimators=6)
Random_Forest_clf.fit(X_train, y_train)
Random_Forest_pred = Random_Forest_clf.predict(X_test)
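
Read top to bottom, the snippets above fit each model once on whatever X_train and X_test hold after the split loop, which would be the last fold only. The sketch below assumes per-fold training is what the exercise intends and runs all three classifiers inside the loop, collecting one accuracy value per fold. The make_models helper, the fold_accuracy dictionary, and the random_state seeds are hypothetical names and settings introduced here, not part of the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# random_state values are assumptions added for reproducibility
X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)
kf = KFold(n_splits=10, shuffle=True, random_state=0)


def make_models():
    """Return fresh, unfitted classifiers with the parameters used in this post."""
    return {
        "GaussianNB": GaussianNB(),
        "SVC": SVC(C=1e-01, kernel='rbf', gamma=0.1),
        "RandomForest": RandomForestClassifier(n_estimators=6, random_state=0),
    }


fold_accuracy = {name: [] for name in make_models()}
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    for name, clf in make_models().items():
        clf.fit(X_train, y_train)
        # clf.score returns the mean accuracy on the held-out fold
        fold_accuracy[name].append(clf.score(X_test, y_test))

for name, scores in fold_accuracy.items():
    print(name, "mean accuracy over", len(scores), "folds:", np.mean(scores))
```

Averaging over the ten folds gives a steadier basis for comparing the algorithms than the scores of a single fold.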
4. Performance evaluation:
Accuracy, F1-score, AUC ROC:
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
# 4. Evaluate the cross-validated performance
from sklearn import metrics
# GaussianNB
GaussianNB_accuracy_score = metrics.accuracy_score(y_test, GaussianNB_pred)
GaussianNB_f1_score = metrics.f1_score(y_test, GaussianNB_pred)
GaussianNB_roc_auc_score = metrics.roc_auc_score(y_test, GaussianNB_pred)
print("GaussianNB:")
print(" Accuracy: ", GaussianNB_accuracy_score)
print(" F1_score: ", GaussianNB_f1_score)
print(" AUC ROC: ", GaussianNB_roc_auc_score)
# SVC
SVC_accuracy_score = metrics.accuracy_score(y_test, SVC_pred)
SVC_f1_score = metrics.f1_score(y_test, SVC_pred)
SVC_roc_auc_score = metrics.roc_auc_score(y_test, SVC_pred)
print("\nSVC:")
print(" Accuracy: ", SVC_accuracy_score)
print(" F1_score: ", SVC_f1_score)
print(" AUC ROC: ", SVC_roc_auc_score)
# Random_Forest
Random_Forest_accuracy_score = metrics.accuracy_score(y_test, Random_Forest_pred)
Random_Forest_f1_score = metrics.f1_score(y_test, Random_Forest_pred)
Random_Forest_roc_auc_score = metrics.roc_auc_score(y_test, Random_Forest_pred)
print("\nRandomForestClassifier:")
print(" Accuracy: ", Random_Forest_accuracy_score)
print(" F1_score: ", Random_Forest_f1_score)
print(" AUC ROC: ", Random_Forest_roc_auc_score)
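
One detail worth checking in the AUC lines above: roc_auc_score is given hard 0/1 predictions. That call is valid, but for a binary problem it reduces to the balanced accuracy, because the ROC curve then has a single threshold. The sketch below, which is an illustration rather than the exercise's required method, compares that value with the AUC computed from predicted probabilities. The single train_test_split hold-out and the random_state seeds are assumptions made to keep the example short.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# random_state values are assumptions added for reproducibility
X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
pred = clf.predict(X_test)                  # hard 0/1 labels
proba = clf.predict_proba(X_test)[:, 1]     # probability of class 1

auc_from_labels = metrics.roc_auc_score(y_test, pred)
auc_from_scores = metrics.roc_auc_score(y_test, proba)
balanced_acc = metrics.balanced_accuracy_score(y_test, pred)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", balanced_acc)   # same value as the line above
print("AUC from probabilities:", auc_from_scores)
```

RandomForestClassifier also provides predict_proba. For SVC, decision_function returns a continuous score that roc_auc_score accepts directly; predict_proba is only available when the model is built with probability=True.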
Results:
Comparing the three sets of scores shows that the third algorithm, Random Forest Classifier, performed best.
Documentation and Reference
Documentation:
http://scikit-learn.org/stable/documentation.html
Reference Manual with class descriptions:
http://scikit-learn.org/stable/modules/classes.html