On Deploying Machine Learning Models

As machine learning is more and more widely adopted, a growing number of tools now support deploying trained models to production efficiently. In this article we look at how different tools solve this problem.

The figure above shows the typical process a data science project goes through: starting from data collection, it moves through data analysis, data transformation, data validation, data splitting, training, model creation, model validation, training at scale, model publishing, and finally serving, monitoring, and logging. Machine learning tools such as Scikit-Learn, Spark, Tensorflow, MXNet, and PyTorch give data scientists many different choices, and at the same time bring different challenges to model deployment.

Let's first take a quick look at how machine learning models are deployed and what challenges arise.

Model Persistence

Model deployment generally means persisting the trained model, then having a server load it and expose a REST (or other) service interface. Taking RandomForestClassification as an example, let's see how Sklearn, Spark, and Tensorflow persist their models.

Sklearn

We use the Iris dataset and classify it with RandomForestClassifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.externals import joblib

data = load_iris()

X, y = data["data"], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print(clf.feature_importances_)

print(classification_report(y_test, clf.predict(
    X_test), target_names=data["target_names"]))

joblib.dump(clf, 'classification.pkl')

The training code is shown above; the model export happens in the last line, joblib.dump() (see the reference). Sklearn's model export is essentially based on Python's Pickle mechanism: the trained estimator object is serialized and written to a file.

Loading the model back is just as simple: call joblib.load().

from sklearn.externals import joblib

from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

data = load_iris()

X, y = data["data"], data["target"]

clf = joblib.load('classification.pkl')

print(clf.feature_importances_)
print(classification_report(y, clf.predict(
    X), target_names=data["target_names"]))

Sklearn wraps and optimizes Pickle a bit, but that does not remove Pickle's inherent limitations, for example (a small mitigation sketch follows this list):

  • Version compatibility: serialized files produced with different versions of Python, Pickle, or Sklearn are not compatible with each other
  • Security: malicious code can be injected into a serialized file
  • Extensibility: a custom extension class you wrote may not serialize, nor will Python code that calls into C functions
  • Model management: if I produce several versions of a model, how do I manage them
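
For the version-compatibility and management issues, one common partial mitigation is to record the training environment next to the pickle file. Below is a minimal sketch under my own assumptions (the file layout and field names are not from the original article):

import json
import platform
import sklearn
from sklearn.externals import joblib


def dump_with_metadata(model, path):
    # Persist the model itself
    joblib.dump(model, path + ".pkl")
    # Record the environment the model was trained in, so that
    # incompatible loads can be detected early
    meta = {
        "python_version": platform.python_version(),
        "sklearn_version": sklearn.__version__,
        "model_class": type(model).__name__,
    }
    with open(path + ".json", "w") as f:
        json.dump(meta, f)


def load_with_check(path):
    with open(path + ".json") as f:
        meta = json.load(f)
    if meta["sklearn_version"] != sklearn.__version__:
        print("Warning: model was trained with sklearn %s, current is %s"
              % (meta["sklearn_version"], sklearn.__version__))
    return joblib.load(path + ".pkl")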

Spark

Spark's Pipeline and Model both support saving to files, which can then be conveniently loaded in another context.

The training code is as follows:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

from pyspark.sql.types import DoubleType

from pyspark import SparkFiles
from pyspark import SparkContext

url = "https://server/iris.csv"

spark.sparkContext.addFile(url)

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.csv(SparkFiles.get("iris.csv"), header=True)

data = data.withColumn("sepal_length", data["sepal_length"].cast(DoubleType()))
data = data.withColumn("sepal_width", data["sepal_width"].cast(DoubleType()))
data = data.withColumn("petal_width", data["petal_width"].cast(DoubleType()))
data = data.withColumn("petal_length", data["petal_length"].cast(DoubleType()))

#data.show()
data.printSchema()

assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_width", "petal_length"],
    outputCol="features")

output = assembler.transform(data)

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="species", outputCol="indexedLabel").fit(output)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(output)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[assembler, labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[3]
print(rfModel)  # summary only

filebase="hdfs://server:9000/tmp"

pipeline.write().overwrite().save("{}/classification-pipeline".format(filebase))
model.write().overwrite().save("{}/classification-model".format(filebase))

The code for loading the model is as follows:

%pyspark
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql.types import DoubleType
from pyspark import SparkFiles

url = "https://server/iris.csv"
spark.sparkContext.addFile(url)

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.csv(SparkFiles.get("iris.csv"), header=True)

data = data.withColumn("sepal_length", data["sepal_length"].cast(DoubleType()))
data = data.withColumn("sepal_width", data["sepal_width"].cast(DoubleType()))
data = data.withColumn("petal_width", data["petal_width"].cast(DoubleType()))
data = data.withColumn("petal_length", data["petal_length"].cast(DoubleType()))

filebase="hdfs://server:9000/tmp/"

pipeline = Pipeline.read().load("{}/classification-pipeline".format(filebase))
model = PipelineModel.read().load("{}/classification-model".format(filebase))

# Make predictions.
predictions = model.transform(data)

# Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

Calling toDebugString on the random forest stage of the model shows the internal details of the classifier.
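
For example, assuming the fitted PipelineModel from above (where the forest is the fourth stage):

rfModel = model.stages[3]
print(rfModel.toDebugString)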

RandomForestClassificationModel (uid=rfc_225ef4968bf9) with 10 trees
  Tree 0 (weight 1.0):
    If (feature 3 <= 1.9)
     Predict: 1.0
    Else (feature 3 > 1.9)
     If (feature 3 <= 4.7)
      Predict: 2.0
     Else (feature 3 > 4.7)
      If (feature 3 <= 5.1)
       If (feature 1 <= 2.5)
        Predict: 0.0
       Else (feature 1 > 2.5)
        If (feature 1 <= 2.7)
         Predict: 2.0
        Else (feature 1 > 2.7)
         Predict: 0.0
      Else (feature 3 > 5.1)
       Predict: 0.0
  Tree 1 (weight 1.0):
    If (feature 3 <= 1.9)
     Predict: 1.0
    Else (feature 3 > 1.9)
     If (feature 3 <= 4.9)
      If (feature 2 <= 1.6)
       Predict: 2.0
      Else (feature 2 > 1.6)
       If (feature 0 <= 4.9)
        Predict: 0.0
       Else (feature 0 > 4.9)
        If (feature 0 <= 5.9)
         Predict: 2.0
        Else (feature 0 > 5.9)
         Predict: 0.0
     Else (feature 3 > 4.9)
      If (feature 1 <= 3.0)
       If (feature 3 <= 5.1)
        If (feature 2 <= 1.7)
         Predict: 0.0
        Else (feature 2 > 1.7)
         Predict: 0.0
       Else (feature 3 > 5.1)
        Predict: 0.0
      Else (feature 1 > 3.0)
       Predict: 0.0
  Tree 2 (weight 1.0):
    If (feature 3 <= 1.9)
     Predict: 1.0
    Else (feature 3 > 1.9)
     If (feature 3 <= 5.0)
      If (feature 2 <= 1.6)
       Predict: 2.0
      Else (feature 2 > 1.6)
       If (feature 1 <= 2.5)
        Predict: 0.0
       Else (feature 1 > 2.5)
        Predict: 2.0
     Else (feature 3 > 5.0)
      If (feature 0 <= 6.0)
       If (feature 1 <= 2.7)
        If (feature 0 <= 5.8)
         Predict: 0.0
        Else (feature 0 > 5.8)
         Predict: 2.0
       Else (feature 1 > 2.7)
        Predict: 0.0
      Else (feature 0 > 6.0)
       Predict: 0.0
  Tree 3 (weight 1.0):
    If (feature 3 <= 1.9)
     Predict: 1.0
    Else (feature 3 > 1.9)
     If (feature 3 <= 4.9)
      If (feature 2 <= 1.5)
       Predict: 2.0
      Else (feature 2 > 1.5)
       If (feature 2 <= 1.7)
        Predict: 0.0
       Else (feature 2 > 1.7)
        Predict: 2.0
     Else (feature 3 > 4.9)
      If (feature 3 <= 5.1)
       If (feature 0 <= 6.5)
        If (feature 0 <= 5.9)
         Predict: 0.0
        Else (feature 0 > 5.9)
         Predict: 0.0
       Else (feature 0 > 6.5)
        Predict: 2.0
      Else (feature 3 > 5.1)
       Predict: 0.0
  Tree 4 (weight 1.0):
    If (feature 2 <= 0.5)
     Predict: 1.0
    Else (feature 2 > 0.5)
     If (feature 2 <= 1.5)
      If (feature 2 <= 1.4)
       Predict: 2.0
      Else (feature 2 > 1.4)
       If (feature 3 <= 4.9)
        Predict: 2.0
       Else (feature 3 > 4.9)
        Predict: 0.0
     Else (feature 2 > 1.5)
      If (feature 2 <= 1.8)
       If (feature 3 <= 5.0)
        If (feature 0 <= 4.9)
         Predict: 0.0
        Else (feature 0 > 4.9)
         Predict: 2.0
       Else (feature 3 > 5.0)
        Predict: 0.0
      Else (feature 2 > 1.8)
       Predict: 0.0
  Tree 5 (weight 1.0):
    If (feature 2 <= 0.5)
     Predict: 1.0
    Else (feature 2 > 0.5)
     If (feature 2 <= 1.6)
      If (feature 2 <= 1.3)
       Predict: 2.0
      Else (feature 2 > 1.3)
       If (feature 3 <= 4.9)
        Predict: 2.0
       Else (feature 3 > 4.9)
        Predict: 0.0
     Else (feature 2 > 1.6)
      If (feature 3 <= 4.8)
       If (feature 2 <= 1.7)
        Predict: 0.0
       Else (feature 2 > 1.7)
        Predict: 2.0
      Else (feature 3 > 4.8)
       Predict: 0.0
  Tree 6 (weight 1.0):
    If (feature 3 <= 1.9)
     Predict: 1.0
    Else (feature 3 > 1.9)
     If (feature 3 <= 4.9)
      If (feature 2 <= 1.6)
       Predict: 2.0
      Else (feature 2 > 1.6)
       If (feature 1 <= 2.8)
        Predict: 0.0
       Else (feature 1 > 2.8)
        Predict: 2.0
     Else (feature 3 > 4.9)
      If (feature 1 <= 2.7)
       If (feature 2 <= 1.6)
        If (feature 3 <= 5.0)
         Predict: 0.0
        Else (feature 3 > 5.0)
         Predict: 2.0
       Else (feature 2 > 1.6)
        Predict: 0.0
      Else (feature 1 > 2.7)
       Predict: 0.0
  Tree 7 (weight 1.0):
    If (feature 0 <= 5.4)
     If (feature 2 <= 0.5)
      Predict: 1.0
     Else (feature 2 > 0.5)
      Predict: 2.0
    Else (feature 0 > 5.4)
     If (feature 2 <= 1.7)
      If (feature 3 <= 1.5)
       Predict: 1.0
      Else (feature 3 > 1.5)
       If (feature 0 <= 6.9)
        If (feature 3 <= 5.0)
         Predict: 2.0
        Else (feature 3 > 5.0)
         Predict: 0.0
       Else (feature 0 > 6.9)
        Predict: 0.0
     Else (feature 2 > 1.7)
      If (feature 0 <= 5.9)
       If (feature 2 <= 1.8)
        Predict: 2.0
       Else (feature 2 > 1.8)
        Predict: 0.0
      Else (feature 0 > 5.9)
       Predict: 0.0
  Tree 8 (weight 1.0):
    If (feature 3 <= 1.7)
     Predict: 1.0
    Else (feature 3 > 1.7)
     If (feature 3 <= 5.1)
      If (feature 2 <= 1.6)
       If (feature 2 <= 1.4)
        Predict: 2.0
       Else (feature 2 > 1.4)
        If (feature 1 <= 2.2)
         Predict: 0.0
        Else (feature 1 > 2.2)
         Predict: 2.0
      Else (feature 2 > 1.6)
       If (feature 1 <= 2.5)
        Predict: 0.0
       Else (feature 1 > 2.5)
        If (feature 3 <= 5.0)
         Predict: 2.0
        Else (feature 3 > 5.0)
         Predict: 0.0
     Else (feature 3 > 5.1)
      Predict: 0.0
  Tree 9 (weight 1.0):
    If (feature 2 <= 0.5)
     Predict: 1.0
    Else (feature 2 > 0.5)
     If (feature 0 <= 6.1)
      If (feature 3 <= 4.8)
       If (feature 0 <= 4.9)
        If (feature 2 <= 1.0)
         Predict: 2.0
        Else (feature 2 > 1.0)
         Predict: 0.0
       Else (feature 0 > 4.9)
        Predict: 2.0
      Else (feature 3 > 4.8)
       Predict: 0.0
     Else (feature 0 > 6.1)
      If (feature 3 <= 4.9)
       If (feature 1 <= 2.8)
        If (feature 0 <= 6.2)
         Predict: 0.0
        Else (feature 0 > 6.2)
         Predict: 2.0
       Else (feature 1 > 2.8)
        Predict: 2.0
      Else (feature 3 > 4.9)
       Predict: 0.0

The figure below shows the directory structure of the Pipeline model saved by Spark:

We can see that it contains the metadata together with the data for the five stages of the Pipeline. The files are stored in Spark's internal formats, so in practice only Spark itself can load them back.
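
As a rough illustration (the exact directory and file names vary with the Spark version), the saved PipelineModel typically looks like this:

classification-model/
├── metadata/                          # JSON description of the pipeline
└── stages/                            # one sub-directory per stage
    ├── 0_VectorAssembler_xxx/
    ├── 1_StringIndexer_xxx/
    │   ├── metadata/
    │   └── data/                      # Parquet files with the fitted parameters
    ├── 2_VectorIndexer_xxx/
    ├── 3_RandomForestClassifier_xxx/
    └── 4_IndexToString_xxx/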

Tensorflow

Finally, let's look at Tensorflow. Tensorflow provides tf.train.Saver to export its model as a MetaGraph.

from __future__ import print_function

import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

import numpy as np

# Ignore all GPUs, tf random forest does not benefit from it.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

data = load_iris()
dX, dy = data["data"], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    dX, dy, test_size=0.33, random_state=42)

# Parameters
num_steps = 500  # Total steps to train
batch_size = 10  # The number of samples per batch
num_classes = 3  # The 3 iris species
num_features = 4  # The 4 iris features
num_trees = 10
max_nodes = 100

# Input and Target data
X = tf.placeholder(tf.float32, shape=[None, num_features])
# For random forest, labels must be integers (the class id)
Y = tf.placeholder(tf.int32, shape=[None])

# Random Forest Parameters
hparams = tensor_forest.ForestHParams(num_classes=num_classes,
                                      num_features=num_features,
                                      num_trees=num_trees,
                                      max_nodes=max_nodes).fill()

# Build the Random Forest
forest_graph = tensor_forest.RandomForestGraphs(hparams)
# Get training graph and loss
train_op = forest_graph.training_graph(X, Y)
loss_op = forest_graph.training_loss(X, Y)

# Measure the accuracy
infer_op, _, _ = forest_graph.inference_graph(X)
correct_prediction = tf.equal(tf.argmax(infer_op, 1), tf.cast(Y, tf.int64))
accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initialize the variables (i.e. assign their default value) and forest resources
init_vars = tf.group(tf.global_variables_initializer(),
                     resources.initialize_resources(resources.shared_resources()))


def next_batch(size):
    index = range(len(X_train))
    index_batch = np.random.choice(index, size)
    return X_train[index_batch], y_train[index_batch]


# Start TensorFlow session
sess = tf.Session()

# Run the initializer
sess.run(init_vars)

saver = tf.train.Saver()

# Training
for i in range(1, num_steps + 1):
    # Prepare Data
    # Get the next batch of iris training data
    batch_x, batch_y = next_batch(batch_size)
    _, l = sess.run([train_op, loss_op], feed_dict={X: batch_x, Y: batch_y})
    if i % 50 == 0 or i == 1:
        acc = sess.run(accuracy_op, feed_dict={X: batch_x, Y: batch_y})
        print('Step %i, Loss: %f, Acc: %f' % (i, l, acc))
# Test Model
print("Test Accuracy:", sess.run(
    accuracy_op, feed_dict={X: X_test, Y: y_test}))

# Print the tensors related to this model
print(accuracy_op)
print(infer_op)
print(X)
print(Y)

# save the model to a checkpoint file
save_path = saver.save(sess, "/tmp/model.ckpt")

The exported model consists of the following files:

The checkpoint file is metadata that records the paths of the other files. Alongside it are the .meta file, which holds the serialized MetaGraph (the graph definition), and the .index/.data checkpoint files that hold the variable values. In other words, Tensorflow stores the graph definition and the variables in its own format, with this extra metadata on top.

The code for loading the model is as follows:

from __future__ import print_function

import tensorflow as tf

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# note: this has to be imported in case to support forest graph
from tensorflow.contrib.tensor_forest.python import tensor_forest

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

saver = tf.train.import_meta_graph('/tmp/model.ckpt.meta')

data = load_iris()

dX, dy = data["data"], data["target"]

graph = tf.get_default_graph()
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('/tmp/model.ckpt.meta')
    new_saver.restore(sess, '/tmp/model.ckpt')
    #input = graph.get_operation_by_name("train")
    # print(graph.as_graph_def())
    load_infer_op = graph.get_tensor_by_name('probabilities:0')
    accuracy_op = graph.get_tensor_by_name('Mean_1:0')
    X = graph.get_tensor_by_name('Placeholder:0')
    Y = graph.get_tensor_by_name('Placeholder_1:0')
    print("Test Accuracy:", sess.run(accuracy_op, feed_dict={X: dX, Y: dy}))
    result = sess.run(load_infer_op, feed_dict={X: dX})
    prediction_result = [i.argmax() for i in result]
    print(classification_report(dy, prediction_result,
                                target_names=data["target_names"]))

Note that RandomForest is not part of Tensorflow's core package, so when loading the model you must import tensorflow.contrib.tensor_forest.python.tensor_forest, otherwise the model cannot be loaded successfully: without that import, some of the attributes defined in tensor_forest would be missing.

In addition, Tensorflow can also store the computation graph itself: calling tf.train.write_graph() saves the graph definition, which can of course also be displayed in TensorBoard.

tf.train.write_graph(sess.graph_def, '/tmp', 'train.pbtxt')
%cat /tmp/train.pbtxt

 

So, as we have seen, Sklearn, Spark, and Tensorflow each provide their own way to persist models. In principle, then, all you need is a web server such as Flask plus some model loading and management code, expose a REST API, and you can serve predictions. Sounds simple, doesn't it? A minimal sketch is shown below.
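
As a minimal sketch (the route name and payload format here are my own assumptions, not from the original article), serving the Sklearn model with Flask could look like this:

from flask import Flask, request, jsonify
from sklearn.externals import joblib

app = Flask(__name__)

# Load the persisted Sklearn model once at startup
clf = joblib.load('classification.pkl')


@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"instances": [[5.1, 3.5, 1.4, 0.2], ...]}
    instances = request.get_json()["instances"]
    predictions = clf.predict(instances).tolist()
    return jsonify({"predictions": predictions})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)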

In reality, serving models in production raises many other challenges, for example:

  • How to scale out and in on the cloud
  • How to tune performance
  • How to manage model versions
  • Security
  • How to do continuous integration and continuous deployment
  • How to support A/B testing

To address these deployment challenges, different organizations have developed open-source tools such as Clipper, Seldon, MLflow, MLeap, Oracle Graphpipe, MXNet Model Server, and so on. Let's pick a few of them and take a closer look.

Clipper

Clipper, developed by the UC Berkeley RISE Lab, is a prediction-serving system that sits between user applications and machine learning models; by decoupling the application from the machine learning system, it simplifies the deployment process.

It provides the following features:

  • A simple, standardized REST interface that simplifies integration with machine learning systems and supports the major ML frameworks
  • Model deployment using the same libraries and environment the model was developed with
  • Better throughput through adaptive batching, caching, and similar techniques
  • Better prediction accuracy through intelligent model selection and model combining

Clipper's architecture is shown in the figure below:

Clipper is built with containers and microservices. It uses Redis for configuration management and Prometheus for monitoring, and it supports managing containers either with Kubernetes or with a local Docker daemon.

Clipper supports the following kinds of models:

  • Pure Python functions
  • PySpark
  • PyTorch
  • Tensorflow
  • MXNet
  • Custom models

The basic Clipper deployment workflow is as follows (you can refer to my notebook for details); a code sketch of these steps appears after the list.

  1. Create a Clipper cluster (on Kubernetes or local Docker)
  2. Create an application
  3. Train the model
  4. Deploy the model using the deployment method Clipper provides for the framework in question. During deployment, the trained estimator is serialized with CloudPickle, a container image is built locally, and it is deployed to Docker or Kubernetes.
  5. Link the model to the application, which effectively publishes the model. You can then call the corresponding REST API to get predictions.
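
A minimal sketch of these steps with the clipper_admin package and a local Docker daemon might look like the following (the application and model names are placeholders, and exact arguments may differ between Clipper versions):

from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# 1. Create (start) a Clipper cluster on local Docker
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()

# 2. Create an application that expects four double features
clipper_conn.register_application(name="iris-app", input_type="doubles",
                                  default_output="-1.0", slo_micros=100000)

# 3. Train the model
data = load_iris()
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(data["data"], data["target"])

# 4. Deploy a Python closure around the model; the closure is serialized
#    with CloudPickle and packaged into a container image
def predict(xs):
    return [str(p) for p in clf.predict(xs)]

python_deployer.deploy_python_closure(clipper_conn, name="iris-model",
                                      version=1, input_type="doubles",
                                      func=predict)

# 5. Link the model to the application to publish it
clipper_conn.link_model_to_app(app_name="iris-app", model_name="iris-model")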

I tried publishing the RandomForest examples from the three tools above to my Kubernetes cluster with Clipper, and hit the following pitfalls:

  • The CloudPickle version on my machine was too new, so the model could not be deserialized; see this issue
  • Tensorflow failed during pickling, presumably because C code gets called
  • My Kubernetes cluster runs on AWS; using internal IPs on K8s failed because Clipper kept connecting with the external domain name, so the PySpark model could not be deployed

In short, apart from Sklearn, which deployed successfully, both Tensorflow and Spark failed. You can refer to the examples here.

Seldon

Seldon is a company founded in London, dedicated to giving users control over machine learning systems built on open-source software. Seldon Core is the company's open-source tool for deploying machine learning models on Kubernetes. It has the following features:

  • Support for Python/Spark/H2O/R models
  • REST API and gRPC interfaces
  • Deployment of microservice graphs built from Models/Routers/Combiners/Transformers
  • Scaling, security, monitoring, and other DevOps capabilities provided through Kubernetes

The workflow for using Seldon, illustrated in the figure above, is:

  1. First install Seldon Core on Kubernetes; Seldon uses ksonnet and installs Seldon Core in the form of CRDs
  2. Use s2i (an open-source OpenShift tool that builds code into container images) to build the runtime model container and push it to a container registry; a sketch of the Python wrapper that s2i expects follows this list
  3. Write your runtime graph and submit it to Kubernetes to deploy your model
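
For a Python model, the seldon-core s2i convention is roughly a class whose name matches its file name and that exposes a predict method; the following is a minimal sketch under that assumption (the file/class name IrisClassifier is my own placeholder):

from sklearn.externals import joblib


class IrisClassifier(object):
    """Wrapper class instantiated by seldon-core's Python s2i image."""

    def __init__(self):
        # Load the persisted Sklearn model once, when the container starts
        self.clf = joblib.load("classification.pkl")

    def predict(self, X, features_names):
        # X is a numpy array of shape (n_samples, n_features)
        return self.clf.predict_proba(X)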

Seldon lets you build your runtime graph out of four basic units, Model, Transformer, Router, and Combiner, and then creates the corresponding resources and instances on Kubernetes according to that graph, which gives you capabilities such as A/B testing and model ensembling.

For example, the figures below show a few such graphs:

A/B testing

Model ensemble

Complex graph

The graph model is Seldon's biggest highlight: you can train different models and then compose them into different runtimes via a graph, which is very convenient. More examples can be found here.

I tried deploying the models produced by the three tools mentioned earlier with Seldon on K8s, and all of them succeeded (the code is here). A few problems I ran into:

  • Seldon provides images for Python and for Java, but running PySpark needs both, so I had to build my own image by manually installing Java on top of the Python image
  • Because of the way the CRD works, I could not find an effective way to change the containers' liveness and readiness settings. Spark initializes the model from Hadoop and loading it takes time, so the readiness probe kept timing out and Kubernetes kept restarting the container. I had to change the code to load the model lazily (a sketch of this pattern follows the list); the first REST call is then slow, but the container and the service at least start up normally.
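
The lazy-load workaround can be sketched as follows (a generic pattern rather than the exact code used; the wrapper class name and paths are placeholders):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel


class SparkIrisClassifier(object):

    def __init__(self):
        # Do NOT load the model here: fetching the PipelineModel from HDFS
        # is slow and would make the readiness probe time out
        self.model = None

    def _get_model(self):
        # Load the model lazily, on the first prediction request
        if self.model is None:
            # A SparkSession must exist before the model can be loaded
            SparkSession.builder.appName("model-serving").getOrCreate()
            self.model = PipelineModel.read().load(
                "hdfs://server:9000/tmp/classification-model")
        return self.model

    def predict(self, X, features_names):
        model = self._get_model()
        spark = SparkSession.builder.getOrCreate()
        cols = ["sepal_length", "sepal_width", "petal_width", "petal_length"]
        df = spark.createDataFrame([[float(v) for v in row] for row in X], cols)
        return [r["predictedLabel"] for r in model.transform(df).collect()]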

From my earlier article introducing Kubeflow, you may recall that Kubeflow uses Seldon to manage model deployment.

MLflow

MLflow is an open-source system developed by Databricks for managing the end-to-end machine learning lifecycle. I have written an article introducing the tool before.

MLflow provides tracking, project management, and model management. Serving a Sklearn-based model with MLflow is very simple:

from __future__ import print_function

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import mlflow
import mlflow.sklearn

if __name__ == "__main__":
    data = load_iris()

    X, y = data["data"], data["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X_train, y_train)

    print(clf.feature_importances_)

    print(classification_report(y_test, clf.predict(
        X_test), target_names=data["target_names"]))

    mlflow.sklearn.log_model(clf, "model")
    print("Model saved in run %s" % mlflow.active_run().info.run_uuid)

When mlflow.sklearn.log_model() is called, MLflow creates the following directory structure to manage the model:

We can see that under the artifacts directory there is a Python pickle file plus a metadata file, MLmodel:

artifact_path: model
flavors:
  python_function:
    data: model.pkl
    loader_module: mlflow.sklearn
    python_version: 2.7.10
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.20.0
run_id: 44ae85c084904b4ea5bad5aa42c9ce05
utc_time_created: '2018-10-02 23:38:49.786871'

With mlflow sklearn serve -m model you can then conveniently serve the sklearn-based model.
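
Once the server is running, the model can be queried over HTTP. The sketch below assumes the default port and an /invocations endpoint accepting a JSON list of records; the exact endpoint and payload format depend on the MLflow version:

import requests

# One record per row; the positional keys are an assumption, since the
# model above was trained on a plain numpy array without column names
payload = [
    {"0": 5.1, "1": 3.5, "2": 1.4, "3": 0.2},
]
resp = requests.post("http://127.0.0.1:5000/invocations", json=payload,
                     headers={"Content-Type": "application/json"})
print(resp.text)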

Although MLflow also claims to support Spark and Tensorflow, those integrations are Python-based as well. I tried to use them, but the documentation and examples are sparse, so I did not succeed; in principle they all follow the same pickle-plus-metadata approach. Interested readers may want to give them a try.

On the deployment side, one of MLflow's highlights is its support for Sagemaker and AzureML.

MLeap

MLeap's goal is to provide a model format, and a runtime engine, that is portable between Spark and Sklearn. It consists of:

  • JSON-based serialization
  • A runtime engine
  • Benchmarks

MLeap's architecture is shown in the figure below:

Here is an example of exporting a Sklearn model with MLeap:

# Initialize MLeap libraries before Scikit/Pandas
import mleap.sklearn.preprocessing.data
import mleap.sklearn.pipeline
from mleap.sklearn.ensemble import forest
from mleap.sklearn.preprocessing.data import FeatureExtractor

# Import Scikit Transformer(s)
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
input_features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

output_vector_name = 'extracted_features' # Used only for serialization purposes
output_features = [x for x in input_features]

feature_extractor_tf = FeatureExtractor(input_scalars=input_features,
                                        output_vector=output_vector_name,
                                        output_vector_items=output_features)

classification_tf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                       oob_score=False, random_state=0, verbose=0, warm_start=False)

classification_tf.mlinit(input_features="features", prediction_column='species',feature_names="features")

rf_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                        (classification_tf.name, classification_tf)])
rf_pipeline.mlinit()
rf_pipeline.fit(data[input_features],data['species'])

rf_pipeline.serialize_to_bundle('./', 'mleap-scikit-rf-pipeline', init=True)

The structure of the exported model is shown below.

This is the JSON for the random forest model:

{
   "attributes": {
      "num_features": {
         "long": 4
      }, 
      "trees": {
         "type": "list", 
         "string": [
            "tree0", 
            "tree1", 
            "tree2", 
            "tree3", 
            "tree4", 
            "tree5", 
            "tree6", 
            "tree7", 
            "tree8", 
            "tree9"
         ]
      }, 
      "tree_weights": {
         "double": [
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0, 
            1.0
         ], 
         "type": "list"
      }
   }, 
   "op": "random_forest_classifier"
}

We can see that MLeap serializes the model entirely into code-independent JSON files, which is what makes models portable between runtimes such as Spark and Sklearn.

Serving a model with MLeap does not require any Sklearn or Spark code: just start the MLeap server and submit the model.

docker run -p 65327:65327 -v /tmp/models:/models combustml/mleap-serving:0.9.0-SNAPSHOT
curl -XPUT -H "content-type: application/json" \
		-d '{"path":"/models/yourmodel.zip"}' \
		http://localhost:65327/model
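
After the model is loaded, predictions can be requested by POSTing a LeapFrame to the server. The sketch below assumes a /transform endpoint and the default LeapFrame JSON layout, which may differ across mleap-serving versions:

import requests

# A LeapFrame: a schema plus rows, matching the pipeline's input columns
frame = {
    "schema": {
        "fields": [
            {"name": "sepal_length", "type": "double"},
            {"name": "sepal_width", "type": "double"},
            {"name": "petal_width", "type": "double"},
            {"name": "petal_length", "type": "double"},
        ]
    },
    "rows": [[5.1, 3.5, 0.2, 1.4]],
}
resp = requests.post("http://localhost:65327/transform", json=frame,
                     headers={"Content-Type": "application/json"})
print(resp.json())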

The following Scala code trains the same RandomForest classification model on Spark and persists it with MLeap.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.types.{IntegerType, DoubleType}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

import ml.combust.bundle.BundleFile
import ml.combust.bundle.serializer.SerializationFormat
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.mleap.SparkUtil
import resource._

import org.apache.spark.SparkFiles

spark.sparkContext.addFile("https://s3-us-west-2.amazonaws.com/mlapi-samples/demo/data/input/iris.csv")
val data = spark.read.format("csv").option("header", "true").load(SparkFiles.get("iris.csv"))

//data.show()
//data.printSchema()

// Transform, convert string coloumn to number
// this transform is not part of the pipeline
val featureDf = data.select(data("sepal_length").cast(DoubleType).as("sepal_length"),
                            data("sepal_width").cast(DoubleType).as("sepal_width"),
                            data("petal_width").cast(DoubleType).as("petal_width"),
                            data("petal_length").cast(DoubleType).as("petal_length"),
                            data("species") )

// assember the features
val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_length", "sepal_width", "petal_width", "petal_length"))
  .setOutputCol("features")
  
val output = assembler.transform(featureDf)

// create lable and features
val labelIndexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("indexedLabel")
  .fit(output)

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(output)
  
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = featureDf.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val rfModel = model.stages(3).asInstanceOf[RandomForestClassificationModel]
println("Learned classification forest model:\n" + rfModel.toDebugString)

val pipelineModel = SparkUtil.createPipelineModel(Array(model))

for(bundle <- managed(BundleFile("file:/tmp/mleap-examples/rf"))) {
  pipelineModel.writeBundle.format(SerializationFormat.Json).save(bundle)
}

The exported model has the same format as the Sklearn one shown earlier.

MLeap's difficulty is that it has to support every algorithm: serialization has to be implemented for each one, which also means that supporting user-defined algorithms takes a fair amount of development work. For the algorithms that are currently supported, see here.

 

Others

Besides the tools above, there are some others we have not covered; interested readers can look them up.

Summary

Seldon Core integrates very well with K8s, its runtime-graph approach is very powerful, and it was the only tool in my experiments that successfully deployed all three kinds of models (Sklearn, Spark, and Tensorflow). Highly recommended!

Clipper offers model deployment based on K8s and Docker, and its model version management is done well, but the code is not very stable and has quite a few small issues; being based on CloudPickle brings its own limitations, and supporting only Python is another drawback. Recommended for data scientists who do a lot of local, interactive work.

MLflow makes it very easy to serve Python-based models, but it lacks integration with containers. On the other hand, it supports clouds such as Sagemaker and AzureML, so it is recommended for those already using these platforms.

MLeap's distinguishing feature is model portability: a model trained with sklearn can be exported and run on Spark. This capability is very attractive, but MLeap still has a long way to go before it supports all algorithms. On the topic of standardizing machine learning models, PMML is also worth watching; at present tool support for PMML is rather limited, and with the spread of deep learning it is unclear where PMML will go.

The table below is a brief summary of the tools above, for reference.

 

| Tool | Model Persistence | ML Tools | Kubernetes Integration | Version | License | Implementation |
| --- | --- | --- | --- | --- | --- | --- |
| Seldon Core | S2i + Pickle | Tensorflow, Sklearn, Keras, R, H2O, Nodejs, PMML | Yes | 0.3.2 | Apache | Docker + K8s CRD |
| Clipper | Pickle | Python, PySpark, PyTorch, Tensorflow, MXNet, Custom Container | Yes | 0.3.0 | Apache | CPP / Python |
| MLflow | Directory + Metadata | Python, H2O, Keras, MLeap, PyTorch, Sklearn, Spark, Tensorflow, R | No | Alpha | Apache | Python |
| MLeap | JSON | Spark, Sklearn, Tensorflow | No | 0.12.0 | Apache | Scala/Java |

 
