
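6.6 Getting Started with tf.estimator

Learning objectives

Objectives
- Know the workflow for using tf.estimator
- Understand what a premade Estimator is
Application
- Use tf.estimator to perform binary classification on US census data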

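6.6.1 Introduction to tf.estimator

The tf.estimator API in TensorFlow encapsulates basic machine learning models. Estimators are the most scalable, production-oriented type of TensorFlow model.

This section introduces Estimators, a high-level TensorFlow API that greatly simplifies machine learning programming. Estimators encapsulate the following operations:

- Training
- Evaluation
- Prediction
- Export for serving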

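Advantages of Estimators

Estimators provide the following advantages:

- You can run Estimator-based models on a local host or in a distributed multi-server environment without changing the model. You can also run Estimator-based models on CPUs, GPUs, or TPUs without recoding the model.
- Estimators simplify sharing implementations between model developers.
- You can develop state-of-the-art models with high-level, intuitive code. In short, creating a model with Estimators is usually easier than with the low-level TensorFlow APIs.
- Estimators are themselves built on tf.layers, which simplifies customization.
- Estimators build the graph for you.
- Estimators provide a safe distributed training loop that controls how and when to:
  - build the graph
  - initialize variables
  - start queues
  - handle exceptions
  - create checkpoint files and recover from failures
  - save summaries for TensorBoard

When writing an application with Estimators, you must separate the data input pipeline from the model. This separation makes it easier to experiment with different datasets.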

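Premade Estimators

With premade Estimators, you can work at a much higher conceptual level than with the base TensorFlow APIs. Because Estimators handle all the "plumbing" for you, you no longer need to worry about creating the computational graph or sessions. In other words, premade Estimators create and manage Graph and Session objects for you. In addition, premade Estimators let you try different model architectures with only minimal code changes. For example, DNNClassifier is a premade Estimator class that trains classification models based on dense, feed-forward neural networks.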

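Structure of a premade Estimator program

A TensorFlow program that relies on a premade Estimator typically consists of the following four steps:

**Write one or more dataset importing functions.** For example, you might create one function to import the training set and another to import the test set. Each dataset importing function must return two objects:
- a dictionary in which the keys are feature names and the values are Tensors (or SparseTensors) containing the corresponding feature data
- a Tensor containing one or more labels

For example, the following code shows the basic skeleton of an input function:
```
def input_fn(dataset):
    ...  # manipulate dataset, extracting the feature dict and the label
    return feature_dict, label
```
(For full details, see the Importing Data guide.)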
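**Define the feature columns.** Each tf.feature_column identifies a feature name, its type, and any input preprocessing. For example, the following snippet creates three feature columns that hold integer or floating-point data. The first two feature columns simply identify the feature's name and type. The third also specifies a lambda that the program will invoke to scale the raw data: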
```
# Define three numeric feature columns.
population = tf.feature_column.numeric_column('population')
crime_rate = tf.feature_column.numeric_column('crime_rate')
median_education = tf.feature_column.numeric_column('median_education',
                    normalizer_fn=lambda x: x - global_education_mean)
```
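To make the effect of `normalizer_fn` concrete, here is a plain-Python sketch (no TensorFlow involved) that applies the same kind of centering lambda to a few values. The value of `global_education_mean` and the sample inputs are made up for illustration only.

```python
# Hypothetical mean of the 'median_education' feature (illustrative value only).
global_education_mean = 12.0

# Same form as the lambda passed as normalizer_fn above.
normalizer_fn = lambda x: x - global_education_mean

raw_values = [10.0, 12.0, 16.0]  # made-up raw feature values
normalized = [normalizer_fn(v) for v in raw_values]
print(normalized)  # [-2.0, 0.0, 4.0]
```

In the real feature column, the lambda receives a tensor rather than single Python floats, but the arithmetic is the same.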

**Instantiate the relevant premade Estimator.** For example, here is sample code that instantiates a premade Estimator named LinearClassifier:

```
# Instantiate an estimator, passing the feature columns.
estimator = tf.estimator.LinearClassifier(
    feature_columns=[population, crime_rate, median_education],
    )
```

**Call a training, evaluation, or inference method.** For example, all Estimators provide a train method that trains the model.

```
# my_training_set is the function created in Step 1
estimator.train(input_fn=my_training_set, steps=2000)
```

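6.6.1.1 Premade Estimators

Premade Estimators are subclasses of the base class tf.estimator.Estimator, whereas custom Estimators are instances of tf.estimator.Estimator.

Premade Estimators are ready-made. Sometimes, however, you need more control over an Estimator's behavior, and that is where custom Estimators come in. You can create a custom Estimator to do just about anything. If you want the hidden layers connected in some unusual way, write a custom Estimator. If you want to compute a metric specific to your model, write a custom Estimator. Basically, if you want to optimize for a particular problem, you can write a custom Estimator.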

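6.6.2 Case study: classification with US census data

This case uses the US Census Income dataset from 1994 and 1995. The task is binary classification: the target label is 1 if income exceeds 50,000 US dollars and 0 otherwise.

Dataset sizes:
- train: 32561
- validation: 16281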

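| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 40 | Cuba | <=50K |

(The capital_loss column, which appears in the column list used for reading the files, is not shown in this preview.)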

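These columns fall into two types, categorical and continuous:

- A column is categorical if its value can only be one of the categories in a finite set. For example, a person's relationship (Wife, Husband, Unmarried, and so on) or education level (high school, college, and so on) is a categorical column.
- A column is continuous if its value can be any numerical value in a continuous range. For example, a person's capital gain (such as 14084 US dollars) is a continuous column.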
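As a rough illustration of this distinction, the sketch below splits the columns of the first sample row shown earlier by the Python type of each value. This is only a heuristic assumed for illustration (an integer-coded category would be misclassified), and `split_columns` is a hypothetical helper, not part of TensorFlow.

```python
# First sample row from the preview table, without the label column.
sample_row = {
    'age': 39, 'workclass': 'State-gov', 'fnlwgt': 77516,
    'education': 'Bachelors', 'education_num': 13,
    'marital_status': 'Never-married', 'occupation': 'Adm-clerical',
    'relationship': 'Not-in-family', 'race': 'White', 'gender': 'Male',
    'capital_gain': 2174, 'hours_per_week': 40,
    'native_country': 'United-States',
}

def split_columns(row):
    """Heuristic: string values -> categorical, numeric values -> continuous."""
    categorical = [name for name, value in row.items() if isinstance(value, str)]
    continuous = [name for name, value in row.items()
                  if isinstance(value, (int, float))]
    return categorical, continuous

categorical, continuous = split_columns(sample_row)
print(continuous)
# ['age', 'fnlwgt', 'education_num', 'capital_gain', 'hours_per_week']
```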

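6.6.2.1 Case implementation

Goal: predict a binary class from the census income data
Steps:
- 1. Read the US census income data
- 2. Select features for the model and apply feature engineering
- 3. Train and evaluate the model

1. Read the US census income data

The tf.data API makes it easy to process large amounts of data in different formats and to apply complex transformations.

API for reading CSV files: tf.data.TextLineDataset()
- Argument: a list of file paths (path plus file name)
- Returns: a Dataset

The local data files are adult.data and adult.test.

Settings used for reading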

```
_CSV_COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'gender',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income_bracket'
]

_CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''], [''],
                        [0], [0], [0], [''], ['']]


train_file = "/root/toutiao_project/reco_sys/server/models/data/adult.data"
test_file = "/root/toutiao_project/reco_sys/server/models/data/adult.test"
```

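Input function code

```
import tensorflow as tf

def input_fn(data_file, num_epochs, shuffle, batch_size):
  def parse_csv(value):
    # Split one CSV line into typed columns (tf.decode_csv is the TF 1.x name)
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    # Boolean label: True when income is above 50K
    classes = tf.equal(labels, '>50K')
    return features, classes

  # Read the CSV file line by line
  dataset = tf.data.TextLineDataset(data_file)
  if shuffle:
    # Buffer covers the whole training set (32561 examples)
    dataset = dataset.shuffle(buffer_size=32561)
  dataset = dataset.map(parse_csv)
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)
  return dataset
```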
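To see what the parsing step produces for a single record, here is a plain-Python sketch using only the standard library. It mimics the idea of the CSV parsing above (defaults double as type hints, the label column is popped and compared with '>50K') but it is not TensorFlow's implementation. `parse_line` is a hypothetical helper, and the sample line assumes a capital_loss of 0 and values without surrounding whitespace.

```python
import csv

CSV_COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'gender',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income_bracket',
]
# An int default means "parse as int"; a str default means "keep as text".
CSV_COLUMN_DEFAULTS = [0, '', 0, '', 0, '', '', '', '', '', 0, 0, 0, '', '']

def parse_line(line):
    """Return (feature_dict, label) for one CSV line."""
    values = next(csv.reader([line]))
    typed = [type(default)(value) if value != '' else default
             for value, default in zip(values, CSV_COLUMN_DEFAULTS)]
    features = dict(zip(CSV_COLUMNS, typed))
    label = features.pop('income_bracket') == '>50K'
    return features, label

# Sample record; the capital_loss value (0) is an assumption for illustration.
line = ("39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,"
        "Not-in-family,White,Male,2174,0,40,United-States,<=50K")
features, label = parse_line(line)
print(features['age'], features['workclass'], label)  # 39 State-gov False
```

The result has the shape an Estimator input function must return: a dictionary keyed by feature name plus a label.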

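2. Select features for the model and apply feature engineering

Estimators use a mechanism called feature columns to describe how the model should interpret each raw input feature. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be one of the raw inputs in the original features dict (a base feature column), or any new column created by transforming one or more base columns (a derived feature column).

A feature column is an abstract concept that represents any raw or derived variable that can be used to predict the target label.

Numeric columns

The simplest feature_column is numeric_column. It indicates that a feature is a numeric value that should be fed into the model directly. For example:

```
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

numeric_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]
```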

Categorical columns

To define a feature column for a categorical feature, create a CategoricalColumn using one of the tf.feature_column.categorical_column* functions. If you know the set of all possible feature values of a column and there are only a few of them, use categorical_column_with_vocabulary_list. Each key in the list is assigned an auto-incremented ID starting from 0. For example, for the relationship column we can assign the integer ID 0 to the feature string Husband, 1 to Not-in-family, and so on. When the full set of values is not known in advance, categorical_column_with_hash_bucket hashes each value into one of hash_bucket_size buckets, as is done for occupation here.

```
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship',
    ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative'])

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

categorical_columns = [relationship, occupation, education, marital_status, workclass]
```
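The two ways of turning strings into integer IDs can be sketched in plain Python. This is an illustration of the idea only: `vocabulary_id` and `hash_bucket_id` are hypothetical helpers, MD5 is a stand-in for whatever hash TensorFlow uses internally, and the -1 for unknown values mirrors the documented default of categorical_column_with_vocabulary_list.

```python
import hashlib

def vocabulary_id(value, vocabulary, default_id=-1):
    """Position of value in the vocabulary list, or default_id if absent."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return default_id

def hash_bucket_id(value, hash_bucket_size):
    """Stable bucket in [0, hash_bucket_size); MD5 is only a stand-in hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

relationship_vocab = ['Husband', 'Not-in-family', 'Wife', 'Own-child',
                      'Unmarried', 'Other-relative']

print(vocabulary_id('Husband', relationship_vocab))        # 0
print(vocabulary_id('Not-in-family', relationship_vocab))  # 1
print(vocabulary_id('Roommate', relationship_vocab))       # -1 (not in the list)

# No vocabulary is needed for hashing; distinct strings may share a bucket.
bucket = hash_bucket_id('Adm-clerical', 1000)
print(0 <= bucket < 1000)  # True
```

The trade-off is that a vocabulary list gives every known value its own ID, while hashing handles unseen values at the cost of possible collisions.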

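3. Model training and evaluation

The train method takes only a function object (here train_inpf) and calls it itself, so the arguments of our original input_fn cannot be supplied at the call site. To bind those arguments in advance, use functools.partial: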

```
import functools

def add(a, b):
    return a + b

add(4, 2)     # 6

plus3 = functools.partial(add, 3)
plus5 = functools.partial(add, 5)

plus3(4)      # 7
plus3(7)      # 10

plus5(10)     # 15
```

Applying partial to the dataset input function

```
import functools

train_inpf = functools.partial(input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)
```
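The sketch below shows what such a partial object carries, using a stand-in `input_fn` that simply reports the arguments it receives (an assumption for illustration; the real function builds a Dataset). The file name 'adult.data' is used without the full path.

```python
import functools

def input_fn(data_file, num_epochs, shuffle, batch_size):
    # Stand-in for the real input function: report the arguments received.
    return {'data_file': data_file, 'num_epochs': num_epochs,
            'shuffle': shuffle, 'batch_size': batch_size}

train_inpf = functools.partial(input_fn, 'adult.data',
                               num_epochs=2, shuffle=True, batch_size=64)

# The partial object can now be called with no arguments at all.
settings = train_inpf()
print(settings['num_epochs'], settings['batch_size'])  # 2 64

# The bound function and arguments remain inspectable.
print(train_inpf.func is input_fn)  # True
print(train_inpf.args)              # ('adult.data',)
```

Because all parameters are bound up front, the caller that eventually invokes the function does not need to know anything about files, epochs, or batch sizes.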

Initialize, train, and evaluate with tf.estimator:

```
classifier = tf.estimator.LinearClassifier(feature_columns=numeric_columns + categorical_columns)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)
# result is a dict containing the evaluation metrics
for key, value in sorted(result.items()):
  print('%s: %s' % (key, value))
```
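Since train is called here without a steps argument, training runs until the input function is exhausted. A small arithmetic sketch estimates how many batches that is, assuming every one of the 32561 training records parses to one example, that repeat is applied before batch (as in the input function above), and that the final partial batch is kept (the default behaviour of batch).

```python
import math

num_train_examples = 32561  # size of the training split stated earlier
num_epochs = 2              # bound into train_inpf
batch_size = 64             # bound into train_inpf

# repeat() comes before batch(), so batches are cut from the concatenated epochs.
total_examples = num_train_examples * num_epochs
num_batches = math.ceil(total_examples / batch_size)
last_batch_size = total_examples - (num_batches - 1) * batch_size

print(total_examples)                 # 65122
print(num_batches, last_batch_size)   # 1018 34
```

Under these assumptions the classifier takes 1018 training steps, the last of which sees only 34 examples.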

