Integrating HpBandSter: Development and Retrospective

Development Progress Panel

TODO LIST

  1. Support warm starting, and adjust timestamps so that HpBandSter's visualization keeps working (high) OK
  2. Port SMAC's incumbent printing (investigate incumbent_trajectory) (high) OK
  3. When loading the best model, pick the one trained with the largest budget (medium) OK
  4. Support validation data of different sizes in ensemble learning (see PoSH-AutoSklearn) (medium) OK
  5. Support the iterations budget_mode, and cache workflows in redis (investigate redis persistence) (high) OK
  6. Have SuccessiveHalving delete the redis key-value pairs of discarded models (medium)
  7. Investigate BOHB's KDE model further and test it thoroughly (low)
  8. Integrate scikit-optimize's GP, RF, ET, and GBRT surrogate models (low)
  9. Port SMAC's local search (low)
  10. Investigate scikit-optimize's EI, PI, and LCB acquisition functions, and integrate EIPS, PIPS, and similar acquisition functions (low)
  11. Integrate scikit-optimize's visualization (low)
  12. Integrate BOAH's visualization (low)
  13. Integrate HyperOpt's TPE and adaptive-TPE surrogate models (low)
  14. Investigate hyperparameter-optimization benchmarks and test on them (low)
  15. Re-enable manual modeling (medium)
  16. Get all unit tests passing again (except HttpClient) (medium)
  17. Investigate how HpBandSter represents failed records, and adapt to that representation OK
  18. Avoid identical samples when BOHB's random sampling is launched twice? OK

Edge Cases Panel

  1. If the ConfigSpace sample space is very small, say fewer than 100 or even fewer than 20 possible configurations, could we automatically switch to grid search?
  2. Constrain budget_id in the _get_sorted_trial_records and _get_best_k_trial_ids functions, because the user may have changed the budget scheme: budget=4 may originally have meant 5 folds and later 10 folds. Fortunately, budget_id is already one of the where conditions during warm starting (a hedged sketch of such a query follows this list).
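
A hedged sketch (peewee-style; TrialModel and its fields are assumed names for illustration, not the project's actual schema) of constraining the warm-start query by budget_id:

records = (TrialModel
           .select()
           .where((TrialModel.task_id == task_id) &
                  (TrialModel.hdl_id == hdl_id) &
                  (TrialModel.budget_id == budget_id))  # the budget scheme must match
           .order_by(TrialModel.loss.asc()))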

HpBandSter Integration: Development Plan

Instantiating BOHB (or any other Master) requires providing a ConfigSpace:

class BOHB(Master):
    def __init__(
            self,
            configspace=None,
            eta=3,
            min_budget=0.01,
            max_budget=1,
            min_points_in_model=None,
            top_n_percent=15,
            num_samples=64,
            random_fraction=1 / 3,
            bandwidth_factor=3,
            min_bandwidth=1e-3,
            **kwargs
    ):

Configuration space:

configspace=None

HyperBand-related:

eta=3,
min_budget=0.01,
max_budget=1,

CG_BOHB-related:

min_points_in_model=None,
top_n_percent=15,
num_samples=64,
random_fraction=1 / 3,
bandwidth_factor=3,
min_bandwidth=1e-3,
Iteration types:

  - HyperBand
  - SuccessiveHalving
  - Simple

However, Simple can actually be seen as a special case of SuccessiveHalving (min_budget == max_budget).

So only 4 parameters are needed to cover all three iteration types (see the sketch after this list):

Iteration control parameters:

  - min_budget
  - max_budget
  - eta (η)
  - SH_only (this parameter follows PoSH-AutoSklearn)
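
A minimal sketch of how the four parameters could select the iteration type; infer_iteration_type is a hypothetical helper for illustration, not a function in HpBandSter:

def infer_iteration_type(min_budget, max_budget, eta, SH_only):
    # a single budget level degenerates SuccessiveHalving into Simple
    if min_budget == max_budget:
        return "Simple"
    # SH_only (from PoSH-AutoSklearn) pins the iteration class to SuccessiveHalving
    if SH_only:
        return "SuccessiveHalving"
    # otherwise run the full HyperBand schedule over the budget ladder
    return "HyperBand"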

Besides the iteration control parameters above, the user provides a configuration sampler. After fit is started, the ConfigSpace is passed in, which together forms an Optimizer object; this class inherits from Master.

Refactor fit by splitting it into the following phases:

  - input_experimental_data: pass in all the data the experiment depends on, and compute task_id and hdl_id.
  - start_nameserver: assemble task_id, hdl_id, and user_id into run_id, then start the NS together with the ns_port and other parameters passed in the Estimator constructor.
  - run_evaluators: instantiate the workers from the experiment data passed to input_experimental_data, plus the n_workers and worker_host constructor parameters, then start them. The run function is overridden, adding a concurrent_type parameter.
  - run_optimizer: after input_experimental_data has run, instantiate the Optimizer object from the ConfigSpace parameter together with the other constructor parameters, then start it.

For semantic compatibility with HpBandSter:
run_workers = run_evaluators
run_master = run_optimizer
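
Putting the phases together, a minimal sketch of the refactored fit (assuming the phase methods keep the names from the list above):

def fit(self, X_train, y_train, X_test=None, y_test=None):
    # phase 1: pass in experiment data, compute task_id and hdl_id
    self.input_experimental_data(X_train, y_train, X_test, y_test)
    # phase 2: run_id is assembled from task_id, hdl_id, user_id; start the NS
    self.start_nameserver()
    # phases 3 and 4, via the HpBandSter-compatible aliases
    self.run_workers()   # = run_evaluators
    self.run_master()    # = run_optimizer
    return self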

Some fields newly added to the trial table:

config = self.JSONField(default={})  # new
config_info = self.JSONField(default={})  # new
budget_id = pw.FixedCharField(max_length=32)  # new
budget = pw.FloatField()                      # new
timestamps = self.JSONField(default={}, null=True)  # new

I modified HpBandSter's code to add a config_info field to the compute function. Now only the timestamps field needs to be inserted by DatabaseResultLogger, acting as a result_logger.
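
A hedged sketch of what such a DatabaseResultLogger could look like, mirroring the interface of hpbandster's json_result_logger; update_trial_timestamps is an assumed helper, not the actual resource-manager API:

class DatabaseResultLogger(object):
    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def new_config(self, config_id, config, config_info):
        # config and config_info are already persisted by the evaluator
        pass

    def __call__(self, job):
        # only timestamps needs to be recorded through the result_logger path
        self.resource_manager.update_trial_timestamps(job.id, job.timestamps)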


Where the Result class reads the HB_config dict:

HB_config['max_budget']
HB_config['min_budget']
HB_config['time_ref'] (_merge_results)
HB_config['budgets']

Budget interpretation and representation: a retrospective

First, the user specifies min_budget, max_budget, and eta in core/base.py. For holdout validation, max_budget is 1; otherwise it is generally eta. The mapping budget2kfold = {eta: n_splits} interprets a budget as a full K-fold validation; users can also customize budget2kfold.
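
For example (numbers assumed for illustration):

# with min_budget = 0.25, max_budget = 4 and eta = 4, the budget ladder is 0.25, 1, 4:
#   budget 0.25 -> train on a fraction of the data / iterations
#   budget 1    -> full data, holdout validation
#   budget 4    -> interpreted through budget2kfold as 10-fold cross-validation
min_budget, max_budget, eta = 0.25, 4, 4
budget2kfold = {4: 10}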

Now enter evaluation.train_evaluator.TrainEvaluator#evaluate.

If the model currently being validated is fitted by an iterative algorithm, max_iter should be specified:

is_iter_algo = self.algo2iter.get(final_model_name) is not None
max_iter = -1
# if final model is iterative algorithm, max_iter should be specified
if is_iter_algo:
    if budget_mode == ITERATIONS_BUDGET_MODE:
        fraction = min(1, budget)
    else:
        fraction = 1
    max_iter = max(round(self.algo2iter[final_model_name] * fraction), 1)

fraction is an adjustment coefficient computed from the budget; note that it must never exceed 1. For example, with algo2iter[final_model_name] = 1000 and budget = 0.25 under the iterations budget_mode, max_iter = 250.

For the subsamples budget_mode, when budget < 1 and the fold index is 0, the rows of X_train and y_train are subsampled. If the subsampled X_train has fewer rows than columns (a short, wide feature matrix, which risks over-fitting), the columns should also be sampled down until rows == columns. X_valid and X_test must be column-sampled the same way so the data still runs through the pipeline.

# subsamples budget_mode.
if fold_ix == 0 and budget_mode == SUBSAMPLES_BUDGET_MODE and budget < 1:
    X_train, y_train, (X_valid, X_test) = implement_subsample_budget(
        X_train, y_train, [X_valid, X_test],
        budget, self.random_state
    )

evaluation.budget.implement_subsample_budget

def implement_subsample_budget(
        X_train: DataFrameContainer, y_train: NdArrayContainer,
        Xs: List[Optional[DataFrameContainer]],
        budget, random_state: int
) -> Tuple[DataFrameContainer, NdArrayContainer, List[Optional[DataFrameContainer]]]:
    rng = np.random.RandomState(random_state)
    samples = round(X_train.shape[0] * budget)
    features = X_train.shape[1]
    sub_sample_index = get_stratified_sampling_index(y_train.data, budget, random_state)
    # sub sampling X_train, y_train
    X_train = X_train.sub_sample(sub_sample_index)
    y_train = y_train.sub_sample(sub_sample_index)
    # if features > samples, sub-sample the features as well to avoid over-fitting
    if features > samples:
        sub_feature_index = rng.permutation(X_train.shape[1])[:samples]
        X_train = X_train.sub_feature(sub_feature_index)
        res_Xs = []
        for X in Xs:
            res_Xs.append(X.sub_feature(sub_feature_index) if X is not None else None)
    else:
        res_Xs = Xs
    return X_train, y_train, res_Xs

Notes on the stratified sampling (a minimal sketch follows this list):

  1. For regression targets, KBinsDiscretizer splits the target into 5 bins with strategy="kmeans".
  2. Every label must keep at least one sample; otherwise an unseen-label error occurs in the regression task.
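
A minimal sketch of what get_stratified_sampling_index might look like under these two notes (the actual implementation is not shown in this post):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def get_stratified_sampling_index(y, budget, random_state, n_bins=5):
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.floating):
        # note 1: regression targets are discretized into 5 bins with kmeans
        kbins = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="kmeans")
        labels = kbins.fit_transform(y.reshape(-1, 1)).ravel().astype(int)
    else:
        labels = y
    index = []
    for label in np.unique(labels):
        label_index = np.flatnonzero(labels == label)
        # note 2: keep at least one sample per label to avoid unseen-label errors
        n_keep = max(1, round(len(label_index) * budget))
        index.extend(rng.choice(label_index, n_keep, replace=False))
    return np.sort(np.array(index))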

So that a previously fitted model is not refitted from scratch when only the final_model's max_iter is increased, I designed a cache system that saves the model. The key rule is cache_key = self.get_cache_key(config_id, X_train, y_train); if the cache entry exists, the loaded model is assigned to cloned_model:

cached_model = self.resource_manager.cache.get(cache_key)
if cached_model is not None:
    cloned_model = cached_model
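
get_cache_key itself is not shown in this post; a hedged sketch of the idea, assuming the data containers expose a content hash:

from hashlib import md5

def get_cache_key(self, config_id, X_train, y_train):
    m = md5()
    m.update(str(config_id).encode())
    # assumption: the containers provide a hash over their contents, so the same
    # config on the same data always hits the same entry regardless of max_iter
    m.update(X_train.get_hash().encode())
    m.update(y_train.get_hash().encode())
    return m.hexdigest()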

Look at workflow.ml_workflow.ML_Workflow#procedure; this function gained a max_iter parameter to accommodate the iterations budget_mode.

if max_iter > 0:
    # set final model's max_iter param
    self[-1].set_max_iter(max_iter)
if max_iter > 0 and self.fitted:
    # workflow already fitted: transform the data once, then refit only the final model
    self.last_data = self.transform(X_train, X_valid, X_test, y_train)
    self[-1].fit(
        self.last_data.get("X_train"), self.last_data.get("y_train"),
        self.last_data.get("X_valid"), y_valid,
        self.last_data.get("X_test"), y_test
    )
else:
    self.fit(X_train, y_train, X_valid, y_valid, X_test, y_test)

Here, all iterative algorithms share the same set_max_iter function: for sklearn models it is implemented by the IterComponent class, and for boosting models by BoostingModelMixin.
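
A minimal sketch of that shared interface (the actual IterComponent and BoostingModelMixin code differs; self.hyperparams is an assumed attribute):

class IterComponent:
    # sklearn-style iterative models: max_iter is the iteration hyperparameter
    def set_max_iter(self, max_iter):
        self.hyperparams["max_iter"] = max_iter

class BoostingModelMixin:
    # boosting models: the number of estimators plays the role of max_iter
    def set_max_iter(self, max_iter):
        self.hyperparams["n_estimators"] = max_iter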

If max_iter > 0 and self.fitted, only the last model is fitted; otherwise the whole workflow is fitted from scratch.

After fitting, the model should be saved:

# save model as cache
if (budget_mode == ITERATIONS_BUDGET_MODE and budget <= 1 and
        isinstance(final_model, IterComponent)) or budget == 1:
    self.resource_manager.cache.set(cache_key, cloned_model)

isinstance(final_model, IterComponent) is a rather hacky condition: since the boosting models do not implement warm starting, caching them would be pointless.

Later we should support warm starting and incremental learning for boosting models; see https://gist.github.com/goraj/6df8f22a49534e042804a299d81bf2d6
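
For reference, incremental training of a boosting model could follow this pattern (a hedged sketch in the spirit of the gist; X_train and y_train are assumed to be available):

import lightgbm as lgb

train_set = lgb.Dataset(X_train, y_train)
params = {"objective": "regression", "verbosity": -1}
booster = None
for _ in range(3):  # grow the model by 100 trees at a time
    booster = lgb.train(params, train_set, num_boost_round=100,
                        init_model=booster,          # warm start from the previous booster
                        keep_training_booster=True)  # keep it trainable for the next round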

At the end of the K-fold loop body, if budget <= 1 this is holdout validation, so we break out at fold_ix == 0:

# when budget <= 1, holdout validation
if fold_ix == 0 and budget <= 1:
    break

Meanwhile, budget > 1 means the evaluation is cross-validation; we need to check how many folds have already run and break at the right time:

# when budget > 1, it is interpreted as a number of folds through 'budget2kfold';
# for example, budget = 4 with budget2kfold = {4: 10} means 10-fold cross-validation,
# so we break when fold_ix == 10 - 1 == 9
if budget > 1 and fold_ix == self.budget2kfold[budget] - 1:
    break

How HpBandSter handles a worker's failed job

hpbandster.core.worker.Worker#start_computation

        try:
            result = {'result': self.compute(*args, config_id=config_id, **kwargs),
                      'exception': None}
        except Exception as e:
            result = {'result': None,
                      'exception': traceback.format_exc()}

If an exception occurs, result is None and the exception traceback is captured.

The current approach is that debug defaults to False, i.e. exceptions are handled inside the evaluate function itself.

This result = None is passed along until it finally ends up here, in hpbandster.optimizers.config_generators.bohb.BOHB#new_result:

        if job.result is None:
            # One could skip crashed results, but we decided to
            # assign a +inf loss and count them as bad configurations
            loss = np.inf
        else:
            # same for non numeric losses.
            # Note that this means losses of minus infinity will count as bad!
            loss = job.result["loss"] if np.isfinite(job.result["loss"]) else np.inf

Conclusions:

  1. The current form does not need to change, because run statuses (SUCCESS, FAILED, TIMEOUT, etc.) can all be recorded inside the evaluate function and inserted into the trial table.
  2. Keep in mind that when debug=True, exceptions should be raised and propagate normally.
  3. Open question: if an exception occurs not inside the ML_Workflow.procedure call wrapped by evaluate, but outside that function, does it affect the correct operation of the system?

BOHB: bugs and new features, a retrospective

1. Feature: implement config deduplication

During development I noticed a problem: BOHB can recommend configs that were generated before, usually because identical ConfigSpace random seeds lead to identical samples.

So I folded the config recommendation logic into _get_config and implemented config deduplication in get_config:

def get_config(self, budget):
    max_sample = 1000
    i = 0
    # if there is sampling history for this budget, resample until the config is unseen
    while i < max_sample and self.configs.get(budget) is not None:
        i += 1
        sample, info_dict = self._get_config(budget)
        array: np.ndarray = ConfigSpace.Configuration(
            configuration_space=self.configspace, values=sample).get_array()
        X = np.array(self.configs[budget])
        # NaN (inactive hyperparameters) breaks equality checks, so map it to -1
        array[np.isnan(array)] = -1
        X[np.isnan(X)] = -1
        if np.any(np.all(array == X, axis=1)):
            self.logger.info(f"The sample already exists and needs to be resampled. "
                             f"This is the {i}-th resampling.")
            self.logger.debug(f"Config = \n{sample}")
        else:
            return sample, info_dict
    # either no history exists yet or deduplication failed max_sample times:
    # reseed the ConfigSpace and fall back to a random sample
    seed = np.random.randint(1, 8888)
    self.configspace.seed(seed)
    sample = self.configspace.sample_configuration().get_dictionary()
    info_dict = {
        "model_based_pick": False,
        "seed": seed,
        "sampling_different_samples_failed": True
    }
    return sample, info_dict

The maximum tolerance for resampling is 1000 attempts.

Recording an edge case: if the ConfigSpace sample space is very small, say fewer than 100 or even fewer than 20 possible configurations, could we automatically switch to grid search?

2. When converting a vector to a Configuration, give up after a single failure.
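
A hedged sketch of that change, simplified from hpbandster's BOHB.get_config (best_vector, self.configspace, and self.logger come from that context):

try:
    sample = ConfigSpace.Configuration(
        configuration_space=self.configspace, vector=best_vector
    ).get_dictionary()
except Exception:
    # previously the conversion could be retried; now a single failure
    # falls back to random sampling immediately
    self.logger.warning("Converting the vector to a Configuration failed once; "
                        "falling back to random sampling.")
    sample = self.configspace.sample_configuration().get_dictionary()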

3. Division-by-zero error in statsmodels

statsmodels/nonparametric/kernels.py:62


def aitchison_aitken(h, Xi, x, num_levels=None):
    Xi = Xi.reshape(Xi.size)  # seems needed in case Xi is scalar
    if num_levels is None:
        num_levels = np.asarray(np.unique(Xi).size)

    # when Xi contains only one unique level, num_levels - 1 == 0
    # and the next line is where the division by zero happens
    kernel_value = np.ones(Xi.size) * h / (num_levels - 1)
    idx = Xi == x
    kernel_value[idx] = (idx * (1 - h))[idx]
    return kernel_value