How to fix the array overflow when loading libsvm-format data with sklearn?

Original

2018-12-15 01:06

While loading a large libsvm file (more than 10 million samples) with sklearn's load_svmlight_file function, I ran into an integer overflow error.

The exact error:

OverflowError: signed integer is greater than maximum.

The odd part is that everything had worked before. The error only appeared after I added a fixed set of 128 features to every sample.

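Reading the sklearn source shows that sklearn stores the data in scipy's CSR sparse matrix layout and builds the index arrays with the C int type. The indptr array holds the running count of stored entries, so that count cannot exceed 2147483647. Once (number of samples) * (stored features per sample) goes beyond 2147483647, the value no longer fits in an int and the error above is raised. Below is the code in _svmlight_format.pyx that defines the arrays; both indices and indptr use the i (int) typecode.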

# Special-case float32 but use float64 for everything else;
# the Python code will do further conversions.
if dtype == np.float32:
    data = array.array("f")
else:
    dtype = np.float64
    data = array.array("d")
indices = array.array("i")
indptr = array.array("i", [0])
query = np.arange(0, dtype=np.int64)
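The overflow can be reproduced without sklearn. The following is a minimal sketch using only the standard array module. The helper names (fits_in_int32, try_append) and the sample sizes are hypothetical, chosen only to illustrate a dataset that fits before extra features are added and overflows afterwards. It also assumes the usual case where the "i" typecode is a 4-byte C int.

```python
from array import array

INT32_MAX = 2**31 - 1  # largest value a 4-byte C int (typecode "i") can hold


def fits_in_int32(n_samples, nnz_per_sample):
    """Return True if the total number of stored entries fits in a C int."""
    return n_samples * nnz_per_sample <= INT32_MAX


def try_append(typecode, value):
    """Append value to a fresh array of the given typecode and report the outcome."""
    buf = array(typecode, [0])
    try:
        buf.append(value)
        return "ok"
    except OverflowError as exc:
        return "OverflowError: {}".format(exc)


# Hypothetical sizes: 12 million samples with 100 stored features each,
# then the same samples after 128 extra features are added to every row.
print(fits_in_int32(12_000_000, 100))        # 1.2e9 entries
print(fits_in_int32(12_000_000, 100 + 128))  # about 2.7e9 entries

# A cumulative count one past INT32_MAX cannot be stored with typecode "i",
# while the 8-byte "q" typecode accepts it.
print(try_append("i", INT32_MAX + 1))
print(try_append("q", INT32_MAX + 1))
```

On CPython the failing append should report an OverflowError like the one sklearn raised, which suggests the error comes from the array typecode rather than from the file contents.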

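To work around this, you can borrow the libsvm-format loader from liblinear. Open liblinear-2.21/python/commonutil.py; it provides the svm_read_problem function, which uses the long type for the index arrays and so avoids the overflow on large data.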

def svm_read_problem(data_file_name,return_scipy=False):
    """
    svm_read_problem(data_file_name, return_scipy=False) -> [y, x], y: list, x: list of dictionary
    svm_read_problem(data_file_name, return_scipy=True)  -> [y, x], y: ndarray, x: csr_matrix

    Read LIBSVM-format data from data_file_name and return labels y
    and data instances x.
    """
    if scipy != None and return_scipy:
        prob_y = array('d')
        prob_x = array('d')
        row_ptr = array('l', [0])
        col_idx = array('l')
    else:
        prob_y = []
        prob_x = []
        row_ptr = [0]
        col_idx = []
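One caveat with the 'l' typecode: it maps to C long, which is 8 bytes on typical 64-bit Linux and macOS builds but only 4 bytes on Windows, where it would overflow just like 'i'. A small sketch of a portable choice follows; pick_index_typecode is a hypothetical helper, not part of liblinear.

```python
from array import array


def pick_index_typecode():
    """Return an array typecode whose items are at least 64 bits wide."""
    # "l" is C long: wide enough on most 64-bit Unix builds, 4 bytes on Windows.
    # "q" is C long long, which is at least 8 bytes everywhere.
    return "l" if array("l").itemsize >= 8 else "q"


typecode = pick_index_typecode()
row_ptr = array(typecode, [0])
row_ptr.append(2**31)  # one past the int32 limit; a 4-byte typecode would raise here
print(typecode, row_ptr.itemsize, row_ptr[-1])
```

If the typecode changes, the dtype passed to frombuffer later has to describe the same item width (for example a 64-bit integer dtype for "q"), otherwise the buffer would be reinterpreted incorrectly.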

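In this code, col_idx and row_ptr correspond to indices and indptr in the sklearn code above. Unlike load_svmlight_file, svm_read_problem has no parameter for the number of features (n_features in load_svmlight_file), so you can modify svm_read_problem to add an n_features parameter, as follows: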

from array import array

try:
    import numpy as np
    import scipy
    from scipy import sparse
except ImportError:
    # Optional dependency: without scipy only the list/dict output is available.
    scipy = None
    sparse = None

def svm_read_problem(data_file_name, n_features, return_scipy=False):
    """
    svm_read_problem(data_file_name, n_features, return_scipy=False) -> [y, x], y: list, x: list of dictionary
    svm_read_problem(data_file_name, n_features, return_scipy=True)  -> [y, x], y: ndarray, x: csr_matrix

    Read LIBSVM-format data from data_file_name and return labels y
    and data instances x. n_features is the number of columns of the
    csr_matrix and is only used when return_scipy=True.
    """
    if scipy is not None and return_scipy:
        prob_y = array('d')
        prob_x = array('d')
        # 'l' (C long) instead of 'i' (C int) so large counts do not overflow
        row_ptr = array('l', [0])
        col_idx = array('l')
    else:
        prob_y = []
        prob_x = []
        row_ptr = [0]
        col_idx = []
    indx_start = 1
    with open(data_file_name) as data_file:
        for line in data_file:
            line = line.split(None, 1)
            # In case an instance with all zero features
            if len(line) == 1: line += ['']
            label, features = line
            prob_y.append(float(label))
            if scipy is not None and return_scipy:
                nz = 0
                for e in features.split():
                    ind, val = e.split(":")
                    # An explicit index 0 means the file is zero-based
                    if ind == '0':
                        indx_start = 0
                    val = float(val)
                    if val != 0:
                        col_idx.append(int(ind)-indx_start)
                        prob_x.append(val)
                        nz += 1
                row_ptr.append(row_ptr[-1]+nz)
            else:
                xi = {}
                for e in features.split():
                    ind, val = e.split(":")
                    xi[int(ind)] = float(val)
                prob_x += [xi]
    if scipy is not None and return_scipy:
        # View the array buffers as numpy arrays without copying
        prob_y = np.frombuffer(prob_y, dtype='d')
        prob_x = np.frombuffer(prob_x, dtype='d')
        col_idx = np.frombuffer(col_idx, dtype='l')
        row_ptr = np.frombuffer(row_ptr, dtype='l')
        # n_features must be larger than the largest column index in the file
        prob_x = sparse.csr_matrix((prob_x, col_idx, row_ptr), shape=(row_ptr.shape[0]-1, n_features))
    return (prob_y, prob_x)
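To make the relationship between the text format and the CSR arrays concrete, here is a small standard-library sketch of the same bookkeeping on three in-memory lines. The names parse_libsvm_lines and row_as_dict and the sample data are hypothetical. The "q" typecode is used so the indices are 64-bit on every platform, and one-based feature indices are assumed.

```python
from array import array


def parse_libsvm_lines(lines, index_base=1):
    """Build labels plus CSR-style arrays (values, col_idx, row_ptr) from LIBSVM lines."""
    labels = array("d")
    values = array("d")
    col_idx = array("q")       # column index of every stored value
    row_ptr = array("q", [0])  # cumulative count of stored values per row
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        labels.append(float(parts[0]))
        features = parts[1] if len(parts) > 1 else ""
        for item in features.split():
            ind, val = item.split(":")
            val = float(val)
            if val != 0:
                col_idx.append(int(ind) - index_base)
                values.append(val)
        row_ptr.append(len(values))
    return labels, values, col_idx, row_ptr


def row_as_dict(i, values, col_idx, row_ptr):
    """Recover row i as {column: value} using the row_ptr boundaries."""
    start, end = row_ptr[i], row_ptr[i + 1]
    return {col_idx[k]: values[k] for k in range(start, end)}


sample = ["1 1:0.5 3:2.0", "-1", "1 2:1.5"]
labels, values, col_idx, row_ptr = parse_libsvm_lines(sample)
print(list(row_ptr), list(col_idx), list(values))
print(row_as_dict(0, values, col_idx, row_ptr))
```

The last entry of row_ptr is the total number of stored values, which is exactly the quantity that has to stay within the range of the index type.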
