Loading a large libsvm-format file with sklearn's load_svmlight_file function failed with an overflow error. The file held more than 10 million samples.
The exact error:
OverflowError: signed integer is greater than maximum.
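The problem was odd: loading had always worked, and the error only appeared after a fixed block of 128 features was added to every sample.
Reading the sklearn source explains it. sklearn stores the data in scipy's CSR sparse matrix format, and the loader builds the index arrays with the int typecode. A signed 32-bit int cannot hold a value above 2147483647, and indptr holds the running count of stored entries, so once the number of samples times the number of non-zero features per sample exceeds 2147483647, the append overflows and raises the error above. The following code from _svmlight_format.pyx defines the arrays; both indices and indptr use the "i" (int) typecode.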
    # Special-case float32 but use float64 for everything else;
    # the Python code will do further conversions.
    if dtype == np.float32:
        data = array.array("f")
    else:
        dtype = np.float64
        data = array.array("d")
    indices = array.array("i")
    indptr = array.array("i", [0])
    query = np.arange(0, dtype=np.int64)
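The overflow can be reproduced without sklearn. The following is a minimal sketch using only the standard library; it assumes the platform's C int is 4 bytes (the common case), and the figure of exactly 10 million samples is only illustrative.

```python
from array import array

# array("i") stores signed C ints; the width is platform-dependent
# (assumed to be 4 bytes here).
indptr = array("i", [0])
int_max = 2 ** (8 * indptr.itemsize - 1) - 1

indptr.append(int_max)  # the largest representable value still fits
try:
    indptr.append(int_max + 1)  # one more stored entry no longer fits
    overflowed = False
except OverflowError as exc:
    overflowed = True
    print("OverflowError:", exc)

# Illustrative figure: with exactly 10 million samples, the largest
# average number of non-zero features per sample that still fits.
n_samples = 10_000_000
max_avg_nnz = int_max // n_samples
print(int_max, max_avg_nnz)
```

With a 4-byte int this gives an average of 214 non-zero features per sample, which is why adding 128 dense features to every sample can be enough to push a previously loadable file over the limit.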
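To work around this, one can borrow the loader for libsvm-format files from liblinear. Open liblinear-2.21/python/commonutil.py: it provides the svm_read_problem function, which uses the long typecode ("l") for the index arrays and so avoids the overflow on large data. Note that "l" is a C long, which is 64 bits on 64-bit Linux and macOS but only 32 bits on Windows.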
    def svm_read_problem(data_file_name, return_scipy=False):
        """
        svm_read_problem(data_file_name, return_scipy=False) -> [y, x], y: list, x: list of dictionary
        svm_read_problem(data_file_name, return_scipy=True) -> [y, x], y: ndarray, x: csr_matrix

        Read LIBSVM-format data from data_file_name and return labels y
        and data instances x.
        """
        if scipy != None and return_scipy:
            prob_y = array('d')
            prob_x = array('d')
            row_ptr = array('l', [0])
            col_idx = array('l')
        else:
            prob_y = []
            prob_x = []
            row_ptr = [0]
            col_idx = []
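Whether "l" is actually wide enough depends on the platform, so it may be worth checking before relying on it. The sketch below prints the item size of each signed typecode and picks one that is at least 64 bits; pick_index_typecode is a hypothetical helper, not part of liblinear.

```python
from array import array

# Width in bytes of each signed typecode on the current platform.
sizes = {code: array(code).itemsize for code in ("i", "l", "q")}
print(sizes)


def pick_index_typecode():
    """Return a signed typecode at least 64 bits wide (hypothetical helper)."""
    # "q" is a C long long, which is at least 8 bytes everywhere.
    return "l" if array("l").itemsize >= 8 else "q"


code = pick_index_typecode()
row_ptr = array(code, [0])
row_ptr.append(2 ** 31)  # exceeds the 32-bit limit, fits in 64 bits
print(code, row_ptr[-1])
```

If the typecode is changed, the dtype used when the buffer is later converted to a NumPy array has to match it, for example "q" with a 64-bit integer dtype.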
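In this code, col_idx and row_ptr correspond to indices and indptr in the sklearn code above. Unlike load_svmlight_file, svm_read_problem has no parameter for the number of features (n_features in load_svmlight_file), so the function can be modified to accept one, as follows: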
    from array import array

    try:
        import numpy as np
        import scipy
        from scipy import sparse
    except ImportError:
        scipy = None
        sparse = None


    def svm_read_problem(data_file_name, n_features, return_scipy=False):
        """
        svm_read_problem(data_file_name, n_features, return_scipy=False) -> [y, x], y: list, x: list of dictionary
        svm_read_problem(data_file_name, n_features, return_scipy=True) -> [y, x], y: ndarray, x: csr_matrix

        Read LIBSVM-format data from data_file_name and return labels y
        and data instances x. n_features sets the number of columns of the
        csr_matrix and is only used when return_scipy=True.
        """
        use_scipy = scipy is not None and return_scipy
        if use_scipy:
            prob_y = array('d')
            prob_x = array('d')
            # 'l' is a C long: 64 bits on 64-bit Linux/macOS, 32 bits on Windows.
            row_ptr = array('l', [0])
            col_idx = array('l')
        else:
            prob_y = []
            prob_x = []
            row_ptr = [0]
            col_idx = []
        # Indices are treated as 1-based until a literal index 0 is seen;
        # entries read before that point keep the 1-based shift.
        indx_start = 1
        with open(data_file_name) as data_file:
            for line in data_file:
                line = line.split(None, 1)
                # In case an instance with all zero features
                if len(line) == 1:
                    line += ['']
                label, features = line
                prob_y.append(float(label))
                if use_scipy:
                    nz = 0
                    for e in features.split():
                        ind, val = e.split(":")
                        if ind == '0':
                            indx_start = 0
                        val = float(val)
                        if val != 0:
                            col_idx.append(int(ind) - indx_start)
                            prob_x.append(val)
                            nz += 1
                    row_ptr.append(row_ptr[-1] + nz)
                else:
                    xi = {}
                    for e in features.split():
                        ind, val = e.split(":")
                        xi[int(ind)] = float(val)
                    prob_x += [xi]
        if use_scipy:
            # np.frombuffer replaces the scipy.frombuffer alias used in liblinear.
            prob_y = np.frombuffer(prob_y, dtype='d')
            prob_x = np.frombuffer(prob_x, dtype='d')
            col_idx = np.frombuffer(col_idx, dtype='l')
            row_ptr = np.frombuffer(row_ptr, dtype='l')
            # Pass the shape explicitly so the column count is n_features.
            prob_x = sparse.csr_matrix((prob_x, col_idx, row_ptr),
                                       shape=(row_ptr.shape[0] - 1, n_features))
        return (prob_y, prob_x)
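To make the roles of row_ptr and col_idx concrete, here is a minimal standard-library sketch that builds the same three CSR arrays from a few in-memory lines. parse_libsvm_lines is a hypothetical helper written for illustration; it assumes 1-based feature indices and uses the "q" typecode so the index arrays are 64 bits on every platform.

```python
from array import array


def parse_libsvm_lines(lines):
    """Build CSR arrays from LIBSVM-format lines (hypothetical helper)."""
    labels = array("d")
    data = array("d")
    col_idx = array("q")        # column index of each stored value
    row_ptr = array("q", [0])   # running count of stored values per row
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        labels.append(float(parts[0]))
        features = parts[1] if len(parts) > 1 else ""
        nz = 0
        for item in features.split():
            ind, val = item.split(":")
            val = float(val)
            if val != 0:
                col_idx.append(int(ind) - 1)  # assumes 1-based indices
                data.append(val)
                nz += 1
        row_ptr.append(row_ptr[-1] + nz)
    return labels, data, col_idx, row_ptr


sample = ["1 1:0.5 3:2.0", "-1 2:1.5", "1"]
labels, data, col_idx, row_ptr = parse_libsvm_lines(sample)
print(list(data))     # [0.5, 2.0, 1.5]
print(list(col_idx))  # [0, 2, 1]
print(list(row_ptr))  # [0, 2, 3, 3]
```

The last element of row_ptr is the total number of stored values, which is the quantity that outgrows a 32-bit int on large files. The third sample line has no features, so its row contributes nothing and row_ptr repeats the previous value.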