Contents
class torch.utils.data.Sampler(data_source)
class torch.utils.data.SequentialSampler(data_source)
class torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None)
class torch.utils.data.SubsetRandomSampler(indices)
class torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)
class torch.utils.data.BatchSampler(sampler, batch_size, drop_last)
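A sampler yields a sequence of indices that are used to look up samples in the dataset. The total number of indices usually equals the length of the dataset.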
class torch.utils.data.Sampler(data_source)
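Base class for all samplers.
Every Sampler subclass must provide an __iter__() method, which gives a way to iterate over the indices of dataset elements, and a __len__() method that returns the length of the returned iterators.
Note:
The __len__() method is not strictly required by DataLoader, but it is expected in any calculation involving the length of a DataLoader.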
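The contract described above can be illustrated with a small subclass. This is a minimal sketch, not part of the library: `EvenIndexSampler` is a hypothetical name, and it deliberately does not call `Sampler.__init__`, on the assumption that the base constructor's signature may differ between PyTorch releases.

```python
from torch.utils.data import Sampler


class EvenIndexSampler(Sampler):
    """Hypothetical sampler that yields only the even indices of a dataset."""

    def __init__(self, data_source):
        # Store the data source ourselves instead of relying on the base
        # class constructor (assumption: its signature varies by release).
        self.data_source = data_source

    def __iter__(self):
        # Iterate over dataset indices, not over the elements themselves.
        return iter(range(0, len(self.data_source), 2))

    def __len__(self):
        # Number of indices that __iter__ will yield.
        return (len(self.data_source) + 1) // 2


sampler = EvenIndexSampler(range(7))
print(list(sampler))  # [0, 2, 4, 6]
print(len(sampler))   # 4
```

Keeping `__len__` consistent with what `__iter__` yields matters because length calculations on a DataLoader rely on it.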
class torch.utils.data.SequentialSampler(data_source)
Samples elements sequentially, always in the same order.
Parameters:
- data_source (Dataset) – dataset to sample from
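A short sketch of the "always in the same order" behaviour. A plain Python list stands in for a Dataset here, which is an assumption made for brevity, since only its length is used.

```python
from torch.utils.data import SequentialSampler

data = ["a", "b", "c", "d"]  # stand-in for a Dataset; only len() is needed
sampler = SequentialSampler(data)

first_pass = list(sampler)
second_pass = list(sampler)
print(first_pass)                 # [0, 1, 2, 3]
print(first_pass == second_pass)  # True
```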
class torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None)
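Samples elements randomly. Without replacement, samples are drawn from a shuffled dataset. With replacement, the user can specify num_samples, the number of samples to draw.
Parameters:
- data_source (Dataset) – dataset to sample from
- replacement (bool) – if True, samples are drawn with replacement; default False
- num_samples (int) – number of samples to draw; defaults to the length of the dataset. This argument should be specified only when replacement is True.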
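The two modes can be compared side by side. This is a sketch under the assumption that a plain list is an acceptable data source. The seed is set only to make repeated runs comparable, and no particular index order is claimed.

```python
import torch
from torch.utils.data import RandomSampler

torch.manual_seed(0)  # only to make the sketch repeatable
data = list(range(5))

# Without replacement: one pass is a permutation of all indices.
perm = list(RandomSampler(data))
print(sorted(perm))  # [0, 1, 2, 3, 4]

# With replacement: num_samples may exceed the dataset length,
# so some indices necessarily repeat.
with_repl = list(RandomSampler(data, replacement=True, num_samples=8))
print(len(with_repl))  # 8
```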
class torch.utils.data.SubsetRandomSampler(indices)
Samples elements randomly from a given list of indices, without replacement.
Parameters:
- indices (sequence) – a sequence of indices
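One common use, sketched here as an assumption rather than something this page prescribes, is to split one dataset into training and validation subsets by index. The 80/20 split and the variable names are hypothetical.

```python
from torch.utils.data import SubsetRandomSampler

num_items = 10
indices = list(range(num_items))
train_indices, val_indices = indices[:8], indices[8:]  # hypothetical 80/20 split

train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)

drawn = list(train_sampler)
# Every training index appears exactly once, in random order.
print(sorted(drawn) == train_indices)  # True
print(len(val_sampler))                # 2
```

Each sampler could then be passed as the `sampler` argument of its own DataLoader over the same dataset.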
class torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)
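Samples elements from [0,..,len(weights)-1] with the given probabilities (weights).
Parameters:
- weights (sequence) – a sequence of weights; they do not need to sum to one
- num_samples (int) – number of samples to draw
- replacement (bool) – if True, samples are drawn with replacement. If False, they are drawn without replacement, which means that once a sample index has been drawn for a row, it cannot be drawn again for that row.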
Example
>>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
[4, 4, 1, 4, 5]
>>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
[0, 1, 4, 3, 2]
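Because the weights need not sum to one, a per-sample weight can be derived directly from class frequencies. The following sketch assumes a hypothetical, imbalanced label list and uses inverse class frequency as the weight. This is one possible scheme, not one mandated by the library.

```python
from collections import Counter

from torch.utils.data import WeightedRandomSampler

labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical imbalanced labels
class_counts = Counter(labels)

# One weight per sample: the inverse frequency of that sample's class.
# Class 0 gets 1/8 per sample and class 1 gets 1/2, so both classes
# carry the same total weight.
weights = [1.0 / class_counts[label] for label in labels]

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
drawn = list(sampler)
print(len(drawn))  # 10
```

With these weights each class is drawn with equal probability overall, although any single pass of ten draws can still be uneven.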
class torch.utils.data.BatchSampler(sampler, batch_size, drop_last)
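Wraps another sampler to yield mini-batches of indices.
Parameters:
- sampler (Sampler or Iterable) – base sampler; can be any iterable object with __len__() implemented
- batch_size (int) – size of a mini-batch
- drop_last (bool) – if True, the sampler drops the last batch when its size would be less than batch_size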
Example:
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
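The effect of drop_last on the reported length can be checked against the batches actually produced. This is a small sketch reusing the values from the example above: ten indices and a batch size of three.

```python
from torch.utils.data import BatchSampler, SequentialSampler

base = SequentialSampler(range(10))

keep_last = BatchSampler(base, batch_size=3, drop_last=False)
drop_last = BatchSampler(base, batch_size=3, drop_last=True)

# len() agrees with the number of batches actually produced:
# ceil(10 / 3) when the short batch is kept, floor(10 / 3) when dropped.
print(len(keep_last), len(list(keep_last)))  # 4 4
print(len(drop_last), len(list(drop_last)))  # 3 3
```

A BatchSampler built this way can be passed to DataLoader through its `batch_sampler` argument.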
class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0)
Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSampler instance as a DataLoader sampler and load a subset of the original dataset that is exclusive to it.
Note:
The dataset is assumed to be of constant size.
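Parameters:
- dataset – dataset used for sampling
- num_replicas (int, optional) – number of processes participating in distributed training. By default, the world size is retrieved from the current distributed group.
- rank (int, optional) – rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.
- shuffle (bool, optional) – if True, the sampler shuffles the indices
- seed (int, optional) – random seed used to shuffle the sampler when shuffle is True. This number should be identical across all processes in the distributed group. Default: 0.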
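Note:
In distributed mode, calling the set_epoch(epoch) method at the beginning of each epoch, before creating the DataLoader iterator, is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering is always used.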
Example:
>>> sampler = DistributedSampler(dataset) if is_distributed else None
>>> loader = DataLoader(dataset, shuffle=(sampler is None),
...                     sampler=sampler)
>>> for epoch in range(start_epoch, n_epochs):
...     if is_distributed:
...         sampler.set_epoch(epoch)
...     train(loader)
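How the dataset is divided between processes can be inspected on a single machine. The sketch below assumes that passing both num_replicas and rank explicitly avoids the need for an initialized process group, and it uses `range(10)` as a stand-in for a Dataset because only its length is needed.

```python
from torch.utils.data.distributed import DistributedSampler

dataset = range(10)  # stand-in for a Dataset; only len() is needed here

# Without shuffling, each rank takes every num_replicas-th index,
# starting at its own rank.
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False))
    for rank in range(2)
]
print(shards[0])  # [0, 2, 4, 6, 8]
print(shards[1])  # [1, 3, 5, 7, 9]

# With shuffling, every rank derives the same permutation from the seed and
# the epoch, so the shards stay disjoint as long as set_epoch receives the
# same value on all ranks.
samplers = [
    DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True)
    for rank in range(2)
]
for s in samplers:
    s.set_epoch(1)
covered = sorted(i for s in samplers for i in s)
print(covered == list(range(10)))  # True
```

When the dataset length is not divisible by num_replicas, some indices are repeated so that every rank receives the same number of samples; the example avoids that case by using an even split.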
Source code
import torch
from torch._six import int_classes as _int_classes
class Sampler(object):
    r"""Base class for all Samplers.

    Every Sampler subclass has to provide an :meth:`__iter__` method, providing a
    way to iterate over indices of dataset elements, and a :meth:`__len__` method
    that returns the length of the returned iterators.

    .. note:: The :meth:`__len__` method isn't strictly required by
              :class:`~torch.utils.data.DataLoader`, but is expected in any
              calculation involving the length of a :class:`~torch.utils.data.DataLoader`.
    """

    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError

    # NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
    #
    # Many times we have an abstract class representing a collection/iterable of
    # data, e.g., `torch.utils.data.Sampler`, with its subclasses optionally
    # implementing a `__len__` method. In such cases, we must make sure to not
    # provide a default implementation, because both straightforward default
    # implementations have their issues:
    #
    #   + `return NotImplemented`:
    #     Calling `len(subclass_instance)` raises:
    #       TypeError: 'NotImplementedType' object cannot be interpreted as an integer
    #
    #   + `raise NotImplementedError()`:
    #     This prevents triggering some fallback behavior. E.g., the built-in
    #     `list(X)` tries to call `len(X)` first, and executes a different code
    #     path if the method is not found or `NotImplemented` is returned, while
    #     raising a `NotImplementedError` will propagate and make the call
    #     fail where it could have used `__iter__` to complete the call.
    #
    # Thus, the only two sensible things to do are
    #
    #   + **not** provide a default `__len__`.
    #
    #   + raise a `TypeError` instead, which is what Python uses when users call
    #     a method that is not defined on an object.
    #     (@ssnl verifies that this works on at least Python 3.7.)

class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify :attr:`num_samples` to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
            is supposed to be specified only when `replacement` is ``True``.
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self._num_samples = num_samples

        if not isinstance(self.replacement, bool):
            raise TypeError("replacement should be a boolean value, but got "
                            "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return self.num_samples

class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.

    Arguments:
        indices (sequence): a sequence of indices
    """

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)

class WeightedRandomSampler(Sampler):
    r"""Samples elements from ``[0,..,len(weights)-1]`` with given probabilities (weights).

    Args:
        weights (sequence)   : a sequence of weights, not necessary summing up to one
        num_samples (int): number of samples to draw
        replacement (bool): if ``True``, samples are drawn with replacement.
            If not, they are drawn without replacement, which means that when a
            sample index is drawn for a row, it cannot be drawn again for that row.

    Example:
        >>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
        [4, 4, 1, 4, 5]
        >>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
        [0, 1, 4, 3, 2]
    """

    def __init__(self, weights, num_samples, replacement=True):
        if not isinstance(num_samples, _int_classes) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(replacement))
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.replacement = replacement

    def __iter__(self):
        return iter(torch.multinomial(self.weights, self.num_samples, self.replacement).tolist())

    def __len__(self):
        return self.num_samples

class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.

    Args:
        sampler (Sampler or Iterable): Base sampler. Can be any iterable object
            with ``__len__`` implemented.
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``

    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """

    def __init__(self, sampler, batch_size, drop_last):
        # Since collections.abc.Iterable does not check for `__getitem__`, which
        # is one way for an object to be an iterable, we don't do an `isinstance`
        # check here.
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integer value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
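
The NOTE in the Sampler source about not providing a default `__len__` can be reproduced without PyTorch. The three classes below are hypothetical and exist only for this sketch; the behaviour shown is that of CPython's built-in `list()`, which is assumed here to consult `len()` as a size hint before iterating.

```python
class NoLen:
    """Iterable that defines no __len__ at all."""

    def __iter__(self):
        return iter([1, 2, 3])


class RaisesTypeError:
    """Iterable whose __len__ raises TypeError."""

    def __iter__(self):
        return iter([1, 2, 3])

    def __len__(self):
        raise TypeError("length is not defined")


class RaisesNotImplemented:
    """Iterable whose __len__ raises NotImplementedError."""

    def __iter__(self):
        return iter([1, 2, 3])

    def __len__(self):
        raise NotImplementedError


# Both of these fall back to __iter__ and succeed.
print(list(NoLen()))            # [1, 2, 3]
print(list(RaisesTypeError()))  # [1, 2, 3]

# Here the exception from __len__ propagates instead of being ignored.
try:
    list(RaisesNotImplemented())
    propagated = False
except NotImplementedError:
    propagated = True
print(propagated)  # True
```

This is why the base class either omits `__len__` or would raise `TypeError`: both leave the iteration fallback intact.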