Learning PyTorch with Examples

Original: LEARNING PYTORCH WITH EXAMPLES
Author: Justin Johnson
Translator: Jerry
Date: 2019-01-22

PyTorch: Tensors

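NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately NumPy alone is not enough for deep learning.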

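Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a NumPy array: it is an n-dimensional array, and PyTorch provides many functions for operating on Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they are also useful as a generic tool for scientific computing.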
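
To make the Tensor/array correspondence concrete, here is a minimal sketch that runs the same matrix product in NumPy and in PyTorch and compares the results. The array contents and variable names are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
import torch

# A small NumPy array and a Tensor built from it hold the same values.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)          # on the CPU this shares memory with `a`

# The same n-dimensional operation exists on both sides.
np_result = a.dot(a.T)           # NumPy matrix product, shape (2, 2)
torch_result = t.mm(t.t())       # PyTorch matrix product, shape (2, 2)

# Convert the Tensor back to NumPy to compare the two results.
same = np.allclose(np_result, torch_result.numpy())
print(tuple(t.shape), same)
```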

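Unlike NumPy, PyTorch Tensors can use GPUs to accelerate their numerical computations. To run a Tensor on the GPU, you only need to create it on, or move it to, a GPU device; in the code below this means selecting a CUDA device instead of the CPU.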
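
As a rough sketch of what placing a Tensor on the GPU can look like, the snippet below picks a CUDA device when one is available and falls back to the CPU otherwise. The fallback logic and the tensor shapes are assumptions made for illustration.

```python
import torch

# Fall back to the CPU when no CUDA device is available (an assumption
# made here so the sketch runs on any machine).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3)                  # created on the CPU by default
x_dev = x.to(device)                   # copy onto the chosen device (no-op if already there)
w = torch.randn(3, 2, device=device)   # or create the Tensor directly on the device

y = x_dev.mm(w)                        # runs on whichever device holds both operands
print(y.device, tuple(y.shape))
```

Both operands of an operation have to live on the same device, which is why `x` is moved before the matrix product.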

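Here we use PyTorch Tensors to fit a two-layer network to random data. As with the NumPy version, we need to implement the forward and backward passes through the network by hand: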

import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Autograd

PyTorch: Tensors and autograd

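In the example above, we manually implemented both the forward and backward passes of our neural network. Implementing the backward pass by hand is not a big deal for a small network such as the two-layer one above, but it quickly becomes very hairy for large, complex networks.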

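Thankfully, we can use automatic differentiation to automate the backward pass in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets you easily compute gradients.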

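This sounds complicated, but it is quite simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor with x.requires_grad=True, then after a backward pass x.grad is another Tensor holding the gradient, evaluated at the current value of x, of the scalar that was backpropagated (typically the loss).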
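
A minimal sketch of this behaviour, using a toy function chosen here for illustration (the sum of squares, whose gradient is 2x):

```python
import torch

# x participates in the graph and asks autograd to track its gradient.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: y = sum(x_i ** 2), a scalar.
y = (x ** 2).sum()

# Backward pass: fills x.grad with dy/dx = 2 * x at the current value of x.
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```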

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass:

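import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which means we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute
# gradients with respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors.
    # These are the same operations as in the manual version, but we do not
    # need to keep references to intermediate values because we are not
    # implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss.
    # loss is a 0-dimensional (scalar) Tensor;
    # loss.item() returns the Python number it holds.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call computes the
    # gradient of loss with respect to all Tensors with requires_grad=True
    # that took part in computing it. After this call, w1.grad and w2.grad
    # hold the gradients of the loss with respect to w1 and w2.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because the weights have requires_grad=True, but we do not need to
    # track this update in autograd.
    # An alternative is to operate on weight.data and weight.grad.data,
    # which do not track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()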
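
One way to convince yourself that autograd computes the same quantities as the hand-written backward pass is to run both on the same data and compare. The sketch below does this under some assumptions of its own: smaller layer sizes, double precision, and a fixed seed, all chosen only to keep the comparison quick and repeatable.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 8, 20, 10, 5   # small illustrative sizes
x = torch.randn(N, D_in, dtype=torch.double)
y = torch.randn(N, D_out, dtype=torch.double)
w1 = torch.randn(D_in, H, dtype=torch.double, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.double, requires_grad=True)

# Forward pass, keeping intermediates so the manual backward can reuse them.
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Autograd backward pass: fills w1.grad and w2.grad.
loss.backward()

# Hand-written backward pass, identical to the Tensor-only example.
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

match_w1 = torch.allclose(w1.grad, grad_w1)
match_w2 = torch.allclose(w2.grad, grad_w2)
print(match_w1, match_w2)
```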

(The code from here on is straightforward, so its comments are left in the original English.)

PyTorch: Defining new autograd functions

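Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar with respect to the input Tensors.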

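In PyTorch we can define our own autograd operator by subclassing torch.autograd.Function and implementing the forward and backward functions. We can then use the new operator by calling its apply method like a function, passing in Tensors that contain the input data.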

In this example, we define our own custom autograd function for the ReLU nonlinearity and use it to implement our two-layer network:

import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
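
A custom backward function is easy to get subtly wrong, so it can help to compare it against numerical finite differences. The sketch below uses torch.autograd.gradcheck for that; the test inputs are an assumption chosen to stay away from 0, where ReLU is not differentiable, and the class is repeated only so the snippet is self-contained.

```python
import torch
from torch.autograd import gradcheck


class MyReLU(torch.autograd.Function):
    """Same ReLU Function as above, repeated so this sketch runs on its own."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck needs double precision; the values avoid the kink at 0.
inp = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.double, requires_grad=True)

# Compares the analytical gradient from backward() with finite differences.
ok = gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)
print(ok)
```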

TensorFlow: Static Graphs

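PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, while PyTorch uses dynamic computational graphs.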

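In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding it different input data each time. In PyTorch, each forward pass defines a new computational graph.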

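Static graphs are nice because the graph can be optimized up front. For example, a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you reuse the same graph over and over, this potentially costly up-front optimization can be amortized, because the same graph is rerun many times.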

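One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform a different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph the loop construct has to be part of the graph, which is why TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since the graph is built on the fly for each example, we can use ordinary imperative flow control to perform a computation that differs for each input.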
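
As a small sketch of dynamic control flow, the function below applies a weight a varying number of times with a plain Python loop, and autograd differentiates through whatever graph each call happened to build. The helper name repeated_scale and the scalar values are hypothetical, chosen so the gradients are easy to check by hand: the output is x * w**n, so its derivative with respect to w is n * x * w**(n - 1).

```python
import torch


def repeated_scale(x, w, n_steps):
    """Multiply x by w n_steps times using an ordinary Python loop."""
    h = x
    for _ in range(n_steps):
        h = h * w
    return h


w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

grads = []
for n_steps in (1, 2, 3):
    out = repeated_scale(x, w, n_steps)  # a new graph is built on every call
    out.backward()
    grads.append(w.grad.item())
    w.grad.zero_()                       # gradients accumulate, so reset them

print(grads)  # [3.0, 12.0, 36.0]
```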

nn module

PyTorch: nn

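Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.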

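When building neural networks, we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during training.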

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

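In PyTorch, the nn package serves the same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but it may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.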
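
To illustrate the internal state a Module carries, the sketch below inspects the parameters of a single Linear layer and checks a sum-reduced MSE loss against the explicit formula. The layer sizes and the all-zero target are assumptions made for the example.

```python
import torch

# A Linear Module owns its weight and bias as learnable parameters.
layer = torch.nn.Linear(3, 2)
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)  # {'weight': (2, 3), 'bias': (2,)}

# Calling the Module maps a batch of inputs to a batch of outputs.
x = torch.randn(4, 3)
out = layer(x)

# A loss function from nn; with reduction='sum' it matches the
# sum of squared errors written out by hand.
loss_fn = torch.nn.MSELoss(reduction='sum')
target = torch.zeros(4, 2)
loss = loss_fn(out, target)
manual = (out - target).pow(2).sum()
print(torch.allclose(loss, manual))
```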

In this example we use the nn package to implement our two-layer network:

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

PyTorch: optim

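Up to this point we have updated the weights of our models by manually mutating the Tensors that hold learnable parameters (using torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.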

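The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.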
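
To see what an optimizer does on our behalf, the sketch below takes one step of plain SGD on a toy objective and compares it with the manual update rule w <- w - lr * grad. The objective, starting values, and learning rate are assumptions picked so the arithmetic is easy to follow.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)   # tell the optimizer which Tensors to update

loss = (w ** 2).sum()
optimizer.zero_grad()   # clear any stale gradients
loss.backward()         # w.grad == 2 * w == [2, 4]
optimizer.step()        # w <- w - 0.1 * w.grad

print(w.detach())       # approximately [0.8, 1.6]
```

Swapping in another algorithm only changes the constructor line; the zero_grad, backward, step sequence stays the same.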

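In this example we use the nn package to define our model as before, but we optimize the model with the Adam algorithm provided by the optim package: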

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

PyTorch: Custom nn Modules

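Sometimes you will want to specify models that are more complex than a sequence of existing Modules. In these cases you can define your own Module by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.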

In this example we implement our two-layer network as a custom Module subclass:

import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
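
Assigning Modules as attributes in the constructor is what makes their parameters visible to model.parameters(), and therefore to the optimizer. A minimal sketch, repeating a condensed version of the class so it runs on its own, lists the registered parameters and counts them:

```python
import torch


class TwoLayerNet(torch.nn.Module):
    """Condensed copy of the model above, for a self-contained check."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))


model = TwoLayerNet(1000, 100, 10)

# Parameter names follow the attribute names used in the constructor.
names = [name for name, _ in model.named_parameters()]

# 1000*100 + 100 (linear1) plus 100*10 + 10 (linear2) scalars in total.
total = sum(p.numel() for p in model.parameters())
print(names, total)
```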

PyTorch: Control Flow + Weight Sharing

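As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that, on each forward pass, chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.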

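For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.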

We can implement this model as a Module subclass:

import random

import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
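
Weight sharing has a simple consequence for gradients: when the same Module is applied more than once, the contributions from every use are summed into the same parameter's gradient. The sketch below shows this with a hypothetical one-weight layer, fixed at w = 2 so the result can be checked by hand: y = w * (w * x), so dy/dw = 2 * w * x.

```python
import torch

# One Linear layer without bias, i.e. a single scalar weight w.
shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(2.0)

x = torch.tensor([[3.0]])
y = shared(shared(x))      # the same weights are used twice: y = w * (w * x) = 12
y.sum().backward()

# Both uses contribute to one gradient: 2 * w * x = 12.
print(shared.weight.grad.item())

# A container holding the same Module twice still reports its parameter once.
n_params = len(list(torch.nn.Sequential(shared, shared).parameters()))
print(n_params)
```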