PyTorch Study Notes (3)

2018-12-17

This post applies a character-level RNN to train a classifier that predicts which language a name comes from.

We will build and train a basic character-level RNN to classify words. A character-level RNN reads a word as a stream of characters, producing a prediction and a "hidden state" at each step and feeding the previous hidden state into the next step. We take the final prediction as the output, i.e., which class the word belongs to.

Concretely, we will train on a few thousand surnames from 18 languages and predict which language a name comes from based on its spelling.
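In code, that per-character recurrence boils down to a short loop. This is exactly the loop that appears later in train() and evaluate(); rnn and line_tensor are defined further below:

hidden = rnn.initHidden()                # start from an all-zero hidden state
for i in range(line_tensor.size()[0]):   # one step per character
    output, hidden = rnn(line_tensor[i], hidden)
# after the last character, `output` is the prediction for the whole name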

Recommended Reading

Before reading this post you should have PyTorch installed, know Python, and understand what a Tensor is.

Task Description

From a machine-learning perspective, this is a classification task. Specifically, we will train on a few thousand names from 18 languages and, given a name from the test set, predict which language it comes from. The training data used in this post can be downloaded here. It contains 18 text files named "[Language].txt", each holding one name per line. All the names in a file come from a single language (e.g. Chinese or English), so the 18 languages form the 18 classes. Each training sample is a (name, language) pair; once the model is trained, we feed in a name and predict which class (which language) it belongs to.

Processing Flow

PREPARING THE DATA

First we deal with the text encoding and put the categories into a Python list. Then the names for each category are read into a dictionary, which makes them convenient to use during training.

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

# print(findFiles('data/names/*.txt'))

import unicodedata
import string

# All English letters plus five punctuation marks " .,;'" -- 57 characters in total
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
# print(all_letters)

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        # NFC composes characters (a single code point where possible);
        # NFD decomposes them into base character + combining marks
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    # strip() removes the given characters (whitespace/newlines by default) from both ends of a string
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    # lines is a list of the form ['x1', 'x2', ..., 'xn']
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    # filename looks like data/names/xxx.txt
    # os.path.splitext(os.path.basename(filename)) returns ('xxx', '.txt')
    category = os.path.splitext(os.path.basename(filename))[0]
    # one of the 18 categories
    all_categories.append(category)
    lines = readLines(filename)
    # category_lines is a dict mapping category to a list of names, e.g. {'A': [...], 'B': [...]}
    # (note: both a Python list and an np.array can be indexed by position, but only
    # an np.array supports arithmetic over the whole array at once)
    category_lines[category] = lines

n_categories = len(all_categories)
print("all_letters: " + str(len(all_letters)))
print("n_categories: " + str(n_categories))
print("category_lines:" + str(category_lines['Chinese'][:5]))

Output:
all_letters: 57
n_categories: 18
category_lines:['Ang', 'AuYong', 'Bai', 'Ban', 'Bao']

Turning Names into Tensors

In PyTorch, we need to convert the name data into Tensors before the model can consume it. Here the smallest unit is the character, meaning every character in a name is treated as an independent token. To represent a character numerically we use a "one-hot vector": each character becomes a <1 x 57> vector. Since a name consists of several characters, each name becomes a <line_length x 1 x 57> tensor. The extra dimension of 1 exists because PyTorch assumes everything comes in batches; we simply use a batch size of 1 here. (The tensor has the form tensor([x, x, x, x]), where each x is itself of the form [[x, x, x, x]].)

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    # tensor is a list of lists, i.e. of the form [[...]]
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    # enumerate() pairs each element of an iterable (list, tuple, string, ...)
    # with its index, yielding (index, element) tuples -- handy in for loops
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))
print(lineToTensor('Jones').size())

Output:
tensor([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.]])
torch.Size([5, 1, 57])
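As a quick sanity check (my own addition, not part of the original tutorial), the first <1 x n_letters> slice of a line tensor should be exactly the one-hot vector of the line's first letter:

# The first slice of lineToTensor equals letterToTensor of the first character
print(torch.equal(lineToTensor('Jones')[0], letterToTensor('J')))  # True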

Creating the Network

Before autograd, creating a recurrent network in Torch involved cloning a layer's parameters across several timesteps. The layers held hidden state and gradients, which are now handled entirely by the graph itself. This means you can implement an RNN in a very "pure" way, as regular feed-forward layers.

The RNN module below consists of two linear layers that operate on the input and the hidden state, with a LogSoftmax layer after the output.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # torch.cat joins two Tensors; dim=1 concatenates along columns
        # (side by side), while dim=0 would stack along rows (vertically)
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

To run one step of this network we need to pass in an input (in our case, the tensor for the current letter) and a previous hidden state (initialized to zeros). We get back the output (the probability of each language) and the next hidden state (which we keep for the next step). Let's test it.

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)
print(output, next_hidden)

Output:
tensor([[-2.8834, -2.9605, -2.8716, -2.8836, -2.8435, -3.0421, -2.8215,
-2.8237, -2.9479, -2.9012, -2.9043, -2.7766, -2.9645, -2.8986,
-2.9966, -2.8645, -2.8830, -2.8002]]) tensor([[ 0.0349, 0.0630, 0.0390, -0.1138, -0.0121, 0.0225, 0.0151,
0.0150, -0.0606, 0.1185, -0.0688, -0.0340, 0.0115, -0.0540,
0.0141, -0.0232, 0.0322, -0.0111, 0.0185, -0.0777, -0.0144,
0.0070, -0.0806, 0.0173, -0.0764, -0.0582, 0.0117, 0.0373,
0.0358, -0.0730, -0.1175, 0.0121, 0.0851, 0.0514, 0.0251,
-0.0029, -0.0581, -0.0684, -0.0295, -0.0176, -0.0717, -0.0114,
0.1108, -0.0850, -0.0092, 0.0557, -0.0428, 0.0215, 0.0270,
-0.0594, -0.0791, -0.0117, 0.0963, -0.0552, 0.0348, 0.0199,
-0.1099, -0.0455, -0.0050, 0.0466, 0.0120, -0.0765, 0.0904,
0.0951, 0.0350, 0.0016, 0.0220, -0.1223, 0.0892, 0.0187,
-0.0113, 0.0333, -0.0876, 0.0420, -0.0724, -0.0900, 0.0470,
0.1084, -0.0746, 0.0001, 0.0609, -0.0043, -0.0224, 0.0867,
-0.0062, 0.0764, -0.0261, 0.0413, 0.0657, -0.0280, -0.0594,
-0.0602, 0.0622, 0.1100, 0.0487, -0.0218, 0.0676, 0.0755,
-0.0052, 0.0319, 0.0267, -0.0070, 0.0734, -0.0206, 0.0219,
-0.0242, 0.0549, -0.0286, 0.0876, 0.0802, -0.0669, -0.0160,
-0.1227, 0.1216, 0.0064, -0.0975, -0.0887, 0.0880, -0.0095,
-0.0132, 0.0181, -0.0699, 0.0252, -0.0033, -0.0169, 0.0660,
-0.0262, 0.0060]])

For efficiency we don't want to create a new Tensor for every step, so we use lineToTensor instead of letterToTensor and work with slices. This could be further optimized by precomputing batches of Tensors.

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

Output:
tensor([[-2.8209, -2.9620, -2.7678, -2.9134, -2.9096, -2.9340, -2.9292,
-2.8248, -2.9066, -2.8925, -2.8650, -2.8142, -3.0028, -2.8984,
-2.8664, -2.9377, -2.8813, -2.9289]])

As you can see, the output is a <1 x n_categories> tensor where every item is the likelihood of that category (higher is more likely). For more on the log-likelihood loss function, see this article.
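Since the network ends in LogSoftmax, these values are log-probabilities; exponentiating them recovers ordinary probabilities. A small check (my example, not from the original notes):

probs = output.exp()     # back from log-probabilities to probabilities
print(probs.sum())       # ≈ tensor(1.), each row sums to 1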

Training

Preparing for Training

Before training we should write a few helper functions. The first interprets the network's output: each value in the tensor is the likelihood that the input belongs to that category. We can use Tensor.topk to get the index of the greatest value:

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))

Output:
('Dutch', 3)

The official tutorial also provides a function for quickly fetching a training example:

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

Output:
category = Greek / line = Papadelias
category = Greek / line = Demakis
category = German / line = Siskind
category = Russian / line = Abduloff
category = Russian / line = Kaberman
category = Vietnamese / line = Ngo
category = Chinese / line = Jue
category = Scottish / line = Kerr
category = English / line = Eastwood
category = Vietnamese / line = Thuy

Training the Network

Now all it takes to train this network is to show it a bunch of training examples, have it make guesses, and tell it when it's wrong so it can correct itself.
The loss function we use here is nn.NLLLoss, since the last layer of the RNN is nn.LogSoftmax.

criterion = nn.NLLLoss()
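NLLLoss expects log-probabilities as input, which is exactly why the RNN's last layer is LogSoftmax. The combination LogSoftmax + NLLLoss is mathematically the same as applying nn.CrossEntropyLoss to raw scores; a minimal check with made-up numbers (my example, not from the original notes):

# Hypothetical raw scores for one sample over 18 classes, and a made-up target
logits = torch.randn(1, 18)
target = torch.tensor([3])

nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(torch.allclose(nll, ce))  # True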

Each round of training involves the following steps:

  1. Create the input and target tensors
  2. Create a zeroed initial hidden state
  3. Read in each letter and keep the hidden state, feeding it together with the next letter into the next step
  4. Compare the final output to the target
  5. Back-propagate (and update the parameters)
  6. Return the output and the loss
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    # line_tensor is the input tensor; category_tensor is the target tensor

    # Create a zeroed initial hidden state
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    # (in newer PyTorch the equivalent spelling is p.data.add_(p.grad.data, alpha=-learning_rate))
    for p in rnn.parameters():
        p.data.add_(-learning_rate, p.grad.data)

    return output, loss.item()
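The manual parameter update at the end of train() is just plain SGD written by hand. For reference, here is an equivalent sketch using torch.optim (my variant, not from the original notes; train_with_optimizer is a name I made up):

import torch.optim as optim

optimizer = optim.SGD(rnn.parameters(), lr=learning_rate)

def train_with_optimizer(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    optimizer.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()  # performs the same update as the manual loop above

    return output, loss.item()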

Test it:

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

Output:
5000 5% (0m 7s) 2.7471 Eckstein / Russian ✗ (German)
10000 10% (0m 15s) 2.3858 Akera / Spanish ✗ (Japanese)
15000 15% (0m 23s) 2.1613 Borde / English ✗ (French)
20000 20% (0m 31s) 0.8195 Kerner / German ✓
25000 25% (0m 40s) 2.1763 Gately / French ✗ (English)
30000 30% (0m 49s) 1.1153 Didrikil / Russian ✓
35000 35% (0m 57s) 0.5463 Yim / Korean ✓
40000 40% (1m 5s) 0.5045 Stevenson / Scottish ✓
45000 45% (1m 13s) 0.9374 Polymenakou / Greek ✓
50000 50% (1m 20s) 1.9215 Prchal / Irish ✗ (Czech)
55000 55% (1m 28s) 1.3433 Maciomhair / Irish ✓
60000 60% (1m 36s) 2.0990 Chemlik / Scottish ✗ (Czech)
65000 65% (1m 44s) 0.2319 Seghers / Dutch ✓
70000 70% (1m 52s) 1.3295 Mcmahon / Irish ✓
75000 75% (2m 0s) 2.3238 Cruz / Spanish ✗ (Portuguese)
80000 80% (2m 7s) 2.3112 William / Scottish ✗ (Irish)
85000 85% (2m 15s) 1.3948 Brown / Irish ✗ (Scottish)
90000 90% (2m 23s) 0.4818 Oberti / Italian ✓
95000 95% (2m 31s) 0.1396 Ukiyo / Japanese ✓
100000 100% (2m 39s) 2.3235 Shahin / Arabic ✗ (Russian)

Plotting the Results

Next we plot all_losses to visualize how the loss changes over the course of training.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)
plt.show()

Evaluating the Results

To see how well the network performs across categories, we create a confusion matrix whose rows are the actual languages and whose columns are the languages the network guessed. To compute it we run a batch of samples through the network with an evaluate() function, which is the same as train() minus the backward pass.

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()


You can pick out bright spots off the main diagonal that show which languages the network guesses incorrectly, e.g. Chinese mistaken for Korean and Portuguese for Spanish. It seems to do very well with Greek and very poorly with English (perhaps because of overlap with other languages).
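To put numbers on those bright and dark spots, the diagonal of the row-normalized confusion matrix is each language's per-category accuracy. A small sketch reusing the confusion tensor from above (my addition):

# Each diagonal entry is the fraction of that language's samples guessed correctly
for i, name in enumerate(all_categories):
    print('%-12s %.2f' % (name, confusion[i][i].item()))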

Running on User Input

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # output.topk(k, 1, True) returns the k largest values along dim 1
        # together with their indices, as a (values, indices) tuple
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

Output:
> Dovesky
(-0.32) Russian
(-1.68) Czech
(-2.80) English

> Jackson
(-0.67) Scottish
(-1.17) Russian
(-2.38) English

> Satoshi
(-1.15) Arabic
(-1.74) Polish
(-2.13) Italian
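To actually run this on user input from the command line, a minimal wrapper might look like the following (an assumption on my part, modeled on the script version in the official sample code):

import sys

if __name__ == '__main__':
    # e.g.  python predict.py Hazaki
    predict(sys.argv[1])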

The official sample code can be found on GitHub.

Source: the personal website of 「想飞的小菜鸡」, vodkazy.cn

Copyright notice: this is an original article by 「想飞的小菜鸡」, published under the BY-NC-SA license. When reposting, please include a link to the original and this notice.

Original link: https://vodkazy.cn/2018/12/17/Pytorch学习笔记(三)
