The Interview Prep Series: A Close Reading of the BERT Source Code

2020-10-14

A close reading of the BERT-pytorch source code, with module-by-module analysis and a few thoughts along the way.

Source: https://github.com/codertimo/BERT-pytorch. After going through it once, I found the repo actually has quite a few bugs; the author later fixed some of them in https://github.com/codertimo/BERT-pytorch/tree/alpha0.0.1a5. This article draws on both versions for its analysis.

Repository Layout

Throughout the rest of this article, paths are given relative to the bert_pytorch directory.

├──── bert_pytorch/
│     ├──── __main__.py
│     │
│     ├──── dataset/
│     │     ├──── dataset.py
│     │     └──── vocab.py
│     │
│     ├──── model/
│     │     │
│     │     ├──── attention/
│     │     │     ├──── multi_head.py
│     │     │     └──── single.py
│     │     │
│     │     ├──── embedding/
│     │     │     ├──── bert.py
│     │     │     ├──── position.py
│     │     │     ├──── segment.py
│     │     │     └──── token.py
│     │     │
│     │     ├──── utils/
│     │     │     ├──── feed_forward.py
│     │     │     ├──── gelu.py
│     │     │     ├──── layer_norm.py
│     │     │     └──── sublayer.py
│     │     │
│     │     ├──── bert.py
│     │     ├──── language_model.py
│     │     └──── transformer.py
│     │
│     └──── trainer/
│           ├──── optim_schedule.py
│           └──── pretrain.py
│
├──── LICENSE
├──── Makefile
├──── README.md
├──── requirements.txt
└──── setup.py

Parsing Command-Line Arguments

The main entry point, __main__.py, starts by parsing command-line arguments with argparse.ArgumentParser(); each add_argument call sets options such as type, required, default, and help.

# __main__.py, lines 11-38

parser = argparse.ArgumentParser()

parser.add_argument("-c", "--train_dataset", required=True, type=str, help="train dataset for train bert")
parser.add_argument("-t", "--test_dataset", type=str, default=None, help="test set for evaluate train set")
parser.add_argument("-v", "--vocab_path", required=True, type=str, help="built vocab model path with bert-vocab")
parser.add_argument("-o", "--output_path", required=True, type=str, help="ex)output/bert.model")

parser.add_argument("-hs", "--hidden", type=int, default=256, help="hidden size of transformer model")
parser.add_argument("-l", "--layers", type=int, default=8, help="number of layers")
parser.add_argument("-a", "--attn_heads", type=int, default=8, help="number of attention heads")
parser.add_argument("-s", "--seq_len", type=int, default=20, help="maximum sequence len")
parser.add_argument("-d", "--dropout", type=float, default=0.1, help="dropout rate")

parser.add_argument("-b", "--batch_size", type=int, default=64, help="number of batch_size")
parser.add_argument("-e", "--epochs", type=int, default=10, help="number of epochs")
parser.add_argument("-w", "--num_workers", type=int, default=5, help="dataloader worker size")

parser.add_argument("--with_cuda", type=bool, default=True, help="training with CUDA: true, or false")
parser.add_argument("--log_freq", type=int, default=10, help="printing loss every n iter: setting n")
parser.add_argument("--corpus_lines", type=int, default=None, help="total number of lines in corpus")
parser.add_argument("--cuda_devices", type=int, nargs='+', default=None, help="CUDA device ids")
parser.add_argument("--on_memory", type=bool, default=False, help="Loading on memory: true or false")

parser.add_argument("--lr", type=float, default=1e-3, help="learning rate of adam")
parser.add_argument("--adam_weight_decay", type=float, default=0.01, help="weight_decay of adam")
parser.add_argument("--adam_beta1", type=float, default=0.9, help="adam first beta value")
parser.add_argument("--adam_beta2", type=float, default=0.999, help="adam second beta value")

args = parser.parse_args()

Loading the Vocabulary

vocab.py contains three classes, TorchVocab, Vocab, and WordVocab, each inheriting from the previous one. The overall approach to building the vocabulary is to use a Counter from collections to count tokens, sort them by frequency from high to low, and build the token-to-index mapping; construction stops once the dictionary reaches the target vocabulary size or the current token's frequency falls below the configured minimum.

# __main__.py, lines 40-42

print("Loading Vocab", args.vocab_path)
vocab = WordVocab.load_vocab(args.vocab_path)
print("Vocab Size: ", len(vocab))

TorchVocab has three attributes: freqs records how many times each token appears in the corpus; stoi is a defaultdict mapping token strings to integer indices; itos is a list in which each index maps back to a token string. Its constructor, __init__(self, counter, max_size=None, min_freq=1, specials=['<pad>', '<oov>'], vectors=None, unk_init=None, vectors_cache=None), first adds the special tokens in specials to the vocabulary, then sorts itos by token frequency; tokens that appear fewer than min_freq times are not added. It also accepts settings for vector initialization, unknown-token initialization, and vector caching. The vocab_rerank method rebuilds stoi from the already-sorted itos, and extend grows the vocabulary from a list of strings such as [a, b, c].

Vocab adds five special tokens at initialization, <pad>, <unk>, <eos>, <sos>, and <mask>, mapped to indices 0 through 4. Here <sos> plays the role of [CLS] and <eos> the role of [SEP]. to_seq and from_seq convert between token sequences and index sequences, while load_vocab and save_vocab load and save the vocabulary from/to a file.

WordVocab's __init__(self, texts, max_size=None, min_freq=1) first builds a Counter over the token counts in texts, then calls the parent Vocab constructor with that counter. It also overrides to_seq, from_seq, and load_vocab; its to_seq can automatically add <sos> and <eos> around the sequence.
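
To make the frequency-sorted construction concrete, here is a minimal standalone sketch of the idea; the build_vocab helper below is illustrative only and is not the repo's WordVocab class:

# Minimal sketch (illustrative, not the repo's code) of the Counter-based vocabulary build.
from collections import Counter

def build_vocab(texts, max_size=None, min_freq=1,
                specials=('<pad>', '<unk>', '<eos>', '<sos>', '<mask>')):
    counter = Counter(tok for line in texts for tok in line.split())
    itos = list(specials)                       # special tokens occupy indices 0..4
    # sort by frequency (descending), breaking ties alphabetically
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq < min_freq or (max_size and len(itos) >= max_size):
            break                               # stop when too rare or the vocab is full
        itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab(["hello world hello", "bert source code"])
print(len(itos), stoi['hello'])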

Loading the Training and Test Sets

# __main__.py, lines 44-55

print("Loading Train Dataset", args.train_dataset)
train_dataset = BERTDataset(args.train_dataset, vocab, seq_len=args.seq_len, corpus_lines=args.corpus_lines, on_memory=args.on_memory)

print("Loading Test Dataset", args.test_dataset)
test_dataset = BERTDataset(args.test_dataset, vocab, seq_len=args.seq_len, on_memory=args.on_memory) if args.test_dataset is not None else None

print("Creating Dataloader")
train_data_loader = DataLoader(train_dataset, batch_size=args.batch_size, num_workers=args.num_workers)
test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size, num_workers=args.num_workers) if test_dataset is not None else None

BERTDataset

BERTDataset subclasses torch.utils.data.Dataset. The random sampling for both of BERT's pre-training tasks (randomly masking tokens and randomly sampling the next sentence) is actually done while the dataset is being read. __init__() sets up variables such as the corpus path corpus_path, the vocabulary vocab, the maximum sequence length seq_len, whether to load everything into memory on_memory, the number of lines in the corpus corpus_lines, and the text encoding. __len__ and __getitem__ are the usual Dataset overrides, returning the dataset size and a single example respectively.

random_word(sentence) performs the random token masking. It takes a whitespace-tokenized sentence and returns a list of token ids together with a list of labels:

# dataset/dataset.py, lines 63-90

tokens = sentence.split()
output_label = []
for i, token in enumerate(tokens):
    # Note: this "15%" is not quite what the paper does: the paper selects 15% of the tokens
    # in the corpus, whereas here each token independently has a 15% chance of being selected.
    prob = random.random()
    if prob < 0.15:
        # ~15% of tokens are chosen for masking
        prob /= 0.15
        if prob < 0.8:
            # 80% of the chosen tokens become <mask>
            tokens[i] = self.vocab.mask_index
        elif prob < 0.9:
            # 10% are replaced with a random token
            tokens[i] = random.randrange(len(self.vocab))
        else:
            # 10% keep the original token (falling back to unk_index if it is not in stoi)
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
        output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))
    else:
        tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
        output_label.append(0)
return tokens, output_label

random_sent(index) does the random next-sentence sampling: given a line index, it returns sentence 1, sentence 2, and the is-next label:

# dataset/dataset.py, lines 92-99

t1, t2 = self.get_corpus_line(index)
# output_text, label (isNotNext: 0, isNext: 1)
if random.random() > 0.5:
    return t1, t2, 1
else:
    return t1, self.get_random_line(), 0

So, given an index, how does __getitem__(index) assemble one training example?

# dataset/dataset.py, lines 37-61

t1, t2, is_next_label = self.random_sent(index)    # randomly sample s1, s2
t1_random, t1_label = self.random_word(t1)         # randomly mask tokens in s1
t2_random, t2_label = self.random_word(t2)         # randomly mask tokens in s2

# [CLS] tag = SOS tag, [SEP] tag = EOS tag
t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
t2 = t2_random + [self.vocab.eos_index]            # add the special head/tail tokens

t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
t2_label = t2_label + [self.vocab.pad_index]       # the MLM labels get matching positions

segment_label = ([1 for _ in range(len(t1))] + [2 for _ in range(len(t2))])[:self.seq_len]  # segment ids: 1 for tokens from s1, 2 for tokens from s2
bert_input = (t1 + t2)[:self.seq_len]              # concatenate the two sentences into one sequence
bert_label = (t1_label + t2_label)[:self.seq_len]  # concatenate the MLM labels the same way

padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]  # pad up to seq_len with pad_index
bert_input.extend(padding), bert_label.extend(padding), segment_label.extend(padding)

output = {"bert_input": bert_input,
          "bert_label": bert_label,
          "segment_label": segment_label,
          "is_next": is_next_label}

return {key: torch.tensor(value) for key, value in output.items()}

Beyond that, get_corpus_line(item) fetches one corpus pair (s1, s2), and get_random_line() fetches a standalone sentence to serve as a random s2; neither needs further explanation.

Wrapping the datasets with DataLoader(dataset, batch_size, num_workers) is standard practice, so I won't go into it.

Building the BERT Model

BERTEmbedding

BERTEmbedding combines three embeddings: TokenEmbedding, PositionalEmbedding, and SegmentEmbedding.

# model/embedding/token.py

class TokenEmbedding(nn.Embedding):
    def __init__(self, vocab_size, embed_size=512):
        super().__init__(vocab_size, embed_size, padding_idx=0)


# model/embedding/segment.py
class SegmentEmbedding(nn.Embedding):
    def __init__(self, embed_size=512):
        super().__init__(3, embed_size, padding_idx=0)  # indices 0, 1, 2 correspond to padding, sentence A, sentence B


# model/embedding/position.py (the original Transformer formulation: a fixed, non-learned positional encoding)
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=512):  # d_model is effectively embed_size
        super().__init__()
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False
        # The lines below are worth understanding; they do come up in interviews.
        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
        odd_len = d_model - div_term.size(-1)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term[:odd_len])  # the upstream code mishandles odd d_model; this is a corrected version
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return self.pe[:, :x.size(1)]


# model/embedding/position.py (the current version: it subclasses nn.Embedding directly, so BERT learns a vector for each position 0, 1, 2, 3, ...)
class PositionalEmbedding(nn.Embedding):
    def __init__(self, d_model, max_len=512):
        super().__init__(max_len, d_model)

    def forward(self, x):
        return self.weight.data[:x.size(1)]

One thing to note: BERT's position embedding is randomly initialized and then trained, so the model learns a vector for each position index 0, 1, 2, 3, ..., whereas the original Transformer's positional encoding is computed once from fixed sin/cos functions and never updated. This is actually one of the biggest differences between the original Transformer and BERT. The Transformer's sinusoidal encoding has a mathematical motivation (see the references); the formula is:
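
$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

where $pos$ is the position index and $i$ indexes the embedding dimension; this is exactly what the position * div_term computation above evaluates, with div_term built through exp/log for numerical convenience.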

The full BERTEmbedding simply sums the three embeddings. As for why adding them element-wise makes sense at all, there is a well-known Zhihu discussion on the question. The implementation is as follows:

# model/embedding/bert.py

class BERTEmbedding(nn.Module):
    """
    BERT Embedding, which consists of the following features:
        1. TokenEmbedding : normal embedding matrix
        2. PositionalEmbedding : adding positional information
        3. SegmentEmbedding : adding sentence segment info (sent_A: 1, sent_B: 2)
    The sum of these features is the output of BERTEmbedding.
    """

    def __init__(self, vocab_size, embed_size, dropout=0.1):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        :param dropout: dropout rate
        """
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size

    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)  # dropout right after the embedding layer to reduce overfitting

Transformer △

The Transformer is the core of BERT and a guaranteed interview topic. Since BERT is the Transformer's encoder, we only look at the encoder part here. Each block consists of two main pieces: a MultiHeadAttention layer, and an FFN wrapped with a residual connection and LayerNorm. It helps to keep the standard encoder-block diagram at hand while reading this section.

Let's start with the attention part. Multi-head attention is built out of many single (scaled dot-product) attention units, which both the encoder and the decoder use heavily. Because the Transformer encoder takes a single input sequence, all of the attention used here is self-attention.

Single Attention

Computing attention takes three steps. First, compute a similarity score between the query and every key to obtain weights; common similarity functions include the dot product, concatenation, and a small perceptron. Second, normalize these weights with a softmax. Finally, take the weighted sum of the corresponding values to get the attention output. In much current NLP work, the keys and values are the same thing, i.e. key = value.

The scaled dot product used here computes the dot product between Q and K and, to keep the result from growing too large, divides by the scale factor $\sqrt{d_k}$, where $d_k$ is the dimensionality of a query/key vector. Since this is self-attention, $d_{query}=d_{key}=d_{value}$. The scores are then normalized into a probability distribution with a softmax and multiplied by V to obtain the weighted-sum representation.
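
Written out, this is the standard scaled dot-product attention formula, matching the code below:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$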

# model/attention/single.py

class Attention(nn.Module):
    """
    Compute Scaled Dot-Product Attention
    """
    def forward(self, query, key, value, mask=None, dropout=None):
        """
        query, key and value come from the same source and share the same shape:
            query: [batch_size, head_num, seq_len, dim]
            key:   [batch_size, head_num, seq_len, dim]
            value: [batch_size, head_num, seq_len, dim]
        """
        scores = torch.matmul(query, key.transpose(-2, -1)) \
                 / math.sqrt(query.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        p_attn = F.softmax(scores, dim=-1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn

Multi Head Attention

Multi-head attention is essentially several single-attention units combined: each head learns features in a different representation subspace, so using h heads captures more information than a single head. Concretely, Q, K, and V each first go through their own linear projection (the projection matrices are different), and the results are fed into scaled dot-product attention; this is done h times, once per head, and the projection parameters W differ for each head. The h scaled dot-product attention outputs are then concatenated and passed through one more linear layer to give the multi-head attention result. (A quick shape trace follows the code below.)

# model/attention/multi_head.py

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        """
        :param h: number of attention heads
        :param d_model: hidden_size
        """
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h  # we assume d_v always equals d_k
        self.h = h
        self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.output_linear = nn.Linear(d_model, d_model)
        self.attention = Attention()

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]
        # 2) Apply attention on all the projected vectors in batch.
        x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        # 4) Apply the output linear layer.
        x = self.output_linear(x)
        return x
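
As a quick shape trace of the projection and reshape above (illustrative numbers only, matching this repo's defaults of hidden=256 and 8 heads):

# Shape trace (illustrative, not part of the repo) for the multi-head reshaping above.
import torch
import torch.nn as nn

batch_size, seq_len, h, d_model = 2, 20, 8, 256
d_k = d_model // h                              # 32
x = torch.randn(batch_size, seq_len, d_model)   # [2, 20, 256]
proj = nn.Linear(d_model, d_model)

q = proj(x).view(batch_size, -1, h, d_k).transpose(1, 2)
print(q.shape)  # torch.Size([2, 8, 20, 32]) -> [batch, heads, seq_len, d_k]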

SublayerConnection

This part implements Add & Normalize: the residual connection plus LayerNorm. Following how the transformers package implements it, the order is dropout, then add, then norm. Note that the attention sublayer and the FFN sublayer each get their own SublayerConnection rather than sharing parameters. One question I had while reading this code: why use the hand-rolled LayerNorm from The Annotated Transformer (the annotated "Attention Is All You Need") instead of the standard torch.nn.LayerNorm? It turns out, per an issue in the repo, that the difference is negligible and either works; the Annotated Transformer version yields a grad_fn of ThAddBackward while torch.nn.LayerNorm yields AddcmulBackward, and although the outputs differ slightly, both are properly normalized. (A quick numerical comparison follows the code below.)

# model/utils/layer_norm.py

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


# model/utils/sublayer.py
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer, dropout=True):  # sublayer is passed as a callable (e.g. a lambda wrapping a module's forward)
        "Apply a residual connection to any sublayer with the same size."
        out = self.dropout(sublayer(x)) if dropout else sublayer(x)
        return self.norm(x + out)  # dropout, then add (residual), then norm
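
To see that the two LayerNorm variants behave almost identically, here is a quick self-contained comparison (my own check, not part of the repo):

# Quick check (not in the repo): the hand-rolled normalization above vs torch.nn.LayerNorm.
import torch
import torch.nn as nn

features, eps = 8, 1e-6
x = torch.randn(4, 16, features)

# what the custom LayerNorm computes (with a_2 = 1, b_2 = 0): unbiased std, eps added to the std
custom = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + eps)
builtin = nn.LayerNorm(features, eps=eps)(x)   # biased variance, eps added to the variance

print((custom - builtin).abs().max())  # nonzero but modest; both outputs are normalized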

FeedForwardNetwork

The output of the attention sublayer is then fed to the FFN, which here is just two linear layers with a GELU activation in between (the upstream repo's version also applies a dropout inside); the FFN output then goes through the second SublayerConnection.

# model/utils/feed_forward.py

class FeedForward(nn.Module):
    "Implements the FFN equation."
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.activation = GELU()

    def forward(self, x):
        x = self.w_1(x)
        x = self.activation(x)
        x = self.w_2(x)
        return x

Putting the Transformer Block Together

The final step simply chains the pieces above together and applies a dropout at the end.

# model/transformer.py

class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """
    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: number of attention heads
        :param feed_forward_hidden: FFN hidden size, usually 4 * hidden
        :param dropout: dropout rate
        """
        super().__init__()
        self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden, dropout=dropout)
        self.feed_forward = FeedForward(d_model=hidden, d_ff=feed_forward_hidden)
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, lambda _x: self.feed_forward.forward(_x))
        return self.dropout(x)  # final dropout at the end of the block

BERTModel

Stack everything above and you get the BERT model. Its inputs are the token id sequence and the segment label sequence.

# model/bert.py

class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """
    def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
        """
        :param vocab_size: vocab_size of total words
        :param hidden: BERT model hidden size
        :param n_layers: number of Transformer blocks (layers)
        :param attn_heads: number of attention heads
        :param dropout: dropout rate
        """
        super().__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads
        # the original paper sets the FFN hidden size to 4x the model hidden size
        self.feed_forward_hidden = hidden * 4
        self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden)
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden=hidden, attn_heads=attn_heads,
                              feed_forward_hidden=hidden * 4, dropout=dropout)
             for _ in range(n_layers)])

    def forward(self, x, segment_info):
        # attention masking for padded tokens
        # shape: [batch_size, 1, seq_len, seq_len]
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)  # 0 at padding positions, 1 elsewhere
        # embedding the indexed sequence to a sequence of vectors
        x = self.embedding(x, segment_info)
        # running over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)
        return x
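
To make the mask construction concrete, here is a tiny standalone check (illustrative, with a made-up 2x4 batch) of the shape it produces and how it lines up with the attention scores:

# Tiny illustration (not in the repo) of the padding-mask construction above.
import torch

x = torch.tensor([[5, 7, 9, 0],     # 0 is the pad index
                  [3, 0, 0, 0]])    # batch_size=2, seq_len=4
mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
print(mask.shape)  # torch.Size([2, 1, 4, 4]) -> broadcasts over the head dimension of the scores
print(mask[1, 0])  # every row marks the padded key positions of sequence 1 as False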

Pre-training

NSP & MLM

There are two training tasks. For NSP, a linear layer plus a softmax on top of BERT does binary classification; for MLM, a linear layer plus a softmax does vocab_size-way classification. (The transformers package instead implements the MLM head as linear(hidden, hidden) + activation + LayerNorm + linear(hidden, vocab_size), effectively a two-layer MLP classifier; a sketch of that head follows the code below.)

# model/language_model.py

class NextSentencePrediction(nn.Module):
    """
    2-class classification model : is_next, is_not_next
    """
    def __init__(self, hidden):
        """
        :param hidden: BERT model output size
        """
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.linear(x[:, 0]).tanh())


class MaskedLanguageModel(nn.Module):
    """
    predicting origin token from masked input sequence
    n-class classification problem, n-class = vocab_size
    """
    def __init__(self, hidden, vocab_size, embedding=None):
        """
        :param hidden: output size of BERT model
        :param vocab_size: total vocab size
        """
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        if embedding is not None:
            self.linear.weight.data = embedding.weight.data
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.linear(x))


class BERTLM(nn.Module):
    """
    BERT Language Model
    Next Sentence Prediction Model + Masked Language Model
    """
    def __init__(self, bert: BERT, vocab_size):
        """
        :param bert: BERT model which should be trained
        :param vocab_size: total vocab size for masked_lm
        """
        super().__init__()
        self.bert = bert
        self.next_sentence = NextSentencePrediction(self.bert.hidden)
        self.mask_lm = MaskedLanguageModel(self.bert.hidden, vocab_size,
                                           embedding=self.bert.embedding.token)

    def forward(self, x, segment_label):
        x = self.bert(x, segment_label)
        return self.next_sentence(x), self.mask_lm(x)
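
For comparison, the two-layer MLM head mentioned above (as in the transformers package) looks roughly like this; this is my paraphrase of that design, not code from either repo:

# Rough sketch (not from this repo) of the two-layer MLM head used by the transformers package:
# linear(hidden, hidden) -> activation -> LayerNorm -> linear(hidden, vocab_size).
import torch.nn as nn

class TwoLayerMLMHead(nn.Module):
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),                       # transformers uses a GELU activation here
            nn.LayerNorm(hidden, eps=1e-12),
        )
        self.decoder = nn.Linear(hidden, vocab_size)  # often weight-tied to the token embedding

    def forward(self, x):                    # x: [batch, seq_len, hidden]
        return self.decoder(self.transform(x))       # logits over the vocabulary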

Trainer

One thing to note here: both pre-training tasks use the NLL loss. For the MLM task, label 0 (which marks <pad> and unmasked positions) is skipped via ignore_index; for NSP, however, the labels are 0 or 1, so 0 must not be ignored there, because it is the genuine "not next" class. The optimizer is AdamW, which the repo author implemented by hand rather than using an off-the-shelf version.
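
As a tiny illustration of what ignore_index=0 does to the MLM loss (illustrative values only, not from the repo):

# Illustrative check: ignore_index=0 drops <pad>/unmasked positions from the MLM loss.
import torch
import torch.nn as nn

vocab_size = 10
logits = torch.randn(1, 5, vocab_size)       # [batch, seq_len, vocab]
targets = torch.tensor([[0, 0, 7, 0, 3]])    # 0 = <pad>/not masked; the others are real MLM labels

masked_criterion = nn.NLLLoss(ignore_index=0)
log_probs = torch.log_softmax(logits, dim=-1)
loss = masked_criterion(log_probs.transpose(1, 2), targets)  # same transpose as in the trainer below
print(loss)  # averaged over the two supervised positions only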

# trainer/pretrain.py

class BERTTrainer:
    """
    BERTTrainer builds the pretrained BERT model with two LM training methods.
    1. Masked Language Model : 3.3.1 Task #1: Masked LM
    2. Next Sentence Prediction : 3.3.2 Task #2: Next Sentence Prediction
    please check the details on README.md with a simple example.
    """

    def __init__(self, bert: BERT, vocab_size: int,
                 train_dataloader: DataLoader, test_dataloader: DataLoader = None,
                 lr: float = 1e-4, betas=(0.9, 0.999), weight_decay: float = 0.01, warmup_steps=10000,
                 with_cuda: bool = True, cuda_devices=None, log_freq: int = 10):
        """
        :param bert: the BERT model to train
        :param vocab_size: vocabulary size
        :param train_dataloader: train dataset DataLoader
        :param test_dataloader: test dataset DataLoader [can be None]
        :param lr: learning rate
        :param betas: Adam optimizer betas
        :param weight_decay: Adam optimizer weight decay param
        :param with_cuda: whether to train with CUDA
        :param log_freq: logging frequency of the batch iteration
        """

        # Decide whether to use the GPU (--with_cuda must be true and CUDA available)
        cuda_condition = torch.cuda.is_available() and with_cuda
        self.device = torch.device("cuda:0" if cuda_condition else "cpu")

        # This BERT model will be saved every epoch
        self.bert = bert
        # Initialize the BERT Language Model with the BERT model
        self.model = BERTLM(bert, vocab_size).to(self.device)

        # Distributed GPU training if CUDA can detect more than 1 GPU
        if with_cuda and torch.cuda.device_count() > 1:
            print("Using %d GPUS for BERT" % torch.cuda.device_count())
            self.model = nn.DataParallel(self.model, device_ids=cuda_devices)

        # Setting the train and test data loaders
        self.train_data = train_dataloader
        self.test_data = test_dataloader

        # Setting the Adam optimizer with hyper-params
        self.optim = AdamW(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        # self.optim_schedule = ScheduledOptim(self.optim, self.bert.hidden, n_warmup_steps=warmup_steps)

        # Both tasks use the NLL loss. For MLM, label 0 (<pad>/unmasked positions) is skipped via
        # ignore_index; for NSP the labels are 0 or 1, so 0 must not be ignored there.
        # Using the Negative Log Likelihood Loss function for predicting the masked tokens
        self.masked_criterion = nn.NLLLoss(ignore_index=0)
        self.next_criterion = nn.NLLLoss()

        self.log_freq = log_freq

        print("Total Parameters:", sum([p.nelement() for p in self.model.parameters()]))

    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)

    def iteration(self, epoch, data_loader, train=True):
        """
        loop over the data_loader for training or testing;
        if in train mode, the backward operation is activated
        and the model is auto-saved every epoch

        :param epoch: current epoch index
        :param data_loader: torch.utils.data.DataLoader for iteration
        :param train: boolean value of is train or test
        :return: None
        """
        str_code = "train" if train else "test"

        avg_loss = 0.0
        total_correct = 0
        total_element = 0

        for i, data in enumerate(data_loader):
            # 0. batch_data will be sent into the device (GPU or CPU)
            data = {key: value.to(self.device) for key, value in data.items()}

            # 1. forward the next_sentence_prediction and masked_lm model
            next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])

            # 2-1. NLL (negative log likelihood) loss of the is_next classification result
            next_loss = self.next_criterion(next_sent_output, data["is_next"])

            # 2-2. NLLLoss of predicting the masked token
            mask_loss = self.masked_criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

            # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
            loss = next_loss + mask_loss

            # 3. backward and optimization only in train
            if train:
                self.optim.zero_grad()
                loss.backward()
                self.optim.step()

            # next sentence prediction accuracy
            correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
            avg_loss += loss.item()
            total_correct += correct
            total_element += data["is_next"].nelement()

            post_fix = {
                "epoch": epoch,
                "iter": "[%d/%d]" % (i, len(data_loader)),
                "avg_loss": avg_loss / (i + 1),
                "mask_loss": mask_loss.item(),
                "next_loss": next_loss.item(),
                "avg_next_acc": total_correct / total_element * 100,
                "loss": loss.item()
            }

            if i % self.log_freq == 0:
                print(post_fix)

            # Logging for the PaperSpace metrics monitor
            # index = epoch * len(data_loader) + i
            # for code in ["avg_loss", "mask_loss", "next_loss", "avg_next_acc"]:
            #     print(json.dumps({"chart": code, "y": post_fix[code], "x": index}))

        print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_loader), "total_acc=",
              total_correct * 100.0 / total_element)

    def save(self, epoch, file_path="output/bert_trained.model"):
        """
        Saving the current BERT model on file_path

        :param epoch: current epoch number
        :param file_path: model output path, which becomes file_path + ".ep%d" % epoch
        :return: final_output_path
        """
        output_path = file_path + ".ep%d" % epoch
        torch.save(self.bert.cpu(), output_path)
        self.bert.to(self.device)
        print("EP:%d Model Saved on:" % epoch, output_path)
        return output_path

References

  1. BERT-pytorch: https://github.com/codertimo/BERT-pytorch/
  2. BERT source code (TensorFlow): https://github.com/google-research/bert
  3. An illustrated introduction to the Transformer: https://www.jianshu.com/p/e7d8caa13b21
  4. The transformers package: https://github.com/huggingface/transformers
  5. Attention explained (part 2): Self-Attention and the Transformer: https://zhuanlan.zhihu.com/p/47282410

This article originally appeared on 想飞的小菜鸡's personal site, vodkazy.cn.

Copyright notice: this is an original article by 想飞的小菜鸡, licensed under CC BY-NC-SA; when reposting, please include a link to the original and this notice.

Original link: https://vodkazy.cn/2020/10/14/我想去面试系列——BERT源码品读
