In [1]:
| python=3.7.6
transformers==2.4.1
torch==1.2.0
tensorflow==2.0.0
|
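The original cell is not shown in this export. A minimal sketch that would print a version listing like the one above, assuming the standard `__version__` attributes:

```python
# Print the environment versions (sketch; the original cell is not shown).
import platform
import transformers
import torch
import tensorflow as tf

print(f"python={platform.python_version()}")
print(f"transformers=={transformers.__version__}")
print(f"torch=={torch.__version__}")
print(f"tensorflow=={tf.__version__}")
```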
How to use BERT?
BERT open source: PyTorch
If you want to use the transformers module, follow this install guide.
BERT documentation
A description of how to use the transformers module.
Step1 - Setting
Import some libraries, and declare basic variables and functions in order to load and use BERT.
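The setup cell itself is not shown. A minimal sketch of what it might contain, assuming the model name and cache directory that appear in the outputs below, and a logger whose format matches the `MM/DD/YYYY HH:MM:SS - message` lines in the logs:

```python
# Sketch of the setup cell (assumed; the original code is not shown).
import logging
import torch
from transformers import BertConfig, BertModel, BertTokenizer, BertPreTrainedModel

# Log format matching the timestamped lines in the outputs below.
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)

# Assumed variable names; the values appear in the output of In [4].
MODEL_NAME = 'bert-base-cased'
CACHE_DIR = './cache'

logger.info('Hello World!')  # the original cell logs a similar greeting
```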
In [2]:
| -f:9: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
03/06/2020 18:57:01 - Hellow World!
|
In [3]:
In [4]:
| ('bert-base-cased', './cache')
|
Load configuration object for BERT
Required argument:
pretrained_model_name_or_path
: the name of the BERT model to use.
Optional:
cache_dir
: the cache directory where downloaded files are saved.
A file such as ./cache/b945b69218e98 ...
will be saved there.
REF: pack and unpack in Python.
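The configuration-loading cell is not shown. A minimal sketch, assuming the tuple shown in the output of In [4] is unpacked into the two arguments (hence the pack/unpack reference):

```python
from transformers import BertConfig

# Unpack the (model name, cache dir) tuple shown in In [4] into the two arguments.
pretrained_model_name_or_path, cache_dir = ('bert-base-cased', './cache')
config = BertConfig.from_pretrained(pretrained_model_name_or_path, cache_dir=cache_dir)
```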
In [5]:
| 03/06/2020 18:57:02 - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at ./cache/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.3d5adf10d3445c36ce131f4c6416aa62e9b58e1af56b97664773f4858a46286e
03/06/2020 18:57:02 - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 28996
}
03/06/2020 18:57:03 - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /home/kddlab/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
|
In [6]:
| 03/06/2020 18:57:03 - Optional: output all layers' states
|
Define a custom model to make use of BERT.
If you want, you can build a custom model on top of BERT.
I want to use BERT embeddings for my research, so I added a linear layer that can be trained for a specific task later.
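The model definition is not shown. A minimal sketch, assuming MyModel subclasses BertPreTrainedModel (which fits the `from_pretrained` log below), that the added head is named `linear` (as the uninitialized-weights message suggests), and that its output size is 300 (inferred from the shapes logged in Step 2):

```python
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel


class MyModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)                     # pretrained BERT encoder
        self.linear = nn.Linear(config.hidden_size, 300)  # 300 is assumed from the logged shapes
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        sequence_output = outputs[0]           # [batch, seq_len, 768]
        logits = self.linear(sequence_output)  # [batch, seq_len, 300]
        # With config.output_hidden_states = True, outputs[2] holds all 13 layer states.
        return logits, outputs


# `config` is the BertConfig loaded earlier.
config.output_hidden_states = True   # "output all layers' states", as In [6] logs
model = MyModel.from_pretrained('bert-base-cased', config=config, cache_dir='./cache')
```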
In [7]:
1
2
3
4
| 03/06/2020 18:57:04 - loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at ./cache/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
03/06/2020 18:57:06 - Weights of MyModel not initialized from pretrained model: ['linear.weight', 'linear.bias']
03/06/2020 18:57:06 - Weights from pretrained model not used in MyModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
|
Prepare the dataset to use
REF1 - BERT Word Embeddings Tutorials
REF2 - Visualization of BERT Embeddings with t-SNE
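The dataset-preparation cells (In [8]-[9]) are not shown. A minimal sketch, assuming the tokenizer matching the pretrained model and the example sentence used in Step 4 below (the names `tokenizer` and `input_ids` are assumptions):

```python
# Load the tokenizer matching the pretrained model and encode the example sentence.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

text = ("After stealing money from the bank vault, the bank robber "
        "was seen fishing on the Mississippi river bank.")
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])  # shape [1, 27]
```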
In [8]:
In [9]:
Step2 - Get BERT Embeddings by a Forward Pass
In [10]:
| 03/06/2020 18:57:06 - eval mode
03/06/2020 18:57:06 - torch.Size([1, 27, 300]), torch.Size([1, 27, 768]), 13
|
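The forward-pass cell is not shown. A minimal sketch, consistent with the shapes in the log above (a [1, 27, 300] head output, a [1, 27, 768] last hidden state, and 13 hidden-state tensors), using the names from the earlier sketches:

```python
# Run BERT in evaluation mode (dropout disabled) without building gradients.
model.eval()
with torch.no_grad():
    logits, outputs = model(input_ids)

last_hidden = outputs[0]      # [1, 27, 768] last-layer token embeddings
hidden_states = outputs[2]    # tuple of 13 tensors: embedding layer + 12 encoder layers
print(logits.shape, last_hidden.shape, len(hidden_states))
```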
In [11]:
| (13, torch.Size([1, 27, 768]))
|
In [12]:
| 03/06/2020 18:57:06 - torch.Size([768])
|
In [13]:
Reshape the hidden states of the BERT output for analysis.
I will reshape the BERT output into [#tokens, #layers, #features].
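A sketch of the reshaping, assuming the 12 encoder layers (without the embedding layer) are stacked first; the shapes match the log below:

```python
# [12, 1, 27, 768] -> [12, 27, 768] -> [27, 12, 768]
token_embeddings = torch.stack(hidden_states[1:], dim=0)  # stack the 12 encoder layers
token_embeddings = token_embeddings.squeeze(dim=1)        # drop the batch dimension
token_embeddings = token_embeddings.permute(1, 0, 2)      # [#tokens, #layers, #features]
print(token_embeddings.shape)
```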
In [14]:
| 03/06/2020 18:57:07 - torch.Size([12, 1, 27, 768])
03/06/2020 18:57:07 - torch.Size([12, 27, 768])
03/06/2020 18:57:07 - torch.Size([27, 12, 768])
|
Step3 - Create word and sentence vectors
Issue
Which layer, or combination of layers, provides the best representation? In the BERT paper, the authors compared the options by F1 score.
Conclusion: it differs by task. It depends on the situation and on what your application is.
The correct pooling strategy and the layers used (last four, all, last layer, etc.) depend on the application.
Word Vectors
There are many ways to create word vectors; two options are sketched after this list.
- Concatenate the last 4 layers. Each vector will have length $4 \times 768 = 3072$.
- Sum the last 4 layers. Each vector will have length $768$.
- Etc.
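A sketch of both options, assuming the [#tokens, #layers, #features] tensor from the previous step (the variable names are assumptions):

```python
# Option 1: concatenate the last 4 layers -> one 3072-dim vector per token.
word_vecs_cat = [torch.cat(tuple(tok[-4:]), dim=0) for tok in token_embeddings]

# Option 2: sum the last 4 layers -> one 768-dim vector per token.
word_vecs_sum = [torch.sum(tok[-4:], dim=0) for tok in token_embeddings]

print(len(word_vecs_sum), word_vecs_sum[0].shape)  # 27 tokens, 768 dims each
```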
In [15]:
In [16]:
| 03/06/2020 18:57:07 - # of tokens: 27, # of dim for each words: torch.Size([768])
|
Sentence Vectors
There are also many ways to create sentence vectors; take one option.
In this post, I take the average of all tokens' embeddings in the last layer.
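A sketch of that option; the [25, 768] shape in the log below suggests the [CLS] and [SEP] tokens were dropped before averaging, which is assumed here:

```python
# Average the last-layer token embeddings, excluding the special tokens.
token_vecs = hidden_states[-1].squeeze(0)[1:-1]      # [25, 768]
sentence_embedding = torch.mean(token_vecs, dim=0)   # [768]
print(token_vecs.shape, sentence_embedding.shape)
```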
In [17]:
| 03/06/2020 18:57:07 - torch.Size([25, 768])
03/06/2020 18:57:07 - torch.Size([768])
|
Step 4 - Analysis of a Case Study
According to section 3.4 of this article, the author notes that the "value of these vectors [sentence and word vectors] are in fact contextually dependent."
Let's look at the different instances of the word "bank" in our example sentence:
"After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
Note that each of the three occurrences of "bank" has a different contextual meaning.
In [18]:
| 03/06/2020 18:57:07 - After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.
03/06/2020 18:57:07 - tensor([[ 101, 1170, 11569, 1948, 1121, 1103, 3085, 13454, 117, 1103,
3085, 187, 12809, 3169, 1108, 1562, 5339, 1113, 1103, 5529,
14788, 9717, 8508, 2186, 3085, 119, 102]])
|
In [19]:
| 03/06/2020 18:57:07 - [6, 10, 24]
|
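The cell that finds the token positions is not shown. A minimal sketch that recovers the indices of the three "bank" tokens logged above (the variable names are assumptions):

```python
# Map the ids back to tokens and collect the positions of "bank".
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
bank_idx = [i for i, tok in enumerate(tokens) if tok == 'bank']
print(bank_idx)  # [6, 10, 24]
```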
Calculate the cosine similarity between the embed_words vectors from Step 3.
As humans, we can easily notice that:
'bank' has a similar meaning in the 'bank' vault and the 'bank' robber.
'bank' has a different meaning in the river 'bank' than in the 'bank' vault or the 'bank' robber.
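A sketch of the comparison, assuming `embed_words` holds the per-token vectors from Step 3 and `bank_idx` the indices found above; the pairing order of the three scores is an assumption:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pull out the three contextual "bank" vectors and compare them pairwise.
vault_bank, robber_bank, river_bank = (embed_words[i].reshape(1, -1).numpy() for i in bank_idx)

print(cosine_similarity(vault_bank, robber_bank))  # same sense: high similarity
print(cosine_similarity(vault_bank, river_bank))   # different sense: lower similarity
print(cosine_similarity(robber_bank, river_bank))
```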
In [23]:
| 03/06/2020 19:12:10 - 27, torch.Size([768])
03/06/2020 19:12:10 - ['bank', 'bank', 'bank']
03/06/2020 19:12:10 - score1=[[0.8953148]]| score2=[[0.7670008]]| score3=[[0.73296183]]
|
Report
According to the author of this article,
it is worth noting that word-level similarity comparisons are not appropriate with BERT embeddings, because these embeddings are contextually dependent. This makes direct word-to-word similarity comparisons less valuable.
Conclusion: since the vector representation changes with context, similarity comparisons between words are not very meaningful.
However, because BERT is designed so that the same sentence yields the same vectors, similarity between sentences can still be meaningful.