Understanding the Attention Mechanism

yjam 2020. 1. 10. 19:31

Scaled dot-product attention is computed as follows.

$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right) V $

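Here Q is the query, K is the key, V is the value, and $ d_{k} $ is the dimensionality of the keys.

In "encoder-decoder attention":
Q : the hidden state of the previous decoder layer
K : the encoder output states
V : the encoder output states

In "self-attention":
Q=K=V : the encoder output states (all three come from the same sequence)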
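The formula and the two cases above can be sketched in NumPy as follows. This is a minimal illustration, not a full Transformer layer: the function names, the random inputs, and the sizes (4 encoder positions, 2 decoder positions, hidden size 3) are assumptions made for the example, and the learned linear projections that real models apply to Q, K, and V are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # (n_queries, d_v), (n_queries, n_keys)

# Hypothetical inputs, chosen only to make the shapes visible.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))   # 4 source positions, hidden size 3
decoder_states = rng.normal(size=(2, 3))   # 2 target positions, hidden size 3

# Encoder-decoder attention: queries from the decoder, keys and values from the encoder.
cross_out, cross_w = scaled_dot_product_attention(
    decoder_states, encoder_states, encoder_states)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = scaled_dot_product_attention(
    encoder_states, encoder_states, encoder_states)

print(cross_out.shape, cross_w.shape)  # (2, 3) (2, 4)
print(self_out.shape, self_w.shape)    # (4, 3) (4, 4)
```

The only difference between the two calls is where the queries come from; the attention computation itself is identical.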

Example: seq2seq + attention

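Step 1 : Obtain a score (a scalar) for every encoder hidden state.

In this example, the score is the inner product of the decoder hidden state ($D$) and each encoder hidden state ($E$).

$ E \cdot D^{T} $

$ E $ has shape $(4,3)$ and

$ D^{T} $ has shape $(3,1)$,

so the product $ E \cdot D^{T} $ has shape $(4,3) \times (3,1) \rightarrow (4,1)$: one score per encoder hidden state.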

decoder_hidden = [10, 5, 10]

encoder_hidden   score
----------------------
[0, 1, 1]           15   (= 10×0 + 5×1 + 10×1)
[5, 0, 1]           60   (= 10×5 + 5×0 + 10×1)
[1, 1, 0]           15   (= 10×1 + 5×1 + 10×0)
[0, 5, 1]           35   (= 10×0 + 5×5 + 10×1)

In this example, the encoder hidden state [ 5, 0, 1 ] receives the highest attention score, 60.

This means that the next word to be translated is strongly influenced by the encoder hidden state [ 5, 0, 1 ].
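The score table above can be reproduced with a single matrix product. The snippet below is a small sketch using the same numbers; the variable names E and D simply follow the notation of Step 1.

```python
import numpy as np

E = np.array([[0, 1, 1],
              [5, 0, 1],
              [1, 1, 0],
              [0, 5, 1]])          # encoder hidden states, shape (4, 3)
D = np.array([[10, 5, 10]])        # decoder hidden state as a row vector, shape (1, 3)

scores = E @ D.T                   # (4, 3) @ (3, 1) -> (4, 1)

print(scores.shape)                # (4, 1)
print(scores.ravel())              # [15 60 15 35]
print(int(np.argmax(scores)))      # 1, i.e. the row [5, 0, 1]
```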

Step 2 : Apply the softmax function to all of the scores.

import numpy as np
x = np.array([[15],[60],[15],[35]])
softmax_x = np.exp(x) / np.sum(np.exp(x), axis=0)  # normalize over the 4 scores

print(x.shape)
print(softmax_x)

# (4, 1)
# [[2.86251858e-20]
#  [1.00000000e+00]
#  [2.86251858e-20]
#  [1.38879439e-11]]

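The softmax output is approximately [ 0, 1, 0, 0 ].

(The values are not binary: each one is a floating-point number between 0 and 1, and together they sum to 1.)

The softmax output represents the attention distribution.

Here the attention distribution is concentrated almost entirely on the encoder hidden state [ 5, 0, 1 ].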
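One practical caveat with computing softmax directly from np.exp is that large scores can overflow (np.exp(1000), for instance, evaluates to inf). A common remedy is to subtract the maximum score first, which leaves the result mathematically unchanged. The sketch below uses a helper named softmax, a name chosen here for illustration. It also shows, as an assumption for comparison only, what happens when the scores are divided by $\sqrt{d_k}$ with $d_k = 3$ as in the formula at the top; the worked example in this post does not apply that scaling.

```python
import numpy as np

def softmax(x, axis=0):
    # Shifting by the max leaves the result unchanged but avoids overflow in exp.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

scores = np.array([[15.0], [60.0], [15.0], [35.0]])

weights = softmax(scores)
print(weights.ravel())             # same values as the direct computation above
print(weights.sum())               # sums to 1 up to floating-point rounding

# Scaled variant, assuming d_k = 3 (the hidden size in this example).
scaled_weights = softmax(scores / np.sqrt(3))
print(scaled_weights.ravel())      # still peaked at index 1, but less extremely
```

Scaling shrinks the gaps between scores, so the distribution stays peaked at the same position while the other positions receive slightly more weight.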

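Step 3 : Multiply each encoder hidden state by its softmax output.

The softmax output has shape $ (4, 1) $.

The encoder hidden states have shape $ (4, 3) $.

This is an element-wise multiplication, not a dot product $ (\cdot) $: each encoder hidden state (row) is scaled by its own softmax weight.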

import numpy as np

encoder_hidden_state = np.array([[0,1,1],[5,0,1],[1,1,0],[0,5,1]])
x                    = np.array([[15],[60],[15],[35]])
softmax_x            = np.exp(x) / np.sum(np.exp(x), axis=0)

print(softmax_x, softmax_x.shape)
print(encoder_hidden_state, encoder_hidden_state.shape)

# Element-wise product with broadcasting: (4, 3) * (4, 1) -> (4, 3)
multiply_vector = np.multiply(encoder_hidden_state, softmax_x)

print("multiply_vector")
print(multiply_vector, multiply_vector.shape)

Here, multiply_vector is called the alignment vector [1] or the annotation vector [2].

This is the mechanism by which attention takes place.

The steps so far are summarized below.

encoder     score   softmax   alignment
---------------------------------------
[0, 1, 1]      15         0   [0, 0, 0]
[5, 0, 1]      60         1   [5, 0, 1]
[1, 1, 0]      15         0   [0, 0, 0]
[0, 5, 1]      35         0   [0, 0, 0]

(The softmax and alignment values are rounded.)

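Looking at the alignment (the product of each encoder hidden state and its softmax weight), every alignment except the one for [ 5, 0, 1 ] has become approximately 0 because of its low attention score.

This suggests that the first translated word (the first decoder hidden state) can be expected to relate to the embedding [ 5, 0, 1 ].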

Step 4 : Sum the alignment vectors.

Summing the alignment vectors yields the context vector [1], [2].

The context vector aggregates the information in the alignment vectors from the previous step.

alignment_vector = multiply_vector

context = np.sum(alignment_vector, axis=0)
print(context, context.shape)

# [5.00000000e+00 6.94397194e-11 1.00000000e+00] (3,)
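Steps 3 and 4 together (scale each row, then sum over rows) are equivalent to a single matrix product between the transposed weights and the encoder hidden states. The sketch below checks this on the numbers from this post; the variable names context_stepwise and context_matmul are chosen here for illustration.

```python
import numpy as np

E = np.array([[0, 1, 1], [5, 0, 1], [1, 1, 0], [0, 5, 1]], dtype=float)  # (4, 3)
D = np.array([[10, 5, 10]], dtype=float)                                  # (1, 3)

scores = E @ D.T                                   # step 1: (4, 1)
exp = np.exp(scores - scores.max())                # step 2: softmax, shifted for stability
weights = exp / exp.sum()                          # (4, 1)

context_stepwise = np.sum(E * weights, axis=0)     # steps 3 and 4: (3,)
context_matmul = (weights.T @ E).ravel()           # (1, 4) @ (4, 3) -> (1, 3) -> (3,)

print(np.allclose(context_stepwise, context_matmul))  # True
print(context_matmul.shape)                            # (3,)
```

This is the same structure as the softmax(...)V term in the scaled dot-product formula, with the encoder hidden states playing the role of V.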