재형이's Growing Pains Journal
  • LSTM Classifier
    Posted on March 21, 2024, at 07:04:25
    Author: 재형이
    Lab Objectives

    • Understand how to use the main neural network architectures for natural language processing.
    • Build a sentiment analysis classifier that predicts whether a movie review is positive or negative.
    • Dataset: preprocessed IMDB data

    Problem Definition

    • Binary classifier (positive or negative)

    1. Types of RNN

    • many to one
      • Takes a movie review text (many) as input and outputs positive or negative (one)
      • Embedding: converts the movie review (text) into vectors
      • LSTM: a structure for processing sequential (time-series) data
      • Linear: produces the final output

    Model Structure Preview

    LSTMClassifier(
      (embedding): Embedding(121301, 256)
      (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.25)
      (dropout): Dropout(p=0.3, inplace=False)
      (fc): Linear(in_features=512, out_features=1, bias=True)
      (sigmoid): Sigmoid()
    )
    • torch.nn.LSTM
      • Parameters
        • input_size
        • hidden_size
        • num_layers
        • dropout
        • batch_first
        • bidirectional
      • Inputs: input, (h_0, c_0)
        • input shape
          • (L, N, H_in) : batch_first=False
          • (N, L, H_in) : batch_first=True
      • Outputs: output, (h_n, c_n)
        • output shape
          • (L, N, D*H_out) : batch_first=False
          • (N, L, D*H_out) : batch_first=True
    • N: batch size
    • L: sequence length
    • D: 2 if bidirectional=True, otherwise 1
    • H_in: input_size
    • H_cell: hidden_size
    • H_out: proj_size or hidden_size
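    For reference, a minimal shape check (with hypothetical sizes, not from the original exercise) that confirms the batch_first=True convention used below:
    import torch
    from torch import nn
    
    # hypothetical sizes: N=4, L=10, H_in=256, H_cell=H_out=512, D=1 (not bidirectional)
    lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
    x = torch.randn(4, 10, 256)   # (N, L, H_in)
    output, (h_n, c_n) = lstm(x)
    print(output.shape)           # torch.Size([4, 10, 512]) -> (N, L, D*H_out)
    print(h_n.shape)              # torch.Size([2, 4, 512])  -> (D*num_layers, N, H_out)
    print(c_n.shape)              # torch.Size([2, 4, 512])  -> (D*num_layers, N, H_cell)
    The classifier below follows this convention.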
    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embedding_size=400):
            super(LSTMClassifier, self).__init__()
            # token ids -> dense vectors of size embedding_size
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            # 2-layer LSTM, hidden size 512, expects (N, L, H_in) because batch_first=True
            self.lstm = nn.LSTM(embedding_size, 512, 2, dropout=0.25, batch_first=True)
            self.dropout = nn.Dropout(0.3)
            # 512 hidden features -> 1 value, squashed into [0, 1] by the sigmoid
            self.fc = nn.Linear(512, 1)
            self.sigmoid = nn.Sigmoid()
    
        def forward(self, x):
            x = x.long()             # embedding expects integer token ids
            x = self.embedding(x)    # (N, L) -> (N, L, embedding_size)
            o, _ = self.lstm(x)      # (N, L, 512)
            o = o[:, -1, :]          # keep only the output at the last time step: (N, 512)
            o = self.dropout(o)
            o = self.fc(o)           # (N, 1)
            o = self.sigmoid(o)      # probability of the positive class
    
            return o

    2. Text Data Preprocessing

    • Converting text into vectors (a small end-to-end toy example follows this list)
      1. Original text
        • one reviewer mentioned watching oz episode hooked
      2. Build the vocabulary
        • Extract the words appearing in the review sentences and assign a number to each word
        • ['one', 'reviewer', 'mentioned', 'watching', 'oz', 'episode', 'hooked']
      3. Encode the reviews
        • Convert the words in each review into their numeric ids
        • {'i': 1, 'movie': 2, 'film': 3, 'the': 4, 'one': 5, 'like': 6, 'it': 7, 'time': 8, 'this': 9, 'good': 10, 'character': 11,...}
        • [5, 1095, 972, 74, 2893, 186, 2982, 119, 114, 538]
      4. Match the lengths: padding or trim
        • Adjust every review to a fixed length so it can be used as input to the network
        • Long sentences are trimmed and short sentences are padded
        • [[ 191, 1083, 930, 81, 3724, 186, 3030, 1, 118, 114],
          [ 47, 328, 59, 244, 1, 7, 1267, 1608, 17875, 4],
          [ 3, 95, 328, 30, 1041, 13, 845, 1774, 2633, 2],...]
      5. Split into training and test sets
      6. Create the data loaders
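    As a quick illustration of steps 2-4, here is a toy sketch (the two sentences are made up for illustration, not taken from the actual dataset):
    from collections import Counter
    import numpy as np
    
    # toy corpus standing in for the real reviews
    toy_reviews = ["one reviewer mentioned watching oz", "good movie one watching"]
    
    # step 2: build the vocabulary, most frequent words first; id 0 is reserved for <PAD>
    counter = Counter(' '.join(toy_reviews).split())
    vocab = sorted(counter, key=counter.get, reverse=True)
    word2int = {word: idx for idx, word in enumerate(vocab, 1)}
    
    # step 3: encode each review as a list of word ids
    encoded = [[word2int[w] for w in r.split()] for r in toy_reviews]
    
    # step 4: pad (or trim) every review to the same length, using 0 as the pad id
    seq_length = 6
    padded = np.zeros((len(encoded), seq_length), dtype=int)
    for i, row in enumerate(encoded):
        padded[i, :len(row)] = row[:seq_length]
    print(padded)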

    3. tqdm

    • Wraps an iterable to display a progress bar
    • Used to show progress over the iterations of a training loop
    from tqdm import tqdm
    import time
    
    for i in tqdm(range(100)):
        time.sleep(0.05)
        pass
    • tqdm parameters
      • iterable: the iterable to wrap
      • desc: text displayed in front of the progress bar
      • leave: whether to keep the finished bar on screen
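    For example, a short (made-up) loop using desc and leave:
    from tqdm import tqdm
    import time
    
    # labeled progress bar that is cleared from the screen once the loop finishes
    for i in tqdm(range(100), desc='Preprocessing', leave=False):
        time.sleep(0.01)
    The training loop below wraps the epoch range the same way: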
    # train loop
    epochloop = tqdm(range(epochs), desc='Training')

    4. Early Stopping

    es_trigger = 0
    es_limit = 5
    val_loss_min = torch.inf  # best validation loss seen so far
    
    for e in epochloop:
        train_loss, train_acc = train(model, trainloader)
        val_loss, val_acc = validation(model, valloader)
    
        # save the model whenever the validation loss decreases
        if val_loss / len(valloader) <= val_loss_min:
            torch.save(model.state_dict(), './sentiment_lstm.pt')
            val_loss_min = val_loss / len(valloader)
            es_trigger = 0
        else:       
            es_trigger += 1
    
        # early stop
        if es_trigger >= es_limit:
            break

    Sentiment Analysis

    [Step1] Load libraries & Datasets

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    import torch
    from torch import nn
    from torch.optim import Adam
    from torch.utils.data import TensorDataset, DataLoader
    
    import os
    from tqdm import tqdm
    tqdm.pandas()
    from collections import Counter
    
    data = pd.read_csv('exercise4.csv')
    data.head()

    # lowercase the reviews and strip every character that is not a letter or a space
    data['processed'] = data['processed'].str.lower().replace(r"[^a-zA-Z ]", "", regex=True)
    
    data['processed'][0]
    • one reviewer mentioned watching oz episode hooked  they right  exactly happened  the first thing struck oz brutality unflinching scene violence  set right word go  trust  show faint hearted timid  this show pull punch regard drug  sex violence  its hardcore  classic use word  it called oz nickname given oswald maximum security state penitentary  it focus mainly emerald city  experimental section prison cell glass front face inwards  privacy high agenda  em city home many   aryans  muslims  gangsta  latinos  christians  italians  irish     scuffle  death stare  dodgy dealing shady agreement never far away  i would say main appeal show due fact go show dare  forget pretty picture painted mainstream audience  forget charm  forget romance    oz mess around  the first episode i ever saw struck nasty surreal  i say i ready  i watched  i developed taste oz  got accustomed high level graphic violence  not violence  injustice crooked guard sold nickel  inmate kill order get away  well mannered  middle class inmate turned prison bitch due lack street skill prison experience watching oz  may become comfortable uncomfortable viewing     thats get touch darker side 

    Build the Vocabulary

    # tokenize: collect the words contained in every review
    reviews = data['processed'].values
    words = ' '.join(reviews).split()
    words[:10]
    • ['one',
       'reviewer',
       'mentioned',
       'watching',
       'oz',
       'episode',
       'hooked',
       'they',
       'right',
       'exactly']
    counter = Counter(words)
    # more frequent words get smaller ids; id 0 is reserved for the <PAD> token
    vocab = sorted(counter, key=counter.get, reverse=True)
    int2word = dict(enumerate(vocab, 1))
    int2word[0] = '<PAD>'
    word2int = {word: id for id, word in int2word.items()}
    word2int
    • {'i': 1,
       'movie': 2,
       'film': 3,
       'the': 4,
       'one': 5,
       'like': 6,
       ...}

    Encode the Reviews

    reviews_enc = [[word2int[word] for word in review.split()] for review in tqdm(reviews)]
    
    reviews_enc[0][:10]
    
    data['processed'][0]
    
    word2int['one'], word2int['reviewer'], word2int['mentioned']
    
    data['encoded'] = reviews_enc

    Match the Lengths

    def pad_features(reviews, pad_id, seq_length=128):
        # start from an array completely filled with pad_id (e.g. np.full((5, 3), 2) -> a 5x3 array of 2s)
        features = np.full((len(reviews), seq_length), pad_id, dtype=int)
    
        for i, row in enumerate(reviews):
            # reviews longer than seq_length are trimmed; shorter ones keep trailing pad ids
            features[i, :len(row)] = np.array(row)[:seq_length]
    
        return features
    
    seq_length = 256
    features = pad_features(reviews_enc, pad_id=word2int['<PAD>'], seq_length=seq_length)
    
    assert len(features) == len(reviews_enc)
    assert len(features[0]) == seq_length
    
    labels = data['label'].to_numpy()
    labels
    • array([1, 1, 1, ..., 0, 0, 0])

    Split the Data

    # train test split
    train_size = .8
    split_id = int(len(features) * train_size)
    train_x, test_x, train_y, test_y = features[:split_id], features[split_id:], labels[:split_id], labels[split_id:]
    
    split_id = int(len(train_x) * train_size)
    train_x, valid_x, train_y, valid_y = train_x[:split_id], train_x[split_id:], train_y[:split_id], train_y[split_id:]
    print('Train shape:{}, Valid shape: {}, Test shape: {}'.format(train_x.shape, valid_x.shape, test_x.shape))
    print('Train shape:{}, Valid shape: {}, Test shape: {}'.format(train_y.shape, valid_y.shape, test_y.shape))
    • Train shape:(32000, 256), Valid shape: (8000, 256), Test shape: (10000, 256)
      Train shape:(32000,), Valid shape: (8000,), Test shape: (10000,)
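    Note that the labels shown earlier look ordered (all 1s first, then all 0s), so a purely positional split like the one above can leave the test set with a skewed label distribution. A hedged sketch (my own addition, assuming the data really is sorted) that could be run before the split:
    # shuffle features and labels together before the positional split
    rng = np.random.default_rng(42)
    perm = rng.permutation(len(features))
    features, labels = features[perm], labels[perm]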

    [Step2] Create DataLoader

    # set hyperparameter
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(device)
    
    lr = 0.001
    batch_size = 128
    vocab_size = len(word2int)
    embedding_size = 256
    dropout=0.25
    
    epochs = 8
    history = {
        'train_loss': [],
        'train_acc': [],
        'val_loss': [],
        'val_acc': [],
        'epochs': epochs
    }
    es_limit = 5
    
    trainset = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
    validset = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))
    testset = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
    
    trainloader = DataLoader(trainset, shuffle=True, batch_size=batch_size)
    valloader = DataLoader(validset, shuffle=True, batch_size=batch_size)
    testloader = DataLoader(testset, shuffle=True, batch_size=batch_size)
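    A quick sanity check (optional, not in the original exercise) that one batch comes out with the expected shape:
    X, y = next(iter(trainloader))
    print(X.shape, y.shape)   # expected: torch.Size([128, 256]) torch.Size([128])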

    [Step3] Set Network Structure

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embedding_size=400):
            super(LSTMClassifier, self).__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            self.lstm = nn.LSTM(embedding_size, 512, 2, dropout=0.25, batch_first=True)
            self.dropout = nn.Dropout(0.3)
            self.fc = nn.Linear(512, 1)
            self.sigmoid = nn.Sigmoid()
    
        def forward(self, x):
            x = x.long()
            x = self.embedding(x)
            o, _ =  self.lstm(x)
            o = o[:, -1, :]
            o = self.dropout(o)
            o = self.fc(o)
            o = self.sigmoid(o)
    
            return o

    [Step4] Create Model instance

    model = LSTMClassifier(vocab_size, embedding_size).to(device)
    print(model)
    • LSTMClassifier(
        (embedding): Embedding(96140, 256)
        (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.25)
        (dropout): Dropout(p=0.3, inplace=False)
        (fc): Linear(in_features=512, out_features=1, bias=True)
        (sigmoid): Sigmoid()
      )

    [Step5] Model compile

    criterion = nn.BCELoss()
    optim = Adam(model.parameters(), lr=lr)
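    Since the model already ends in a Sigmoid, BCELoss is the matching criterion. A common variation (not what this notebook does) is to drop the Sigmoid from the model and use the numerically more stable combined loss:
    # only valid if the final Sigmoid layer is removed from LSTMClassifier
    criterion = nn.BCEWithLogitsLoss()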

    [Step6] Set train loop

    def train(model, trainloader):
        model.train()
    
        train_loss = 0
        train_acc = 0
    
        for id, (X, y) in enumerate(trainloader):
            X, y = X.to(device), y.to(device)        
            optim.zero_grad()
            y_pred = model(X)
            loss = criterion(y_pred.squeeze(), y.float())
            loss.backward()
            optim.step()
    
            train_loss += loss.item()
            # threshold the probabilities at 0.5 to get hard 0/1 predictions
            y_pred = torch.tensor([1 if i == True else 0 for i in y_pred > 0.5], device=device)
            equals = y_pred == y
            acc = torch.mean(equals.type(torch.FloatTensor))
            train_acc += acc.item()
    
        history['train_loss'].append(train_loss / len(trainloader))
        history['train_acc'].append(train_acc / len(trainloader))
    
        return train_loss, train_acc

    [Step7] Set validation loop

    def validation(model, valloader):
        model.eval()
    
        val_loss = 0
        val_acc = 0
    
        with torch.no_grad():
            for id, (X, y) in enumerate(valloader):            
                X, y = X.to(device), y.to(device)
                y_pred = model(X)
                loss = criterion(y_pred.squeeze(), y.float())
                
                val_loss += loss.item()
                
                y_pred = torch.tensor([1 if i == True else 0 for i in y_pred > 0.5], device=device)
                equals = y_pred == y
                acc = torch.mean(equals.type(torch.FloatTensor))
                val_acc += acc.item()
    
            history['val_loss'].append(val_loss / len(valloader))
            history['val_acc'].append(val_acc / len(valloader))
    
        return val_loss, val_acc

    [Step8] Run model

    # train loop
    epochloop = tqdm(range(epochs), desc='Training')
    
    # early stop trigger
    es_trigger = 0
    val_loss_min = torch.inf
    
    for e in epochloop:
        train_loss, train_acc = train(model, trainloader)
        val_loss, val_acc = validation(model, valloader)
        epochloop.write(f'Epoch[{e+1}/{epochs}] Train Loss: {train_loss / len(trainloader):.3f}, Train Acc: {train_acc / len(trainloader):.3f}, Val Loss: {val_loss / len(valloader):.3f}, Val Acc: {val_acc / len(valloader):.3f}')
    
        # save model if validation loss decrease
        if val_loss / len(valloader) <= val_loss_min:
            torch.save(model.state_dict(), './sentiment_lstm.pt')
            val_loss_min = val_loss / len(valloader)
            es_trigger = 0
        else:       
            es_trigger += 1
    
        # early stop
        if es_trigger >= es_limit:
            epochloop.write(f'Early stopped at Epoch-{e+1}')
            history['epochs'] = e+1
            break
    • Training:   0%|          | 0/8 [01:01<?, ?it/s]Epoch[1/8] Train Loss: 0.694, Train Acc: 0.503, Val Loss: 0.697, Val Acc: 0.502
      Training:  12%|█▎        | 1/8 [02:04<07:09, 61.43s/it]Epoch[2/8] Train Loss: 0.694, Train Acc: 0.499, Val Loss: 0.693, Val Acc: 0.498
      Training:  25%|██▌       | 2/8 [03:11<06:16, 62.80s/it]Epoch[3/8] Train Loss: 0.693, Train Acc: 0.496, Val Loss: 0.693, Val Acc: 0.514
      Training:  38%|███▊      | 3/8 [04:19<05:22, 64.56s/it]Epoch[4/8] Train Loss: 0.693, Train Acc: 0.501, Val Loss: 0.692, Val Acc: 0.514
      Training:  62%|██████▎   | 5/8 [05:28<03:20, 66.83s/it]Epoch[5/8] Train Loss: 0.692, Train Acc: 0.511, Val Loss: 0.697, Val Acc: 0.497
      Training:  75%|███████▌  | 6/8 [06:36<02:14, 67.43s/it]Epoch[6/8] Train Loss: 0.695, Train Acc: 0.498, Val Loss: 0.696, Val Acc: 0.497
      Training:  88%|████████▊ | 7/8 [07:45<01:07, 67.91s/it]Epoch[7/8] Train Loss: 0.693, Train Acc: 0.506, Val Loss: 0.694, Val Acc: 0.497
      Training: 100%|██████████| 8/8 [08:54<00:00, 66.83s/it]Epoch[8/8] Train Loss: 0.691, Train Acc: 0.506, Val Loss: 0.693, Val Acc: 0.514
    # plot accuracy
    plt.figure(figsize=(6, 4))
    plt.plot(range(history['epochs']), history['train_acc'], label='Train Acc')
    plt.plot(range(history['epochs']), history['val_acc'], label='Val Acc')
    plt.legend()
    plt.show()

    # plot loss
    plt.figure(figsize=(6, 4))
    plt.plot(range(history['epochs']), history['train_loss'], label='Train Loss')
    plt.plot(range(history['epochs']), history['val_loss'], label='Val Loss')
    plt.legend()
    plt.show()
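    The test split created in Step 2 is never used above. A minimal sketch that reuses the validation function and the best saved checkpoint to measure test performance:
    # load the best checkpoint saved during training and evaluate on the held-out test set
    # (note: validation() will also append these numbers to history['val_loss'] / history['val_acc'])
    model.load_state_dict(torch.load('./sentiment_lstm.pt'))
    test_loss, test_acc = validation(model, testloader)
    print(f'Test Loss: {test_loss / len(testloader):.3f}, Test Acc: {test_acc / len(testloader):.3f}')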
