💬 NLP

💬 NLP/PLM

[Paper Review] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (ACL 2020)

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks 0. Abstract Models pre-trained on text from a wide variety of sources form the foundation of today's NLP. The paper examines whether it still helps to tailor a pre-trained model to the domain of the target task. A study across four domains (biomedical, computer science publications, news, reviews) and eight classification tasks shows that `domain-adaptive pretraining` improves performance in both high-resource and low-resource settings. Adapting to unlabeled data, `..
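
Below is a minimal sketch of what continuing pretraining on in-domain text looks like in practice, assuming the Hugging Face `transformers` and `datasets` libraries rather than the paper's own code; the model choice, corpus file name, and hyperparameters are illustrative placeholders.

```python
# Sketch of domain-adaptive pretraining (DAPT): keep training the original
# masked-LM objective on unlabeled in-domain text before task fine-tuning.
# "roberta-base", "domain_corpus.txt" and the hyperparameters are placeholders.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Unlabeled in-domain corpus (e.g. biomedical abstracts), one document per line.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking with the same objective used during the original pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-roberta", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # the adapted checkpoint is then fine-tuned on the labeled target task
```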

💬 NLP/PLM

[Paper Review] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019)

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension 0. Abstract Proposes `BART`, a denoising autoencoder for pre-training sequence-to-sequence models. BART is trained in two steps: text is corrupted with an arbitrary `noising function`, and the model is trained to reconstruct the original text from the corrupted text. Despite its simplicity, it generalizes BERT (owing to its bidirectional encoder), GPT (its left-to-right decoder), and other current state-of-the-art pre-training sche..
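
The corruption-then-reconstruction setup can be illustrated with a toy noising function; this is only a sketch of the idea (span masking, i.e. text infilling), not the paper's implementation, and the example sentence and mask ratio are made up.

```python
# Toy illustration of a BART-style training pair: the encoder sees text corrupted
# by a noising function, the decoder is trained to reconstruct the original text.
import random

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """Replace one contiguous span of tokens with a single mask token."""
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = rng.randint(0, len(tokens) - span_len)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

original = "the quick brown fox jumps over the lazy dog".split()
corrupted = text_infilling(original)
print("encoder input :", " ".join(corrupted))   # corrupted text
print("decoder target:", " ".join(original))    # original text to reconstruct
```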

💬 NLP/PLM

[Paper Review] GPT-2: Language Models are Unsupervised Multitask Learners (2019)

Language Models are Unsupervised Multitask Learners 0. Abstract NLP tasks such as Question Answering, Machine Translation, Reading Comprehension, and Summarization typically rely on supervised learning with task-specific datasets. The paper demonstrates that a Language Model begins to perform these tasks without any explicit supervision when trained on `WebText`, a dataset of millions of webpages. The capacity of the Language Model is essential to the success of `zero-shot` task transfer, and as it improves, task ..
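
As a rough illustration of zero-shot task transfer, a task can be posed purely as a language-modeling prompt; the sketch below assumes the Hugging Face `transformers` library and a made-up article, and uses the "TL;DR:" framing the paper describes for summarization.

```python
# Zero-shot summarization sketch: no task-specific fine-tuning, only a prompt that
# the language model continues. Assumes the `transformers` library; text is made up.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

article = "A new study finds that regular exercise improves memory and sleep quality."
prompt = article + "\nTL;DR:"  # the task is specified implicitly by the prompt

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```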

💬 NLP/PLM

[Paper Review] RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)

RoBERTa: A Robustly Optimized BERT Pretraining Approach 0. Abstract Pre-training of Language Models has brought substantial performance gains, but careful comparison between the different approaches is needed. Training is computationally expensive and is often done on private datasets of varying sizes, and the choice of hyperparameters has a large impact on the final results. The paper is a replication study that carefully measures the impact of several key hyperparameters and of training data size on `BERT`. It finds that BERT was significantly undertrained and that BERT alone can beat the performance of the models developed after it ..

💬 NLP/PLM

[Paper Review] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 0. Abstract Introduces a new language representation model called `BERT (Bidirectional Encoder Representations from Transformers)`. Unlike the language representation models of the time, BERT is designed to pre-train Deep Bidirectional Representations from unlabeled text by jointly conditioning on both left and right context in all layers. With just one additional output layer, BERT can be fine-tu..
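
The "one additional output layer" point can be made concrete with a small sketch in plain PyTorch (not the paper's code): a pre-trained bidirectional encoder is assumed given, and a single linear layer over the `[CLS]` representation produces the task logits.

```python
# Minimal fine-tuning sketch: pre-trained encoder + one extra output layer.
# The encoder interface (returning per-token hidden states) is an assumption here.
import torch.nn as nn

class BertClassifierSketch(nn.Module):
    def __init__(self, pretrained_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder                      # pre-trained BERT body
        self.classifier = nn.Linear(hidden_size, num_labels)   # the single added layer

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)       # (batch, seq_len, hidden)
        cls_vector = hidden[:, 0]                              # [CLS] token representation
        return self.classifier(cls_vector)                     # task logits, trained end to end
```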

💬 NLP/PLM

[NLP] GPT (Generative Pre-Training of a Language Model)

Motivation The underlying idea is the same as `ELMo`: use an Unlabeled Text Corpus for `pre-training` with GPT to obtain embedding vectors, then carry out a specific task by `fine-tuning` on a Labeled Text Corpus for that task. Using more than word-level information from unlabeled text is difficult: it is unclear which optimization objective is most effective for learning text representations useful for `transfer`, and transferring the learned representations to a target task requires task-specific changes to the model architecture, intricate l..
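
A toy sketch of the two-stage recipe (not the paper's code, with a plain embedding standing in for the Transformer decoder): stage 1 trains a left-to-right language-modeling objective on unlabeled text, stage 2 reuses the same network with a small task head on labeled data.

```python
# Two-stage sketch: unsupervised LM pre-training, then supervised fine-tuning.
# Sizes, the embedding stand-in, and the random data are illustrative only.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)     # stand-in for a Transformer decoder
lm_head = nn.Linear(hidden, vocab_size)      # pre-training head: next-token prediction
task_head = nn.Linear(hidden, 2)             # fine-tuning head: e.g. binary classification

def pretraining_loss(token_ids):             # stage 1: language-modeling objective
    h = embed(token_ids)
    logits = lm_head(h[:, :-1])              # predict token t+1 from position t
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), token_ids[:, 1:].reshape(-1))

def finetuning_loss(token_ids, labels):      # stage 2: supervised task objective
    h = embed(token_ids)
    return nn.functional.cross_entropy(task_head(h[:, -1]), labels)

tokens = torch.randint(0, vocab_size, (4, 10))
print(pretraining_loss(tokens).item())
print(finetuning_loss(tokens, torch.tensor([0, 1, 0, 1])).item())
```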

💬 NLP/PLM

[NLP] ELMo (Embeddings from Language Models)

Pre-trained word representation Pre-trained word representations are a key component of many neural language understanding models. A high-quality representation must be able to model two things: the complex characteristics of a word (e.g., syntax and semantics), and how a word is used differently across linguistic contexts, giving it a representation that fits each usage. For example, the Korean word "눈" can mean "eye" or "snow", and its embedding should differ accordingly. Characteristics of ELMo (Embeddings from Language Models): rather than focusing on individual words as before, it considers the entire input sentence ..
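
The contrast with a static lookup-table embedding can be shown with a toy contextual encoder (not the original biLM, just a small BiLSTM with a made-up vocabulary): the same word receives different vectors in different sentences.

```python
# Toy contextual-embedding sketch: the same word ("눈") gets different vectors
# depending on its sentence, unlike a static per-word embedding.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"my": 0, "눈": 1, "hurts": 2, "white": 3, "falls": 4}
embed = nn.Embedding(len(vocab), 8)
bilstm = nn.LSTM(8, 8, bidirectional=True, batch_first=True)  # stand-in for ELMo's biLM

def contextual_vector(sentence, word):
    ids = torch.tensor([[vocab[w] for w in sentence]])
    hidden, _ = bilstm(embed(ids))                  # context-dependent hidden states
    return hidden[0, sentence.index(word)]

v_eye = contextual_vector(["my", "눈", "hurts"], "눈")      # "눈" used as "eye"
v_snow = contextual_vector(["white", "눈", "falls"], "눈")  # "눈" used as "snow"
print(torch.allclose(v_eye, v_snow))  # False: embeddings differ by context
```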

💬 NLP/Attention & Transformer

[Paper Review] Attention Is All You Need (NIPS 2017)

Attention Is All You Need (NIPS 2017) 0. Abstract Sequence transduction models are based on `RNN`s or `CNN`s and include an `Encoder-Decoder` structure, and even the best-performing models connect the Encoder and Decoder with an `Attention mechanism`. The paper proposes `Transformer`, a new network based solely on the Attention mechanism that dispenses with RNNs and CNNs entirely. The Transformer is parallelizable and takes less time to train while achieving superior performance on two translation tasks, scoring 28.4 BLEU on the WMT 2014 English-German Translation task, the best ..
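
At the core of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V; the sketch below is a single-head, unmasked version for illustration, with arbitrary example shapes.

```python
# Scaled dot-product attention (single head, no masking), the building block of
# the Transformer; tensor shapes below are arbitrary examples.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values

q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
k = torch.randn(2, 7, 64)   # (batch, key positions, d_k)
v = torch.randn(2, 7, 64)   # (batch, key positions, d_v)
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([2, 5, 64])
```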

💬 NLP/Attention & Transformer

[Paper Review] Effective Approaches to Attention-based Neural Machine Translation (EMNLP 2015)

Effective Approaches to Attention-based Neural Machine Translation (EMNLP 2015) 0. Abstract The `Attention` mechanism is used to improve NMT (Neural Machine Translation) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring architectures that use attention more effectively for NMT. The paper presents two simple and effective Attention Mechanisms: a `global` attentional model that always uses all source words, and one that uses only a subset of the source words at a time, the `loc..
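
A small sketch of the `global` variant (dot-product scoring, illustrative shapes only): the current decoder state is scored against every encoder state, the scores are normalized with a softmax, and the weighted sum of encoder states becomes the context vector. The local variant instead restricts this to a window of source positions.

```python
# Global attention sketch: score the decoder state against *all* encoder states.
import torch

def global_attention(decoder_state, encoder_states):
    # decoder_state: (hidden,)   encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state      # one alignment score per source word
    weights = torch.softmax(scores, dim=0)       # normalized over all source words
    context = weights @ encoder_states           # context vector for this time step
    return context, weights

h_t = torch.randn(16)                  # current target (decoder) hidden state
h_s = torch.randn(9, 16)               # encoder hidden states for 9 source words
context, align = global_attention(h_t, h_s)
print(context.shape, align.shape)      # torch.Size([16]) torch.Size([9])
```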

💬 NLP/Attention & Transformer

[Paper Review] ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) (ICLR 2021) Abstract While the `Transformer` architecture has become the de facto standard for NLP tasks, its application to Computer Vision remains limited. In Vision, `Attention` is either applied together with Convolutional networks or used to replace certain components of a Convolutional Network while keeping its overall structure intact. The paper shows that this reliance on CNNs is not necessary and that a pure transformer applied directly to sequences of Image patches performs very well on Image classification tasks. With large amounts of da..
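
The "image as a sequence of 16x16 words" step can be sketched in a few lines of PyTorch (illustrative, not the paper's code): the image is split into non-overlapping patches, each patch is flattened, and a linear projection turns it into a token embedding.

```python
# Patch-embedding sketch: a 224x224 image becomes a sequence of 196 patch tokens.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)         # (batch, channels, height, width)
patch = 16

# split into non-overlapping 16x16 patches and flatten each one
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
print(patches.shape)                         # torch.Size([1, 196, 768])

projection = nn.Linear(3 * patch * patch, 768)   # linear patch embedding
tokens = projection(patches)                 # sequence fed to a standard Transformer
```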

Junyeong Son
List of posts in the '💬 NLP' category