TIL | LDA๋ž€? - NLP Topic Modeling


๋ณธ ํฌ์ŠคํŠธ์—์„œ๋Š” NLP ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๊ณต๋ถ€ํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•˜์˜€๋‹ค. Topic Modeling์˜ ๋Œ€ํ‘œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ LDA์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

LSA (Latent Semantic Analysis)

Topic Modeling ๋ถ„์•ผ์— ์•„์ด๋””์–ด๋ฅผ ์ œ๊ณตํ•œ ๊ฑด LSA ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. LDA๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์ „์— ๋จผ์ € LSA ์— ๋Œ€ํ•ด ์ •๋ฆฌํ•˜๊ณ  ๋„˜์–ด๊ฐ€์ž.

LSA๋Š” Topic Modeling ์„ ์œ„ํ•ด ์ตœ์ ํ™”๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์•„๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด์˜ BoW Bag of Words ์— ๊ธฐ๋ฐ˜ํ•œ DTM์ด๋‚˜ TF-IDF ๋ฐฉ๋ฒ•์˜ ๋‹จ์–ด์˜ ๋นˆ๋„์ˆ˜๋งŒ ์ด์šฉํ•˜๊ณ  ์˜๋ฏธ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ๋ชปํ–ˆ๋‹ค๋Š” ํ•œ๊ณ„์ ์„ ๋ณด์™„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ DTM์˜ ์ž ์žฌ๋œ (Latent) ์˜๋ฏธ๋ฅผ ๋ถ„์„ํ•œ๋‹ค๊ณ  ํ•ด์„œ LSA ๋ผ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๋‹ค๋ฅธ ๋ง๋กœ LSI Latent Semantic Indexing ๋ผ๊ณ  ๋ถ€๋ฅด๊ธฐ๋„ ํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.

LSA ๋Š” ๋จผ์ € DTM ์ด๋‚˜ TF-IDF ํ–‰๋ ฌ์— ์ ˆ๋‹จ๋œ SVD (truncated SVD) ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ฐจ์›์„ ์ถ•์†Œ์‹œํ‚ค๊ณ , ๋‹จ์–ด๋“ค์˜ ์ž ์žฌ ์˜๋ฏธ๋ฅผ ๋Œ์–ด๋‚ธ๋‹ค.

Truncated SVD keeps only the top t singular values of the matrix and discards the rest, reducing the dimensionality. Here t corresponds to the number of topics. The resulting document vectors and word vectors can then be used to compute document-document similarity, word-word similarity, and word-document similarity.
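
A rough sketch of what those document and word vectors look like in code (the toy corpus and the choice of t = 2 below are made-up assumptions, only for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus, made up for illustration
docs = [
    "gene dna genetic sequence",
    "dna genetic mutation gene",
    "stock market trading price",
    "market price stock investor",
]
X = TfidfVectorizer().fit_transform(docs)    # TF-IDF matrix, shape (num_docs, num_words)

t = 2                                        # number of topics to keep
svd = TruncatedSVD(n_components=t, random_state=0)
doc_vecs = svd.fit_transform(X)              # document vectors in topic space, shape (num_docs, t)
word_vecs = svd.components_.T                # word vectors in topic space, shape (num_words, t)

# similarities computed in the reduced (latent) space
print(cosine_similarity(doc_vecs))           # document-document similarity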

LSA ๋ฅผ ์ด์šฉํ•˜๋ฉด ์‰ฝ๊ณ  ๋น ๋ฅด๊ฒŒ ๊ตฌํ˜„์ด ๊ฐ€๋Šฅํ•˜๋ฉฐ, ๋‹จ์–ด์˜ ์ž ์žฌ ์˜๋ฏธ๋ฅผ ์ด๋Œ์–ด๋‚ผ ์ˆ˜ ์žˆ์–ด์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ SVD์˜ ํŠน์„ฑ์ƒ ์ด๋ฏธ ๊ณ„์‚ฐ๋œ LSA ์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ค๋ฉด ์ผ๋ฐ˜์ ์œผ๋กœ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋‹ค์‹œ ๊ณ„์‚ฐํ•ด์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์— LSA ๋Œ€์‹  Word2Vec ๋“ฑ ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ๋ฒกํ„ฐํ™”ํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๋ฐฉ๋ฒ•๋ก ์ด ์ฃผ๋ชฉ๋ฐ›๊ณ  ์žˆ๋‹ค.

LSA ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ† ํ”ฝ ๋ชจ๋ธ๋ง ์‹ค์Šต๋„ ํ•ด๋ณด์ž. ์‹ค์Šต ์ฝ”๋“œ๋Š” ์•„๋ž˜์— ์ ์–ด๋†“์•˜๋‹ค.


LDA (Latent Dirichlet Allocation)

LDA ๋Š” ์ฃผ์–ด์ง„ ๋ฌธ์„œ์— ๋Œ€ํ•˜์—ฌ ๊ฐ ๋ฌธ์„œ์— ์–ด๋–ค ์ฃผ์ œ๋“ค์ด ์กด์žฌํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ํ™•๋ฅ ๋ชจํ˜•์œผ๋กœ, ํ† ํ”ฝ ๋ชจ๋ธ๋ง์˜ ๋Œ€ํ‘œ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. ๋Œ€๋žต์ ์ธ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

LDA ๋Š” ๋‹ค์Œ์˜ ์ƒํ™ฉ์„ ๊ฐ€์ •ํ•œ๋‹ค.

  • ๋ฌธ์„œ๋“ค์€ ํ† ํ”ฝ๋“ค์˜ ํ˜ผํ•ฉ์œผ๋กœ ๊ตฌ์„ฑ
  • ํ† ํ”ฝ๋“ค์€ ํ™•๋ฅ  ๋ถ„ํฌ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑ

LDA ๋Š” ํŠน์ • ํ† ํ”ฝ์— ํŠน์ • ๋‹จ์–ด๊ฐ€ ๋‚˜ํƒ€๋‚  ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•ด์ค€๋‹ค. ์œ„์˜ ๊ทธ๋ฆผ์„ ์˜ˆ์‹œ๋กœ ๋“ค์–ด๋ณด์ž๋ฉด, ๋…ธ๋ž€์ƒ‰ ํ† ํ”ฝ์€ gene, dna, genetic ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์ด ๋†’์€ ๊ฑธ๋กœ ๋ณด์•„ ์œ ์ „์ž ๊ด€๋ จ ์ฃผ์ œ์ผ ๊ฒƒ์ด๋‹ค. ํ•œํŽธ, ๋ฌธ์„œ๋ฅผ ๋ณด๋ฉด ๋นจ๊ฐ„์ƒ‰, ํŒŒ๋ž€์ƒ‰ ํ† ํ”ฝ์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด๋ณด๋‹ค ๋…ธ๋ž€์ƒ‰ ํ† ํ”ฝ์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด๊ฐ€ ๋” ๋งŽ์€ ๊ฑธ๋กœ ๋ณด์•„ ๋…ธ๋ž€์ƒ‰ ํ† ํ”ฝ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์„ ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฐ์‹์œผ๋กœ LDA๋ฅผ ์ด์šฉํ•ด ๋ฌธ์„œ์˜ ํ† ํ”ฝ์„ ์ถ”์ถœํ•ด๋‚ธ๋‹ค.

โš™๏ธ LDA ์ˆ˜ํ–‰ ๊ณผ์ •

1๏ธโƒฃ ์‚ฌ์šฉ์ž๊ฐ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๊ฒŒ ํ† ํ”ฝ์˜ ๊ฐœ์ˆ˜ k ๋ฅผ ์ง€์ •ํ•ด์ค€๋‹ค.

2๏ธโƒฃ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ k ๊ฐœ ์ค‘ ํ•˜๋‚˜์˜ ํ† ํ”ฝ์— ํ• ๋‹นํ•œ๋‹ค.

3๏ธโƒฃ ๋ชจ๋“  ๋ฌธ์„œ์˜ ๋ชจ๋“  ๋‹จ์–ด์— ๋Œ€ํ•˜์—ฌ ๋‹ค์Œ ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•œ๋‹ค.
์–ด๋–ค ๋ฌธ์„œ์—์„œ ๊ฐ ๋‹จ์–ด w ๊ฐ€ ์ž˜๋ชป๋œ ํ† ํ”ฝ์— ํ• ๋‹น, ๋‚˜๋จธ์ง€ ๋‹จ์–ด๋Š” ๋ชจ๋‘ ์˜ฌ๋ฐ”๋ฅธ ํ† ํ”ฝ์— ํ• ๋‹น๋˜์–ด์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์—ฌ ๋‹ค์Œ์˜ 2๊ฐ€์ง€ ๊ธฐ์ค€์— ๋”ฐ๋ผ ์žฌํ• ๋‹น๋œ๋‹ค.

  • p(topic t | document d) : the proportion of words in document d that are currently assigned to topic t
  • p(word w | topic t) : the distribution of word w within topic t, i.e. how often w is assigned to t across all documents
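
A minimal sketch of this reassignment step in the spirit of collapsed Gibbs sampling (the count arrays, alpha, beta, and all numbers below are illustrative assumptions, not the exact update of any particular library):

import numpy as np

def reassign_word(doc_topic_counts, word_topic_counts, topic_totals,
                  vocab_size, alpha=0.1, beta=0.01):
    """Sample a new topic for one word occurrence from the current count statistics.

    doc_topic_counts : (k,) words in this document currently assigned to each topic
    word_topic_counts: (k,) times this word is assigned to each topic, over all docs
    topic_totals     : (k,) total words assigned to each topic, over all docs
    For simplicity the current assignment of the word is not excluded from the counts.
    """
    p_topic_in_doc = doc_topic_counts + alpha                                          # ~ p(topic t | document d)
    p_word_in_topic = (word_topic_counts + beta) / (topic_totals + beta * vocab_size)  # ~ p(word w | topic t)
    p = p_topic_in_doc * p_word_in_topic
    p = p / p.sum()
    return np.random.choice(len(p), p=p)

# toy example: 3 topics, a 1,000-word vocabulary (all numbers made up)
new_topic = reassign_word(np.array([5, 1, 0]), np.array([30, 2, 1]),
                          np.array([400, 350, 250]), vocab_size=1000)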

LDA ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ† ํ”ฝ ๋ชจ๋ธ๋ง ์‹ค์Šต๋„ ํ•ด๋ณด์ž. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์‹ค์Šต ์ฝ”๋“œ๋Š” ์•„๋ž˜์— ์ ์–ด๋†“์•˜๋‹ค.

๐Ÿค” LSA์™€ LDA์˜ ์ฐจ์ด?

  • LSA๋Š” DTM์„ ์ฐจ์› ์ถ•์†Œํ•˜๊ณ , ์ถ•์†Œ๋œ ์ฐจ์›์—์„œ ๊ทผ์ ‘ ๋‹จ์–ด๋“ค์„ ํ† ํ”ฝ์œผ๋กœ ๋ฌถ๋Š”๋‹ค.
  • LDA๋Š” ๋‹จ์–ด๊ฐ€ ํŠน์ • ํ† ํ”ฝ์— ์กด์žฌํ•  ํ™•๋ฅ ๊ณผ ๋ฌธ์„œ์— ํŠน์ • ํ† ํ”ฝ์ด ์กด์žฌํ•  ํ™•๋ฅ ์„ ๊ฒฐํ•ฉํ™•๋ฅ ๋กœ ์ถ”์ •ํ•˜์—ฌ ํ† ํ”ฝ์„ ์ถ”์ถœํ•œ๋‹ค.



๐Ÿ’ป ์ฝ”๋“œ ์‹ค์Šต

- LSA

scikit-learn์˜ Twenty Newsgroups ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด LSA ์‹ค์Šต์„ ์ง„ํ–‰ํ•ด๋ณด์ž.

ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์€ 20๊ฐœ์˜ ๋‹ค๋ฅธ ์ฃผ์ œ๋ฅผ ๊ฐ€์ง„ ๋‰ด์Šค๊ทธ๋ฃน ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๊ณ , ์ด๋ฅผ ์ด์šฉํ•ด ๋ฌธ์„œ๋ฅผ ์›ํ•˜๋Š” ํ† ํ”ฝ์˜ ์ˆ˜๋กœ ์••์ถ•ํ•˜์—ฌ ๊ฐ ํ† ํ”ฝ ๋‹น ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋‹จ์–ด 5๊ฐœ๋ฅผ ์ถ”์ถœํ•  ๊ฒƒ์ด๋‹ค.

์ฐธ๊ณ  : Wikidocs : ์ž ์žฌ ์˜๋ฏธ ๋ถ„์„

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

dataset = fetch_20newsgroups(shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
# newsgroup documents
documents = dataset.data

# category names
dataset.target_names

news_df = pd.DataFrame({'document': documents})
# remove special characters (keep alphabetic characters only)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True)
# drop words of length 3 or less (remove very short words)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
# lowercase every word
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

# load the English stopword list from NLTK
nltk.download('stopwords')
stop_words = stopwords.words('english')

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# detokenization (join the tokens back into one string per document)
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc

# TF-IDF matrix, keeping only the top 1,000 words
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000,
                             max_df=0.5, smooth_idf=True)
X = vectorizer.fit_transform(news_df['clean_doc'])

# topic modeling: keep the top 20 singular values (20 topics)
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)

# number of topics
len(svd_model.components_)
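
The goal stated above was to extract the top 5 words per topic, but the snippet stops at counting the topics. The helper below fills that gap; it is the same get_topics idea used in the LDA practice further down, applied to the SVD components (get_feature_names_out requires scikit-learn 1.0+, older versions use get_feature_names):

terms = vectorizer.get_feature_names_out()   # the 1,000 vocabulary terms

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx + 1),
              [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(svd_model.components_, terms)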


- LDA

์ด๋ฒˆ์—” ์•ฝ 15๋…„ ๊ฐ„ ๋ฐœํ–‰๋œ ์˜์–ด ๋‰ด์Šค ๊ธฐ์‚ฌ ์ œ๋ชฉ์„ ๋ชจ์•„๋†“์€ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•˜์—ฌ scikit learn์˜ LDA ์‹ค์Šต์„ ํ•ด๋ณด๊ฒ ๋‹ค.

์ฐธ๊ณ  : Wikidocs : ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ ์ž ์žฌ ๋””๋ฆฌํด๋ ˆ ํ• ๋‹น ํ•™์Šต

import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/19.%20Topic%20Modeling/dataset/abcnews-date-text.csv", filename="abcnews-date-text.csv")

data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')  # error_bad_lines=False before pandas 1.3

# keep only the headline text column
text = data[['headline_text']].copy()

# NLTK resources for tokenization, stopwords and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# tokenization
text['headline_text'] = text.apply(lambda row: nltk.word_tokenize(row['headline_text']), axis=1)
# stopword removal
stop_words = stopwords.words('english')
text['headline_text'] = text['headline_text'].apply(lambda x: [word for word in x if word not in stop_words])

# lemmatize verbs (e.g. 3rd person singular -> base form, past tense -> present tense)
text['headline_text'] = text['headline_text'].apply(lambda x: [WordNetLemmatizer().lemmatize(word, pos='v') for word in x])

# drop words of length 3 or less
tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])

# detokenization (undo the tokenization)
detokenized_doc = []
for i in range(len(text)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

# store the joined strings back into text['headline_text']
text['headline_text'] = detokenized_doc

# keep only the top 1,000 words
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
# build the TF-IDF matrix
X = vectorizer.fit_transform(text['headline_text'])

# topic modeling with 10 topics
lda_model = LatentDirichletAllocation(n_components=10, learning_method='online', random_state=777, max_iter=1)
lda_top = lda_model.fit_transform(X)

# vocabulary: the 1,000 retained terms
terms = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_,terms)
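
One caveat before the visualization: the pyLDAvis snippet and the per-document topic code below come from the gensim version of this exercise, so they refer to ldamodel, corpus, and dictionary, which the scikit-learn code above never defines. A minimal sketch of how those objects could be built from the same tokenized_doc, assuming gensim is installed (the passes value is arbitrary):

import gensim
from gensim import corpora

# build a gensim dictionary and bag-of-words corpus from the tokenized headlines
dictionary = corpora.Dictionary(tokenized_doc)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_doc]

# train a gensim LDA model with 10 topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10,
                                           id2word=dictionary, passes=15)
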
# LDA ์‹œ๊ฐํ™”
# pip install pyLDAvis

import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)

๊ฐ ์›๊ณผ์˜ ๊ฑฐ๋ฆฌ๋Š” ๊ฐ ํ† ํ”ฝ๋“ค์ด ์„œ๋กœ ์–ผ๋งˆ๋‚˜ ๋‹ค๋ฅธ์ง€๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ์ฃผ์˜ํ•ด์•ผํ•  ์ ์€ LDA ๋ชจ๋ธ์—์„œ ์ถœ๋ ฅ์„ ํ•˜๋ฉด ํ† ํ”ฝ ๋ฒˆํ˜ธ๊ฐ€ 0๋ถ€ํ„ฐ ๋ถ€์—ฌ๋˜์ง€๋งŒ, ์œ„์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์‹œ๊ฐํ™”๋ฅผ ํ•˜๋ฉด ํ† ํ”ฝ ๋ฒˆํ˜ธ๊ฐ€ 1๋ถ€ํ„ฐ ์‹œ์ž‘๋œ๋‹ค๋Š” ์ ์ด๋‹ค.

# ๋ฌธ์„œ ๋ณ„ ํ† ํ”ฝ ๋ถ„ํฌ ๋ณด๊ธฐ
for i, topic_list in enumerate(ldamodel[corpus]):
    if i==5:
        break
    print(i,'๋ฒˆ์งธ ๋ฌธ์„œ์˜ topic ๋น„์œจ์€',topic_list)
# ๋ฌธ์„œ ๋ณ„ ํ† ํ”ฝ ๋ถ„ํฌ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ๋ณด๊ธฐ
def make_topictable_per_doc(ldamodel, corpus):
    topic_table = pd.DataFrame()

    for i, topic_list in enumerate(ldamodel[corpus]):
        doc = topic_list[0] if ldamodel.per_word_topics else topic_list            
        doc = sorted(doc, key=lambda x: (x[1]), reverse=True)

        # ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด์„œ ๊ฐ๊ฐ ์•„๋ž˜๋ฅผ ์ˆ˜ํ–‰
        for j, (topic_num, prop_topic) in enumerate(doc): 
            if j == 0:  # ๊ฐ€์žฅ ๋น„์ค‘์ด ๋†’์€ ํ† ํ”ฝ
                topic_table = topic_table.append(pd.Series([int(topic_num), round(prop_topic,4), topic_list]), ignore_index=True)
            else:
                break
    return(topic_table)
topictable = make_topictable_per_doc(ldamodel, corpus)
topictable = topictable.reset_index() # ๋ฌธ์„œ ๋ฒˆํ˜ธ์„ ์˜๋ฏธํ•˜๋Š” ์—ด(column)๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ธ๋ฑ์Šค ์—ด์„ ํ•˜๋‚˜ ๋” ๋งŒ๋“ ๋‹ค.
topictable.columns = ['๋ฌธ์„œ ๋ฒˆํ˜ธ', '๊ฐ€์žฅ ๋น„์ค‘์ด ๋†’์€ ํ† ํ”ฝ', '๊ฐ€์žฅ ๋†’์€ ํ† ํ”ฝ์˜ ๋น„์ค‘', '๊ฐ ํ† ํ”ฝ์˜ ๋น„์ค‘']
topictable[:10]

์ฐธ๊ณ 

Wikidocs : ์ž ์žฌ ์˜๋ฏธ ๋ถ„์„

Wikidocs : ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ ์ž ์žฌ ๋””๋ฆฌํด๋ ˆ ํ• ๋‹น ํ•™์Šต

ratsgoโ€™s blog for textmining

Latent Semantic Analysis โ€” Deduce the hidden topic from the document

 
