word2vec

이규승 2022. 5. 26. 15:05

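Data source

https://auto.v.daum.net/v/ENwBGfA3Ky

Premium electric car: BMW i4 eDrive40 test drive

We test-drove the i4, BMW's second-generation battery electric vehicle. Its defining feature is that it electrifies the Gran Coupe body style rather than a sedan. Along with the iX, it gives a sense of how BMW is thinking about battery electric vehicles...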

# Read a web news article, run morphological analysis, then check word similarity with word2vec
import pandas as pd
from konlpy.tag import Okt

# Morphological analyzer
okt = Okt()

with open('news.txt', mode='r', encoding='utf-8') as f:
    lines = f.read().split('\n')
    
# print(lines)
# print(len(lines))

wordDic = {}  # frequency of each noun extracted by morphological analysis, kept as a dict

for line in lines:
    datas = okt.pos(line)  # part-of-speech tagging: list of (token, tag) pairs
    # print(datas)
    for word in datas:
        if word[1] == 'Noun':
            # print(word[0])  # the noun itself
            if word[0] not in wordDic:
                wordDic[word[0]] = 0

            wordDic[word[0]] += 1
# print(wordDic)

keys = sorted(wordDic.items(), key = lambda x:x[1], reverse = True)
print(keys)

wordList = []
countList = []

for word, count in keys[:20]:
    wordList.append(word)
    countList.append(count)
    
df = pd.DataFrame()
df['word'] = wordList
df['count'] = countList
print(df)
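
The noun frequency count above can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration: the `sample` list is made-up data standing in for the `(token, tag)` pairs that `okt.pos` returns, and `count_nouns` is a hypothetical helper name that does not appear in the original script.

```python
from collections import Counter


def count_nouns(tagged_lines):
    """Count tokens tagged 'Noun' across lines of (token, tag) pairs."""
    counter = Counter()
    for tagged in tagged_lines:
        counter.update(token for token, tag in tagged if tag == 'Noun')
    return counter


# Made-up sample standing in for okt.pos() output (assumption, not real analyzer output)
sample = [
    [('전기차', 'Noun'), ('를', 'Josa'), ('시승', 'Noun')],
    [('전기차', 'Noun'), ('배터리', 'Noun')],
]

noun_counts = count_nouns(sample)
top_nouns = noun_counts.most_common(2)  # most frequent first
print(top_nouns)
```

`most_common(n)` returns the `n` most frequent items already sorted, which replaces the manual `sorted(..., reverse=True)` step.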

print('-- Keep tokens of two or more characters and save them to a file ----------------')
results = []
with open('news.txt', mode='r', encoding='utf-8') as f:
    lines = f.read().split('\n')

for line in lines:
    # stem=True returns the dictionary (base) form, e.g. 한가한 -> 한가하다
    datas = okt.pos(line, stem=True)
    imsi = []
    # print(datas)
    for word in datas:
        if word[1] not in ['Number', 'Alpha', 'Foreign', 'Josa', 'Punctuation', 'Determiner', 'Modifier']:
            # word is a (token, tag) tuple, so measure the token, not the tuple
            if len(word[0]) >= 2:
                imsi.append(word[0])
    # print(imsi)
    # Join with spaces: LineSentence splits each line on whitespace
    imsi2 = " ".join(imsi).strip()
    results.append(imsi2)
    
print(results)

fileName = 'news2.txt'
with open(fileName, 'w', encoding='utf-8') as fw:
    fw.write('\n'.join(results))
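
`LineSentence` expects one sentence per line with tokens separated by whitespace, so the way tokens are joined before saving decides what word2vec treats as a word. The sketch below uses only `str.split` to mimic that whitespace tokenization. It approximates the reader and does not call gensim itself, and the token list is made up for illustration.

```python
# Made-up tokens for one line of the article (assumption for illustration)
tokens = ['전기', '배터리', '시승']

joined_without_space = "".join(tokens)
joined_with_space = " ".join(tokens)

# Approximate how a whitespace-delimited reader would tokenize each saved line
words_without_space = joined_without_space.split()
words_with_space = joined_with_space.split()

print(words_without_space)  # the whole line collapses into a single "word"
print(words_with_space)     # each token stays a separate word
```

If lines are saved without spaces, individual words such as '전기' never enter the vocabulary, and looking them up in the trained model would fail.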

print('--- Build dense vectors with word2vec, then check word similarity --- ')
from gensim.models import word2vec

lineObj = word2vec.LineSentence(fileName)

model = word2vec.Word2Vec(sentences=lineObj, vector_size=100, min_count=1, sg=0)
# sg=0: CBOW, predicts the center word from its context words; sg=1: Skip-gram, predicts the context words from the center word

print(model)

print(model.wv.most_similar(positive=['전기']))
print(model.wv.most_similar(positive=['전기'], topn=3))
print(model.wv.most_similar(positive=['전기', '메르세데스'], topn=3))
print(model.wv.most_similar(negative=['전기']))
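
`most_similar` ranks words by cosine similarity between word vectors. The sketch below shows that measure in plain Python on made-up 3-dimensional vectors. The words, the values, and the helper names `cosine_similarity` and `rank_by_similarity` are all hypothetical; real vectors would come from `model.wv`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, vectors, topn=3):
    """Return up to topn (word, similarity) pairs, most similar first, excluding the query."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration; not output from the trained model
toy_vectors = {
    '전기': [1.0, 0.0, 0.0],
    '배터리': [0.9, 0.1, 0.0],
    '시승': [0.0, 1.0, 0.0],
}

ranked = rank_by_similarity('전기', toy_vectors, topn=2)
print(ranked)
```

With a corpus as small as a single news article, the neighbors returned by the real model can change from run to run and may not be meaningful.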
