How do you compute the similarity between two text documents?

mycopycode 2022. 10. 26. 22:39

I'd like to work on an NLP project, in any programming language (though Python would be my preference).

I want to take two documents and determine how similar they are.

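The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.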
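To make the two steps concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. The whitespace tokenizer, the idf variant log(N / df) + 1, and the sample documents are assumptions chosen for illustration; library implementations differ in tokenization and smoothing, so their scores will not match these exactly.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # One simple idf variant; libraries often use smoothed versions.
        vec = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    # Both vectors have unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = ["the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares "the", "cat", "on": above zero
print(cosine(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Because every vector is normalized up front, comparing many documents reduces to plain dot products, which is the same idea the matrix product in the scikit-learn answer relies on.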

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as


from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> corpus = ["I'd like an apple",
...           "An apple a day keeps the doctor away",
...           "Never compare an apple to an orange",
...           "I prefer scikit-learn to Orange",
...           "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T

though Gensim may have more options for this kind of task.

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

Interpreting the Results

From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.

>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse matrix to a NumPy array via .toarray() or .A:

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():

>>> import numpy as np

>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)

>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4

>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'

Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:

>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
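If more than the single best match is needed, the same row can be sorted instead of taking only its argmax. The sketch below is a plain-Python illustration that reuses the dense similarity values printed earlier; `rank_similar` and its `top_k` parameter are hypothetical names, not part of scikit-learn or NumPy.

```python
# Dense similarity matrix copied from the .toarray() output shown earlier.
similarity = [
    [1.0, 0.17668795, 0.27056873, 0.0, 0.0],
    [0.17668795, 1.0, 0.15439436, 0.0, 0.0],
    [0.27056873, 0.15439436, 1.0, 0.19635649, 0.16815247],
    [0.0, 0.0, 0.19635649, 1.0, 0.54499756],
    [0.0, 0.0, 0.16815247, 0.54499756, 1.0],
]

def rank_similar(matrix, doc_idx, top_k=3):
    """Return (index, score) pairs for the top_k documents closest to doc_idx."""
    # Skip the document itself instead of masking the diagonal.
    scores = [(j, s) for j, s in enumerate(matrix[doc_idx]) if j != doc_idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

print(rank_similar(similarity, 4, top_k=2))
# [(3, 0.54499756), (2, 0.16815247)]
```

Index 3 comes first, which agrees with the argmax result above; index 2 is the runner-up.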

Identical to @larsman's answer, but with some preprocessing.

import string

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')  # if necessary (newer NLTK releases may ask for 'punkt_tab')

stemmer = nltk.stem.porter.PorterStemmer()
# Translation table that deletes every punctuation character
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # Rows are L2-normalized, so the dot product is the cosine similarity
    return (tfidf * tfidf.T).toarray()[0, 1]


print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))

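It's an old question, but I found this can be done easily with spaCy. Once the document is read, a simple API, similarity, can be used to find the cosine similarity between the document vectors.

Start by installing the package and downloading the model: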

pip install spacy
python -m spacy download en_core_web_sm

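Then use it like so:

import spacy

# Note: en_core_web_sm ships without static word vectors, so spaCy warns when
# similarity() is called; en_core_web_md or en_core_web_lg usually give more
# meaningful scores.
nlp = spacy.load('en_core_web_sm')
doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716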

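If you are looking for something very accurate, you need a better tool than tf-idf. Universal Sentence Encoder is one of the most accurate options for finding the similarity between two pieces of text. Google provides pretrained models that you can use in your own application without training anything from scratch. First, you have to install TensorFlow and TensorFlow Hub: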

    pip install tensorflow
    pip install tensorflow_hub

The code below converts each text into a fixed-length vector representation; you can then use the dot product to find the similarity between them.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Note: hub.Module, tf.placeholder and tf.Session are TensorFlow 1.x APIs,
# so this code targets TensorFlow 1.x and the tensorflow_hub release of that era.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# sample text
messages = [
    # Smartphones
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    # Inner product of every embedding with every other embedding
    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    heatmap(messages, messages, corr)  # heatmap() is defined in the plotting code below

and the code for plotting:

import matplotlib.pyplot as plt
import numpy as np

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f" % values[i, j],
                           ha="center", va="center", color="w",
                           fontsize=6)

    fig.tight_layout()
    plt.show()

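The result is a heatmap of the pairwise similarity scores.

As you can see, the highest similarity is between each text and itself, followed by the texts that are closest to it in meaning.

IMPORTANT: the first time you run the code it will be slow, because it needs to download the model. If you want to prevent it from downloading the model again and use the local copy instead, create a folder for the cache, add it to the environment variable, and after the first run use that path: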

import os

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir+"/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)

More information: https://tfhub.dev/google/universal-sentence-encoder/2

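Generally, the cosine similarity between two documents is used as a similarity measure. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.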
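The basic concept described here, raw term counts with no idf weighting, fits in a few lines of Python. This is only a sketch of the idea: the whitespace tokenizer and the function names are assumptions, and it is not how Lucene or LingPipe implement it.

```python
import math
from collections import Counter

def term_counts(text):
    # Whitespace tokenization is an assumption; real libraries use proper analyzers.
    return Counter(text.lower().split())

def count_cosine(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Dot product only needs the terms that occur in both documents.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(count_cosine("a little bird", "a little bird chirps"))
print(count_cosine("a little bird", "a big dog barks"))
```

With raw counts, frequent words such as "a" contribute as much as any other term, which is the weakness that inverse document frequency weighting is meant to reduce.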

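Here's a little app to get you started:

import difflib as dl

# Read both files; the with-block closes them afterwards
with open('file') as fa, open('file1') as fb:
    a = fa.read()
    b = fb.read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# Count the words of the first file that have a close match in the second
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))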

For syntactic similarity, there are three easy ways of detecting similarity:

Word2Vec
GloVe
Tfidf or CountVectorizer

For semantic similarity, you can use BERT embeddings, try different word pooling strategies to get a document embedding, and then apply cosine similarity to the document embeddings.
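As a rough illustration of what a pooling strategy does, the sketch below collapses per-token vectors into one document vector by mean pooling and by max pooling, then compares the results with cosine similarity. The tiny three-dimensional token vectors are made-up stand-ins; in practice they would come from a BERT-style encoder and have hundreds of dimensions.

```python
import math

# Hypothetical token embeddings, one row per token (placeholders, not real BERT output).
doc_a_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.8, 0.0, 0.2]]
doc_b_tokens = [[0.8, 0.2, 0.1], [0.6, 0.1, 0.0]]

def mean_pool(token_vectors):
    # Average each dimension over all tokens.
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def max_pool(token_vectors):
    # Keep the largest value seen in each dimension.
    return [max(col) for col in zip(*token_vectors)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

for name, pool in [("mean", mean_pool), ("max", max_pool)]:
    score = cosine(pool(doc_a_tokens), pool(doc_b_tokens))
    print(name, round(score, 4))
```

Both strategies give a fixed-length vector regardless of document length, which is what makes the cosine comparison possible; which one works better is an empirical question for the data at hand.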

A more advanced approach is to use BERTScore to get the similarity.

Research paper link: https://arxiv.org/abs/1904.09675
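The core of BERTScore is a greedy matching between the token embeddings of the two texts: each token is paired with its most similar token on the other side, and the averaged scores give precision, recall and F1. The sketch below shows only that matching step on made-up vectors; it leaves out the contextual encoder, the optional idf weighting and the baseline rescaling, and `greedy_match_f1` is a hypothetical name rather than the API of any package.

```python
import math

def _unit(v):
    # Scale a vector to unit length so dot products become cosines.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def greedy_match_f1(cand_tokens, ref_tokens):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    cand = [_unit(v) for v in cand_tokens]
    ref = [_unit(v) for v in ref_tokens]
    # sim[i][j] = cosine between candidate token i and reference token j
    sim = [[sum(a * b for a, b in zip(c, r)) for r in ref] for c in cand]
    # Precision: best reference match for every candidate token.
    precision = sum(max(row) for row in sim) / len(cand)
    # Recall: best candidate match for every reference token.
    recall = sum(max(col) for col in zip(*sim)) / len(ref)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Placeholder embeddings: the candidate covers one of two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_f1(candidate, reference))
```

In this toy case precision is 1.0 and recall is 0.5, because the single candidate token matches one reference token perfectly and the other not at all.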

If you have a very small dataset and want to find sentence similarity with high accuracy, you can use the Python package below, which uses pretrained BERT models:

pip install similar-sentences

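If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this GitLab project. You can run it as a server, and there is also a pre-built model that you can use easily to measure the similarity of two pieces of text. It is mostly trained for measuring the similarity of two sentences, but you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, a library with various algorithms for measuring the similarity of texts. However, it is also written in Java.

Code example: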

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl 
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");   
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);
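For readers who do not work in Java, the idea behind a word n-gram Jaccard measure can be sketched in Python: collect the set of word trigrams of each text and divide the size of the intersection by the size of the union. This is an independent illustration, not DKPro's implementation, so details such as case handling or the result for texts shorter than n words may differ from what the library returns.

```python
def word_ngrams(tokens, n):
    # Set of all runs of n consecutive tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(tokens_a, tokens_b, n=3):
    """Jaccard overlap between the word n-gram sets of two token lists."""
    a, b = word_ngrams(tokens_a, n), word_ngrams(tokens_b, n)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)

tokens1 = "This is a short example text .".split(" ")
tokens2 = "A short example text could look like that .".split(" ")

print(round(ngram_jaccard(tokens1, tokens2), 4))
# 0.0909
```

The two texts share exactly one trigram ("short example text") out of eleven distinct trigrams, hence 1/11. The comparison here is case-sensitive, so "a short example" and "A short example" count as different.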

For cosine document similarity, try this online service: http://www.scurtu.it/documentSimilarity.html

import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {}
inputDict['doc1'] = 'Document with some text'
inputDict['doc2'] = 'Other document with some text'
# POST data must be bytes in Python 3
params = urllib.parse.urlencode(inputDict).encode('utf-8')
f = urllib.request.urlopen(API_URL, params)
response = f.read()
responseObject = json.loads(response)
print(responseObject)

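I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud's preprocessing to @FredFoo's text corpus and then displays the pairwise similarities that are greater than 0. I ran this code on Windows after first installing Python and pip. pip is installed as part of Python, but you may have to install it explicitly by re-running the installation package, choosing Modify, and then selecting pip. I use the command line to execute my Python code, saved in a file "similarity.py". I had to execute the following commands: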

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)

pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

You can use sentence-transformers for this task.

A simple example from sbert is as follows:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two lists of sentences
sentences1 = ['The cat sits outside']
sentences2 = ['The dog plays in the garden']
#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
#Output the pairs with their score
for i in range(len(sentences1)):
   print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], 
         sentences2[i], cosine_scores[i][i]))

Source URL: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents
