Convert word2vec bin file to text

mycopycode 2022. 8. 21. 19:56

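From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4 GB) is in a binary format that is not useful to me. Tomas Mikolov assures us: "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this too, but all the tutorials I've found are about converting from text, not the other way around.

Can someone suggest modifications to the C code, or instructions for getting gensim to emit text?

I use this code to load the binary model and then save the model to a text file: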

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.

Note:

The code above is for newer versions of gensim. For earlier versions I used this code:

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

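On the word2vec-toolkit mailing list, Thomas Mensink provided an answer in the form of a small C program that converts a .bin file to text. It is a modification of the distance.c file. I replaced the original distance.c with Thomas's code below, rebuilt word2vec (make clean; make), and renamed the compiled distance binary to readbin. Running ./readbin vector.bin then prints a text version of vector.bin.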

//  Copyright 2013 Google Inc. All Rights Reserved.
//
//  Licensed under the Apache License, Version 2.0 (the "License");
//  you may not use this file except in compliance with the License.
//  You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
//  Unless required by applicable law or agreed to in writing, software
//  distributed under the License is distributed on an "AS IS" BASIS,
//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
//  See the License for the specific language governing permissions and
//  limitations under the License.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>

const long long max_size = 2000;         // max length of strings
const long long N = 40;                  // number of closest words that will be shown
const long long max_w = 50;              // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB    %lld  %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }
    printf("\b\b\n");
  }  

  return 0;
}

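I removed the "\b\b" from the final printf in the code above. Note that the read loop inherited from distance.c divides every vector by its length, so the text output contains unit-normalized values rather than the raw values stored in the .bin file.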
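
For readers who, like the asker, are not comfortable with C, the same read loop can be sketched in Python using only the standard library. This is a minimal sketch and not part of Thomas's answer: the function name bin_to_text and the in-memory demo data are made up for illustration. Unlike the C program, it does not normalize the vectors, and it holds only one vector in memory at a time.

```python
import io
import struct


def bin_to_text(src, dst):
    """Copy word2vec vectors from a binary stream to a text stream."""
    # Header line is ASCII text: "<vocab_size> <vector_size>\n".
    header = src.readline().decode("utf-8").split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    dst.write("%d %d\n" % (vocab_size, vector_size))

    fmt = "<%df" % vector_size        # little-endian 4-byte floats
    nbytes = struct.calcsize(fmt)
    for _ in range(vocab_size):
        # The word is every byte up to the separating space.
        word = bytearray()
        while True:
            ch = src.read(1)
            if ch == b"":
                raise EOFError("file ended inside a vocabulary entry")
            if ch == b" ":
                break
            if ch != b"\n":           # skip the newline left after the previous vector
                word.extend(ch)
        values = struct.unpack(fmt, src.read(nbytes))
        dst.write(word.decode("utf-8", errors="replace"))
        dst.write(" " + " ".join("%f" % v for v in values) + "\n")


# Hypothetical two-word model with 3 dimensions, built in memory for the demo.
demo = (b"2 3\n"
        b"</s> " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
        b"in " + struct.pack("<3f", 0.25, 0.0, 1.5) + b"\n")
out = io.StringIO()
bin_to_text(io.BytesIO(demo), out)
print(out.getvalue())
```

For a real file, open the input with open(path, "rb") and the output with open(path, "w", encoding="utf-8"), then pass both to bin_to_text.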

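As an aside, the text file produced by readbin still contained the word column and some unneeded whitespace, neither of which I wanted for numerical calculations. I removed the first text column and the trailing space from each line with these bash commands: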

cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed -i 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt
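
If GNU cut and sed are not available, roughly the same cleanup can be done in Python. This is only a sketch: the helper name strip_word_column and the sample lines are invented for illustration. Like the cut command, it treats every line the same way, including the header line.

```python
def strip_word_column(lines):
    """Yield each line without its first space-separated field or trailing blanks."""
    for line in lines:
        # rstrip() drops the trailing space and newline; partition() splits at the first space.
        _, _, rest = line.rstrip().partition(" ")
        yield rest


# Hypothetical lines in the layout printed by the C program above.
sample = [
    "</s> 0.500000 -1.000000 2.000000 \n",
    "in 0.250000 0.000000 1.500000 \n",
]
cleaned = list(strip_word_column(sample))
print(cleaned)
```

On a real file, iterate over the open input file and write each yielded string plus a newline to the output file.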

The format is IEEE 754 single-precision binary floating-point (binary32): http://en.wikipedia.org/wiki/Single-precision_floating-point_format. The values are stored little-endian.

Let's work through an example.

The first line is a text string: "3000000 300\n" (vocabSize and vecSize; getByte till byte=='\n').
Each following entry starts with the vocab word, followed by 300*4 bytes of float values (4 bytes for each dimension):
```
getByte till byte==32 (space). (60 47 115 62 32 => </s>[space])
```
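Each subsequent group of 4 bytes represents one float.

Next 4 bytes: 0 0 148 58 => 0.001129150390625.

You can follow the Wikipedia link to see how the decoding works. Taking the number above as an example:

(little-endian -> reverse the byte order) 00111010 10010100 00000000 00000000

The first bit is the sign bit => 0 => sign = 1 (a sign bit of 1 would mean sign = -1).
The next 8 bits => 117 => exp = 2^(117-127)
The next 23 bits => pre = 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 0*2^(-4) + 1*2^(-5)

value = sign * exp * (1 + pre) = 1 * 2^(-10) * 1.15625 = 0.001129150390625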
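
The decoding can be checked in Python. The sketch below assumes the four stored bytes are 0 0 148 58, the little-endian encoding of 0.001129150390625. It decodes them once by hand (sign bit, 8 exponent bits, 23 fraction bits plus the implicit leading 1) and once with the standard struct module. The manual path only covers normal numbers, not zeros, subnormals, infinities, or NaN.

```python
import struct

raw = bytes([0, 0, 148, 58])            # the four bytes as they appear in the file

# Manual decoding of a normal binary32 value.
bits = int.from_bytes(raw, "little")    # little-endian: the last byte is the most significant
sign = -1 if bits >> 31 else 1
exponent = (bits >> 23) & 0xFF          # biased exponent field
fraction = (bits & 0x7FFFFF) / 2 ** 23  # the 23 fraction bits as a value in [0, 1)
manual = sign * 2.0 ** (exponent - 127) * (1 + fraction)

# The struct module performs the same conversion in one call.
unpacked = struct.unpack("<f", raw)[0]

print(format(bits, "032b"))
print(exponent, fraction)
print(manual, unpacked)
```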

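You can load the binary file with word2vec and then save a text version like this:

from gensim.models import word2vec

# Older gensim API. save_word2vec_format(..., binary=False) writes plain text;
# model.save() would write gensim's own serialized format instead.
model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('file.txt', binary=False)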

I use gensim to work with GoogleNews-vectors-negative300.bin, and I set the binary=True flag while loading the model.

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)

This seems to work fine.

If you get the error:

ImportError: No module named models.word2vec

it is because of an API update. This works instead:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)

I had a similar problem, but I wanted the output of bin/non-bin (gensim) models as CSV.

Here is code that does this in Python; it assumes you have gensim installed:

https://gist.github.com/dav009/10a742de43246210f3ba

Here is the code I use:

import codecs
from gensim.models import Word2Vec

def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w' , 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector  }
        vector_str = ",".join(vector)
        line = mid + "\t"  + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
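
If the vectors have already been converted to the text format, a CSV can also be produced without gensim. The following is a sketch under that assumption, not the code from the gist: the function name text_to_csv and the sample data are illustrative. It writes one row per word with every vector component in its own column, which differs from the tab-plus-commas layout used above.

```python
import csv
import io


def text_to_csv(src, dst):
    """Write one CSV row per word: the word, then each vector component."""
    writer = csv.writer(dst, lineterminator="\n")
    next(src, None)                 # skip the "<vocab_size> <vector_size>" header line
    for line in src:
        fields = line.split()
        if fields:                  # ignore blank lines
            writer.writerow(fields)


# Hypothetical two-word model in the word2vec text format.
sample = io.StringIO(
    "2 3\n"
    "</s> 0.500000 -1.000000 2.000000\n"
    "in 0.250000 0.000000 1.500000\n"
)
out = io.StringIO()
text_to_csv(sample, out)
print(out.getvalue())
```

Using csv.writer instead of joining strings by hand means a word that itself contains a comma is quoted correctly.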

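convertvec is a small tool for converting vectors between the different formats used by the word2vec library.

Convert vectors from binary to plain text:

./convertvec bin2txt input.bin output.txt

Convert vectors from plain text to binary:

./convertvec txt2bin input.txt output.bin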
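
To illustrate what the text-to-binary direction involves, here is a minimal Python sketch. It is not convertvec's implementation: the function name text_to_bin is hypothetical, and it assumes the layout described earlier (a text header line, then for each word the word, a space, the little-endian 4-byte floats, and a newline).

```python
import io
import struct


def text_to_bin(src, dst):
    """Copy word2vec vectors from a text stream to a binary stream."""
    header = src.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    dst.write(("%d %d\n" % (vocab_size, vector_size)).encode("utf-8"))
    fmt = "<%df" % vector_size      # little-endian 4-byte floats
    for line in src:
        fields = line.split()
        if not fields:              # ignore blank lines
            continue
        values = [float(x) for x in fields[1:]]
        if len(values) != vector_size:
            raise ValueError("wrong number of components for %r" % fields[0])
        dst.write(fields[0].encode("utf-8") + b" ")
        dst.write(struct.pack(fmt, *values))
        dst.write(b"\n")


# Hypothetical two-word model in the text format.
sample = io.StringIO("2 3\n</s> 0.5 -1.0 2.0\nin 0.25 0.0 1.5\n")
out = io.BytesIO()
text_to_bin(sample, out)
print(len(out.getvalue()))
```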

A quick update: there is now an easier way.

If you use the word2vec from https://github.com/dav/word2vec, there is an additional option called -binary, which accepts 1 to generate a binary file or 0 to generate a text file. This example comes from demo-word.sh:

time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

Source URL: https://stackoverflow.com/questions/27324292/convert-word2vec-bin-file-to-text
