mkbingram.exe

C:\>mkbingram.exe
mkbingram: convert ARPA format N-gram to binary format for Julius

Usage: mkbingram.exe [options...] outfile

    options:
    -nlr file       forward  N-gram in ARPA format
    -nrl file       backward N-gram in ARPA format
    -d bingramfile  Julius binary N-gram file input
    -c from to      convert character code
    -swap           swap "<s>" and "</s>"

      When both "-nlr" and "-nrl" are specified,
      Julius will use the BACKWARD N-gram as main LM
      and use the forward 2-gram only at the 1st pass

Library configuration: version 4.4.2
 - Language Model
    class N-gram support    : yes
    MBR weight support      : no
    word id unit            : short (2 bytes)
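
The note in the usage text about giving both "-nlr" and "-nrl" can be illustrated by assembling the argument list in Python. This is only a sketch: the helper name build_ngram_cmd and the ARPA file names are hypothetical, and it relies only on the options printed in the usage text above. It builds the command and does not run it.

```python
def build_ngram_cmd(outfile, forward=None, backward=None, exe="mkbingram.exe"):
    """Assemble a mkbingram argument list from the options in the usage text.

    forward  -> passed with -nlr (forward N-gram in ARPA format)
    backward -> passed with -nrl (backward N-gram in ARPA format)
    """
    if forward is None and backward is None:
        raise ValueError("give a forward (-nlr) or a backward (-nrl) ARPA file")
    cmd = [exe]
    if forward is not None:
        cmd += ["-nlr", forward]
    if backward is not None:
        cmd += ["-nrl", backward]
    # Per the usage line, the output file comes after the options.
    cmd.append(outfile)
    return cmd


# Hypothetical file names: a forward 2-gram plus a backward 3-gram.
print(" ".join(build_ngram_cmd("out.bingram", forward="fwd2.arpa", backward="bwd3.arpa")))
# prints: mkbingram.exe -nlr fwd2.arpa -nrl bwd3.arpa out.bingram
```

With both files given, the usage text says Julius uses the backward N-gram as the main LM and the forward 2-gram only at the 1st pass.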

Character encoding conversion

To convert character encodings, compile the source code of Rev.4.2.3 or later; on Windows, compile it with the HAVE_WINNLS identifier defined.

C:\>mkbingram.exe -d csj.bingram -c sjis utf-8 csj_utf8.bingram
bingram: csj.bingram

START LOADING

Stat: init_ngram: reading in binary n-gram from csj.bingram
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<UNK>"
Stat: init_ngram: finished reading n-gram
 N-gram info:
                    spec = 3-gram, backward (right-to-left)
                OOV word = <UNK>(id=0)
            wordset size = 38608
          1-gram entries =      38608  (  0.3 MB)
          2-gram entries =     310578  (  3.5 MB) (51% are valid contexts)
          3-gram entries =     626673  (  4.4 MB)
        LR 2-gram entries=     310578  (  1.3 MB)
                   pass1 = given additional forward 2-gram

Writing in v5 format to "csj_utf8.bingram"...
Stat: ngram_write_bin: wrote 11124114 bytes (10.6 MB)
completed

The way to specify encodings is found in charconv.c: Shift_JIS is written as "sjis" or "sjis-win", and UTF-8 as "utf-8".
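
For repeated conversions, the same command line can be driven from a script. The following is a minimal Python sketch, not part of Julius: the helper names convert_cmd and run_convert are hypothetical, and it assumes mkbingram.exe is on the PATH and was built with encoding conversion enabled as described above. The options and encoding names are the ones used in the example run.

```python
import shutil
import subprocess


def convert_cmd(src, dst, from_code="sjis", to_code="utf-8", exe="mkbingram.exe"):
    """Build the argument list for: mkbingram -d src -c from to dst."""
    return [exe, "-d", src, "-c", from_code, to_code, dst]


def run_convert(src, dst, **kwargs):
    """Run the conversion; raise if the executable is missing or exits non-zero."""
    cmd = convert_cmd(src, dst, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " was not found on PATH")
    subprocess.run(cmd, check=True)


# Same arguments as the example run above.
print(" ".join(convert_cmd("csj.bingram", "csj_utf8.bingram")))
# prints: mkbingram.exe -d csj.bingram -c sjis utf-8 csj_utf8.bingram
```

Calling run_convert("csj.bingram", "csj_utf8.bingram") would then execute the command. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.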