C:\>mkbingram.exe
mkbingram: convert ARPA format N-gram to binary format for Julius
Usage: mkbingram.exe [options...] outfile
options:
    -nlr file       forward N-gram in ARPA format
    -nrl file       backward N-gram in ARPA format
    -d bingramfile  Julius binary N-gram file input
    -c from to      convert character code
    -swap           swap "<s>" and "</s>"

When both "-nlr" and "-nrl" are specified,
Julius will use the BACKWARD N-gram as main LM
and use the forward 2-gram only at the 1st pass

Library configuration: version 4.4.2
 - Language Model
    class N-gram support : yes
    MBR weight support   : no
    word id unit         : short (2 bytes)
To convert the character encoding, build mkbingram from the Rev.4.2.3 or later source code; on Windows, compile it with the HAVE_WINNLS identifier defined.
C:\>mkbingram.exe -d csj.bingram -c sjis utf-8 csj_utf8.bingram
bingram: csj.bingram
START LOADING
Stat: init_ngram: reading in binary n-gram from csj.bingram
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<UNK>"
Stat: init_ngram: finished reading n-gram
N-gram info:
    spec = 3-gram, backward (right-to-left)
    OOV word = <UNK>(id=0)
    wordset size = 38608
    1-gram entries = 38608 ( 0.3 MB)
    2-gram entries = 310578 ( 3.5 MB) (51% are valid contexts)
    3-gram entries = 626673 ( 4.4 MB)
    LR 2-gram entries= 310578 ( 1.3 MB)
    pass1 = given additional forward 2-gram
Writing in v5 format to "csj_utf8.bingram"...
Stat: ngram_write_bin: wrote 11124114 bytes (10.6 MB)
completed
The accepted encoding names are defined in charconv.c: Shift_JIS is written as "sjis" or "sjis-win", and UTF-8 as "utf-8".
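
If the conversion has to be repeated for several files, the command line shown above can be assembled from a script. The following is only a minimal sketch in Python: the helper names `build_mkbingram_convert_cmd` and `convert_bingram` are hypothetical, and it assumes that `mkbingram.exe` is on the PATH and was built with character conversion enabled. It uses only the `-d` and `-c from to` options listed in the usage text.

```python
import subprocess


def build_mkbingram_convert_cmd(src, dst, from_enc, to_enc, exe="mkbingram.exe"):
    """Build the argument list for re-encoding a Julius binary N-gram.

    Hypothetical helper: mirrors the invocation
        mkbingram.exe -d <src> -c <from> <to> <dst>
    """
    return [exe, "-d", src, "-c", from_enc, to_enc, dst]


def convert_bingram(src, dst, from_enc="sjis", to_enc="utf-8", exe="mkbingram.exe"):
    """Run mkbingram; raises CalledProcessError if it exits with an error."""
    cmd = build_mkbingram_convert_cmd(src, dst, from_enc, to_enc, exe)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command only; call convert_bingram(...) to actually run it.
    cmd = build_mkbingram_convert_cmd("csj.bingram", "csj_utf8.bingram", "sjis", "utf-8")
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a file path contains spaces.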