C:\>mkbingram.exe
mkbingram: convert ARPA format N-gram to binary format for Julius
Usage: mkbingram.exe [options...] outfile
options:
    -nlr file         forward N-gram in ARPA format
    -nrl file         backward N-gram in ARPA format
    -d bingramfile    Julius binary N-gram file input
    -c from to        convert character code
    -swap             swap "<s>" and "</s>"
When both "-nlr" and "-nrl" are specified,
Julius will use the BACKWARD N-gram as main LM
and use the forward 2-gram only at the 1st pass
Library configuration: version 4.4.2
- Language Model
class N-gram support : yes
MBR weight support : no
word id unit : short (2 bytes)
To convert the character encoding, build mkbingram from Rev. 4.2.3 or later of the source code; on Windows, compile it with the HAVE_WINNLS identifier defined.
C:\>mkbingram.exe -d csj.bingram -c sjis utf-8 csj_utf8.bingram
bingram: csj.bingram
START LOADING
Stat: init_ngram: reading in binary n-gram from csj.bingram
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<UNK>"
Stat: init_ngram: finished reading n-gram
N-gram info:
spec = 3-gram, backward (right-to-left)
OOV word = <UNK>(id=0)
wordset size = 38608
1-gram entries = 38608 ( 0.3 MB)
2-gram entries = 310578 ( 3.5 MB) (51% are valid contexts)
3-gram entries = 626673 ( 4.4 MB)
LR 2-gram entries= 310578 ( 1.3 MB)
pass1 = given additional forward 2-gram
Writing in v5 format to "csj_utf8.bingram"...
Stat: ngram_write_bin: wrote 11124114 bytes (10.6 MB)
completed
The accepted encoding names are defined in charconv.c: Shift_JIS is written as "sjis" or "sjis-win", and UTF-8 as "utf-8".
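When several models need re-encoding, the command line can be assembled by a small script. The following is a minimal sketch, not part of Julius: the function name `build_mkbingram_command` and the allow-list `KNOWN_ENCODINGS` are hypothetical, and the allow-list holds only the three names mentioned above, although charconv.c may accept others. It uses only the `-d` and `-c from to` options shown in the usage text.

```python
import subprocess

# Hypothetical allow-list: only the encoding names mentioned in the text.
# charconv.c may accept more names; extend this set after checking the source.
KNOWN_ENCODINGS = {"sjis", "sjis-win", "utf-8"}


def build_mkbingram_command(src, dst, from_code, to_code, exe="mkbingram.exe"):
    """Return the argument list that re-encodes a binary N-gram file.

    Mirrors the invocation: mkbingram.exe -d SRC -c FROM TO DST
    """
    for code in (from_code, to_code):
        if code not in KNOWN_ENCODINGS:
            raise ValueError(f"encoding name not in the known list: {code!r}")
    if src == dst:
        raise ValueError("input and output files must differ")
    return [exe, "-d", src, "-c", from_code, to_code, dst]


if __name__ == "__main__":
    cmd = build_mkbingram_command("csj.bingram", "csj_utf8.bingram", "sjis", "utf-8")
    print(" ".join(cmd))
    # Assumes mkbingram.exe is on PATH and was built with character conversion enabled.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and `check=True` makes a failed conversion raise an error instead of passing silently.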