The official Moses website actually offers a prebuilt macOS binary package, so you don't have to compile from source. In short, though, the Moses developers no longer use Macs and can't update it, which leaves a bug in the latest release (4.0) that makes the binaries unusable as shipped. The author says, "It's not difficult to compile from source anyway..." But compiling Moses on Big Sur is close to impossible in practice: you hit all kinds of strange errors, and it is a real headache.
In fact, we can fix the problem in the binaries directly and then run them as they are.
Fix the error
Download the Moses binaries directly and run any one of them, and you will hit the following error:
'./moses/bin/consolidate' terminated by signal SIGKILL (Forced quit)
as well as:
dyld: Library not loaded: /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmltok.3.39.dylib
  Referenced from: ~/MosesTest/moses/bin/biconcor
  Reason: image not found
Abort trap: 6
Let's deal with the second error first, since it is the more complicated one. Clearly these binaries are linked against dynamic libraries that don't exist on our machine. That is tricky: the code has already been compiled into binaries, so we can't fix this by changing the source or the compiler flags... or can we?
Analysis
Run otool -L ./moses/bin/consolidate to inspect the consolidate executable, and we get the following:
./moses/bin/consolidate:
	/opt/local/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.6)
	/opt/local/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmltok.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmlparse.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_util++.8.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_util.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server_abyss++.8.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server_abyss.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server++.8.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_abyss.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc++.8.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc.3.39.dylib (compatibility version 0.0.0, current version 0.0.0)
	/Users/hieu/workspace/cmph-2.0/lib/libcmph.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/Users/hieu/workspace/irstlm/irstlm-5.80.08/trunk/lib/libirstlm.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/opt/local/lib/libiconv.2.dylib (compatibility version 9.0.0, current version 9.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
These are all the dynamic libraries this file is linked against. In theory, we only need to change the paths that don't exist into paths that do, and the program will run normally. That takes two steps:
- Find the dynamic libraries it is really supposed to link against;
- Change the library paths recorded in the binary.
For the second step we can use install_name_tool, a command that ships with Xcode. For the first step we just need to get hold of xmlrpc-c, preferably version 1.39.07 (I tried installing the latest version directly with brew, but it is so new that two of the dynamic libraries have been removed outright, so it is better to use the same version and be sure the API is unchanged). Fortunately xmlrpc-c keeps an official archive of old releases, so we can download the 1.39.07 source from here and compile it.
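The first step, working out which linked libraries are actually missing, can be sketched as a small filter over the otool -L output. This is only an illustration: missing_libs is a name made up here, and it skips /usr/lib/ and /System/ entries on the assumption that macOS provides those libraries itself even when no file is visible at that path.

```shell
#!/bin/sh
# missing_libs: hypothetical helper, not part of Moses or Xcode.
# Reads `otool -L` output on stdin and prints every linked library path
# that does not exist on this machine.
missing_libs() {
  # Line 1 is the binary's own name; each remaining line looks like
  #   <tab>/path/to/lib.dylib (compatibility version X, current version Y)
  sed -e 1d -e 's/^[[:space:]]*//' -e 's/ (compatibility version.*$//' |
  while IFS= read -r lib; do
    [ -n "$lib" ] || continue
    case "$lib" in
      /usr/lib/*|/System/*) continue ;;   # assumed to be provided by macOS itself
    esac
    [ -e "$lib" ] || printf '%s\n' "$lib"
  done
}

# Usage on macOS:
#   otool -L ./moses/bin/consolidate | missing_libs
```

Every path it prints is a candidate for install_name_tool -change.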
Fix the error
Compiling xmlrpc-c requires gcc-10. If you haven't installed it, a single brew install gcc will do. Then compile and install:
./configure --prefix=/usr/local/lib/xmlrpc/
make
make install
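You can install it wherever you like, but remember the path, because we will need it later. Check where the .dylib files actually end up (an install with --prefix typically puts them in a lib subdirectory of the prefix), since that is the directory the following commands must point to.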
The next step is to modify the link address:
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmltok.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_xmltok.3.39.dylib ./moses/bin/consolidate
Take libxmlrpc_xmltok.3.39.dylib as an example: a replacement like this points it at a dynamic library that really exists. But every binary has 11 wrong links to replace, which is tedious to do by hand, so I wrote a simple script. Copy it into a .sh file and run it as sh xxx.sh ./moses/bin; it processes every executable in the given directory:
#!/bin/bash
# Rewrite the hard-coded library paths in the binary given as $1.
updateLink() {
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmltok.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_xmltok.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_xmlparse.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_xmlparse.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_util++.8.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_util++.8.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_util.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_util.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server_abyss++.8.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_server_abyss++.8.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server_abyss.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_server_abyss.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server++.8.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_server++.8.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_server.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_server.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc_abyss.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc_abyss.3.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc++.8.39.dylib /usr/local/lib/xmlrpc/libxmlrpc++.8.39.dylib "${1}"
install_name_tool -change /Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.39.07/lib/libxmlrpc.3.39.dylib /usr/local/lib/xmlrpc/libxmlrpc.3.39.dylib "${1}"
# libirstlm.0.dylib is not available, so it is pointed at a library that does exist.
install_name_tool -change /Users/hieu/workspace/irstlm/irstlm-5.80.08/trunk/lib/libirstlm.0.dylib /usr/local/lib/xmlrpc/libxmlrpc.3.39.dylib "${1}"
}

# Process every executable regular file in the directory given as $1.
for file in "${1}"/*
do
[ -f "$file" ] && [ -x "$file" ] || continue
echo "processing $file ..."
updateLink "$file"
done
Once the binaries have been repaired this way, Moses can be used normally.
Now for the first error. It comes from Gatekeeper, a security mechanism in recent versions of macOS that by default stops you from running any unsigned binary, and obviously none of the Moses binaries are signed... Still, since we have just modified these binaries with a script, the system now treats them as files we produced ourselves, so it no longer blocks them. If you do run into it anyway, open the directory in Finder, right-click the binary and choose "Open". The system then runs the binary in its own Terminal window; after that, go back to your terminal and run the command again, and it will work (you have to do this once for each new command, but fortunately only the first time).
Note that there is one library, libirstlm.0.dylib, that we never actually built. That doesn't matter, because we don't use it: just point that dependency at any path that can be found. As long as you don't use that part of the functionality, it should have no effect in theory.
Prepare data
Here we run the experiment on the United Nations Parallel Corpus. Because the full data set is too large, we only take 600,000 lines of Chinese and English. The download comes as a tar.gz archive split into parts; we merge the parts with cat UNv1.0.en-zh.tar.gz.* >>a.en-zh.tar.gz and then unpack the result.
We name the extracted data en60w.txt and zh60w.txt. Both files contain roughly one sentence per line, in the same format and with the same content line for line; the only difference is the language.
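Cutting out the first 600,000 lines can be done with head. The sketch below also checks that both sides still have the same number of lines, since the two files must stay aligned line for line. take_pair is a made-up helper name, and the input file names in the usage comment are placeholders for whatever the unpacked archive contains.

```shell
#!/bin/sh
# take_pair: hypothetical helper that keeps the first N lines of both sides
# of a parallel corpus and fails if the two outputs differ in length.
take_pair() {
  src=$1; tgt=$2; n=$3; out_src=$4; out_tgt=$5
  head -n "$n" "$src" > "$out_src"
  head -n "$n" "$tgt" > "$out_tgt"
  # `wc -l < file` prints only the count; $(( )) strips any padding.
  a=$(( $(wc -l < "$out_src") ))
  b=$(( $(wc -l < "$out_tgt") ))
  if [ "$a" -ne "$b" ]; then
    echo "line counts differ: $a vs $b" >&2
    return 1
  fi
  echo "$a"
}

# Usage (input names are placeholders):
#   take_pair corpus.en corpus.zh 600000 en60w.txt zh60w.txt
```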
Word segmentation
We need to split the data into words. Note that the English corpus needs this too (tokenization): it separates punctuation from English words and numbers, which makes the later steps easier.
For Chinese word segmentation we use jieba:
python3 -m jieba -d " " zh60w.txt > zh60w.tok.txt
Note the -d " " parameter here, which changes jieba's default delimiter from a slash to a space. This gives us the segmented result, zh60w.tok.txt, which is the name the later commands expect.
For English tokenization, use the tool that comes with Moses:
./moses/scripts/tokenizer/tokenizer.perl -l en -lines 20000 -time -threads 6 < en60w.txt > en60w.tok.txt
Here I used -time to report the total time taken at the end, -threads 6 to enable multithreading for speed, and -lines 20000 to make each thread process 20,000 lines at a time (the default is 2000). This gives us the English result, en60w.tok.txt.
At this point en60w.txt and zh60w.txt can be deleted.
Handle case
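Next we normalize case in the English data. Truecasing puts words such as sentence-initial ones into their most common form rather than simply lowercasing everything, which reduces data sparsity and helps translation. We first need to train a truecase model, then use it to process the corpus quickly: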
./moses/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus en60w.tok.txt
./moses/scripts/recaser/train-truecaser.perl --model truecase-model.cn --corpus zh60w.tok.txt
This gives us two models, truecase-model.en and truecase-model.cn, which we then apply to the segmented corpus:
./moses/scripts/recaser/truecase.perl --model truecase-model.en < en60w.tok.txt > en-zh60w.true.en
./moses/scripts/recaser/truecase.perl --model truecase-model.cn < zh60w.tok.txt > en-zh60w.true.cn
This gives us the processed en-zh60w.true.en and en-zh60w.true.cn. Note that from here on the file names follow a fixed pattern, because later commands rely on it.
At this point en60w.tok.txt and zh60w.tok.txt can be deleted.
Remove long sentences
Finally, let's trim the corpus once more, because sentences that are too long noticeably slow down training and hurt the final accuracy:
./moses/scripts/training/clean-corpus-n.perl en-zh60w.true cn en en-zh60w.clean 1 50
This gives us the two cleaned corpus files, en-zh60w.clean.cn and en-zh60w.clean.en.
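To make the two trailing numbers concrete: 1 and 50 are the minimum and maximum sentence length in tokens, and a sentence pair is kept only if both sides fall inside that range. A rough sketch of that filtering idea with paste and awk follows. It is an illustration only, not a replacement for clean-corpus-n.perl, which applies further checks of its own; clean_pairs is a made-up name, and the sketch assumes no line contains a tab.

```shell
#!/bin/sh
# clean_pairs: hypothetical illustration of length-based filtering of a
# parallel corpus. Assumes no line contains a tab character.
clean_pairs() {
  src=$1; tgt=$2; min=$3; max=$4; out_src=$5; out_tgt=$6
  : > "$out_src"; : > "$out_tgt"     # make sure both outputs exist
  paste "$src" "$tgt" |
  awk -F '\t' -v min="$min" -v max="$max" -v o1="$out_src" -v o2="$out_tgt" '
    {
      n1 = split($1, a, " ")         # token count, source side
      n2 = split($2, b, " ")         # token count, target side
      if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
        print $1 > o1                # keep the pair: same line in both outputs
        print $2 > o2
      }
    }'
}

# Usage:
#   clean_pairs en-zh60w.true.cn en-zh60w.true.en 1 50 kept.cn kept.en
```

Because a pair is always dropped or kept as a whole, the two output files stay aligned.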
Build the language model
The language model is used to ensure that the translated content is fluent and readable:
./moses/bin/lmplz -o 3 < en-zh60w.true.cn > en-zh60w.arpa.cn
Then convert the generated model to binary form to speed up queries:
./moses/bin/build_binary en-zh60w.arpa.cn en-zh60w.blm.cn
This gives us the model file en-zh60w.blm.cn; en-zh60w.arpa.cn can now be deleted.
Test the model with echo "我 爱 北京 天安门" | ./moses/bin/query en-zh60w.blm.cn, which produces:
我=23055 2 -2.4262204	爱=3881 1 -6.498771	北京=14065 1 -4.4601955	天安门=33807 1 -6.3538356	</s>=2 1 -2.6495044	Total: -22.388527 OOV: 0
Perplexity including OOVs:	30040.377320675794
Perplexity excluding OOVs:	30040.377320675794
OOVs:	0
Tokens:	5
RSSMax:166760448 kB	user:0.002757	sys:0.098377	CPU:0.101134	real:0.092774
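The numbers in this output are related in a simple way. The last value of each token entry looks like a log10 probability, Total is their sum, and the perplexity is 10 raised to the negative Total divided by the token count. A small sketch to check that reading against the output above (perplexity is a made-up function name):

```shell
#!/bin/sh
# perplexity: prints 10^(-total/tokens), where total is the summed log10
# probability from the "Total:" field and tokens is the "Tokens:" count.
perplexity() {
  awk -v total="$1" -v tokens="$2" \
    'BEGIN { printf "%.2f\n", exp(-total / tokens * log(10)) }'
}

# Usage with the values from the output above:
#   perplexity -22.388527 5
```

With the rounded Total from the output, the result should land very close to the reported perplexity of about 30040.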
At this point en-zh60w.true.en and en-zh60w.true.cn can be deleted.
Training the translation model
Now everything is ready and we can start training the translation model:
mkdir working
cd working
We create a separate directory first and run the training command inside it:
../moses/scripts/training/train-model.perl -root-dir train -corpus ../en-zh60w.clean -f en -e cn -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/[an absolute path is required here]/Downloads/MosesTest/lm/en-zh60w.blm.cn:8 -external-bin-dir ../bin/training-tools -cores 6 -mgiza -mgiza-cpus 6
Note that -mgiza -mgiza-cpus 6 must be included here, because the macOS package of Moses only ships the mgiza tool. Without these options the script falls back to a single-threaded tool by default, which ends in a command-not-found error. -mgiza-cpus 6 means I want to train with 6 threads.
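Since the -lm value needs an absolute path, it can be convenient to build that value in the shell rather than type the path by hand. Going by the command above, the value has four colon-separated fields (0, 3, the model path and 8), which I take to be factor, n-gram order, path and LM type. lm_arg below is a made-up helper name and only a sketch.

```shell
#!/bin/sh
# lm_arg: hypothetical helper that builds the -lm value as
# factor:order:absolute-path:type from a possibly relative model path.
lm_arg() {
  factor=$1; order=$2; model=$3; type=$4
  # Resolve the model's directory to an absolute path.
  dir=$(cd "$(dirname "$model")" && pwd) || return 1
  printf '%s:%s:%s/%s:%s\n' "$factor" "$order" "$dir" "$(basename "$model")" "$type"
}

# Usage:
#   ... -lm "$(lm_arg 0 3 ../lm/en-zh60w.blm.cn 8)" ...
```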
Tuning parameters
Training is finished, but the model weights are all still at their defaults, which is not optimal, so we need to tune them.
First take another 100,000 lines from the parallel corpus downloaded at the start. Here I took lines 600,000 to 700,000 and saved them as the two small corpora opt.en and opt.cn. We will use these 100,000 lines as the development set for tuning.
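Cutting out a range of lines like this can be done with sed. The sketch below assumes the tuning set is lines 600,001 through 700,000, so that it does not overlap the first 600,000 lines used for training; slice_lines is a made-up name and the input file names are placeholders.

```shell
#!/bin/sh
# slice_lines: hypothetical helper that prints lines START..END (1-based,
# inclusive) of a file and stops reading once END has been printed.
slice_lines() {
  start=$1; end=$2; file=$3
  sed -n "${start},${end}p;${end}q" "$file"
}

# Usage (input names are placeholders):
#   slice_lines 600001 700000 corpus.en > opt.en
#   slice_lines 600001 700000 corpus.zh > opt.cn
```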
Word segmentation and handling case
The steps are much the same as before. Process the data:
python3 -m jieba -d " " opt.cn > opt.tok.cn
./moses/scripts/tokenizer/tokenizer.perl -l en -lines 2000000 -time < opt.en > opt.tok.en

./moses/scripts/recaser/truecase.perl --model truecase-model.cn < opt.tok.cn > opt.true.cn
./moses/scripts/recaser/truecase.perl --model truecase-model.en < opt.tok.en > opt.true.en

rm opt.en opt.cn
Now we have opt.true.cn and opt.true.en, which can be used for tuning.
Tuning
cd working
../moses/scripts/training/mert-moses.pl ../opt.true.en ../opt.true.cn ../moses/bin/moses train/model/moses.ini --mertdir /[absolute path]/moses/bin/ --multi-moses --decoder-flags='-threads 6'
Pay attention to the parameter --mertdir /[absolute path]/moses/bin/: it must be written as an absolute path. The program does accept a relative path, but then the script exported at the end of tuning is not generated correctly and the directories end up wrong.
Here the --multi-moses parameter starts multiple processes, and --decoder-flags='-threads 6' passes the option straight to the decoder, meaning 6 processes are used to speed things up.
Although the parameter is called threads, its format is actually number of processes:threads per process:threads in the extra process. If you give just one number as I did, it means 6:1:0, that is, 6 processes with 1 thread each and no extra process. This is the fastest option and also the one that uses the most memory.
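The following sketch only spells out that format as described in the paragraph above; it has not been checked against the multi-moses source. expand_threads is a made-up name, and it handles just the two forms mentioned here: a bare number and a full three-part value.

```shell
#!/bin/sh
# expand_threads: hypothetical helper that spells out a thread spec using
# the processes:threads-per-process:extra-threads reading described above.
# A bare number N is treated as N:1:0.
expand_threads() {
  case "$1" in
    *:*:*) spec=$1 ;;
    *)     spec="$1:1:0" ;;
  esac
  procs=${spec%%:*}        # text before the first colon
  rest=${spec#*:}          # text after the first colon
  per=${rest%%:*}
  extra=${rest#*:}
  echo "processes=$procs threads_per_process=$per extra_threads=$extra total_threads=$(( procs * per + extra ))"
}

# Usage:
#   expand_threads 6
#   expand_threads 2:3:1
```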
In my measurements, without --multi-moses only 2 threads were actually used even with 6 threads set, and memory use was 4.2GB (this depends on the size of the model, so adjust to your own situation; for example, I have 32GB of memory, so once I knew this footprint I stopped the process and restarted it with 6 processes to speed things up). Tuning is very, very slow. I suggest choosing a sample size that is a multiple of 10, so that you can work out the current progress from the count.
The program first filters the model against the data you are going to test on, removing entries that will certainly not be needed. This greatly speeds up loading without affecting the results. Do not use this specially filtered model in real use, though, and if the test data changes, the filtered model must be regenerated.
Note that tuning keeps iterating and, in my experience, does not stop on its own in any reasonable time. Stop it once the scores have more or less stabilized, and use the best result.
Binary model compression
The generated model is in text form. We can convert it to binary, which greatly speeds up loading in Moses. Create a directory for the binary model: mkdir working/binarised-model
Then use two commands to generate two model files:
../moses/bin/processPhraseTableMin -in train/model/phrase-table.gz -nscores 4 -out binarised-model/phrase-table

../moses/bin/processLexicalTable -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out binarised-model/reordering-table
Copy train/model/moses.ini to binarised-model/moses.ini.
Edit it and find the # feature functions section. In the LexicalReordering entry, set path= to the absolute path of binarised-model/reordering-table. Change the PhraseDictionaryMemory entry to PhraseDictionaryCompact and set its path= to the absolute path of binarised-model/phrase-table.minphr.
Then we can start Moses with ../moses/bin/moses -f binarised-model/moses.ini.
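The same edit can be scripted. The sketch below assumes each feature is a single line that starts with the feature name and carries a path=... field without spaces, which is how the generated moses.ini looked to me; binarise_ini is a made-up name, and the result is worth checking by eye before use.

```shell
#!/bin/sh
# binarise_ini: hypothetical helper that rewrites the two feature lines of
# a moses.ini so that they point at the binarised tables in DIR.
binarise_ini() {
  ini_in=$1; ini_out=$2; dir=$3      # dir should be an absolute path
  awk -v dir="$dir" '
    /^PhraseDictionaryMemory / {
      sub(/^PhraseDictionaryMemory/, "PhraseDictionaryCompact")
      sub(/path=[^ ]*/, "path=" dir "/phrase-table.minphr")
    }
    /^LexicalReordering / {
      sub(/path=[^ ]*/, "path=" dir "/reordering-table")
    }
    { print }                        # all other lines pass through unchanged
  ' "$ini_in" > "$ini_out"
}

# Usage, run from the working directory:
#   binarise_ini train/model/moses.ini binarised-model/moses.ini "$(pwd)/binarised-model"
```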
Batch test
Batch testing also needs a parallel corpus: the English side is translated, and the Chinese side is used to score the result. Tokenize the English with the same tokenizer.perl and segment the Chinese with jieba or a similar segmenter, then run both through truecase.perl.
Prepare the model
As before, we first filter the model against the test set to remove entries that are never used, which greatly speeds up testing without affecting the results:
cd working
../moses/scripts/training/filter-model-given-input.pl filtered mert-work/run4.moses.ini ../opt.true.en -Binarizer ../moses/bin/processPhraseTableMin
Here I simply reused the tuning data for testing.
Batch processing
Use this command to have moses translate everything in one batch:
./moses/bin/moses -f working/filtered/moses.ini -i < opt.true.en > translated.cn 2> test.out
Calculate BLEU
BLEU is a metric for judging the accuracy of translation output; the result is expressed as a percentage:
./moses/scripts/generic/multi-bleu.perl -lc opt.true.cn translated.cn
For the corpus in this example, we get the following result:
BLEU = 26.53, 63.3/32.7/19.4/12.3 (BP=1.000, ratio=1.000, hyp_len=2354049, ref_len=2353369)
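The fields in this line fit the standard BLEU definition. The four numbers after the score are the 1-gram to 4-gram precisions in percent, BP is the brevity penalty (1 here, because the hypothesis is not shorter than the reference), and the score is BP times the geometric mean of the four precisions. A small sketch to recompute it, with bleu as a made-up function name:

```shell
#!/bin/sh
# bleu: brevity penalty times the geometric mean of the four n-gram
# precisions (given in percent), following the standard BLEU definition.
bleu() {
  awk -v bp="$1" -v p1="$2" -v p2="$3" -v p3="$4" -v p4="$5" \
    'BEGIN { printf "%.2f\n", bp * exp((log(p1) + log(p2) + log(p3) + log(p4)) / 4) }'
}

# Usage with the values from the output above:
#   bleu 1.000 63.3 32.7 19.4 12.3
```

Because the precisions in the output are rounded to one decimal place, the recomputed value comes out very close to the reported 26.53 but need not match it to the last digit.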
Original article by LogStudio: R0uter's Blog » Run and train Moses on macOS
When reposting, please keep the source and link: https://www.logcg.com/archives/3487.html