I have recently been working on the lexicon for my input method and implementing a new sentence-level input model. (I will come back to the earlier word-based sentence input model another time.) The new model is built on an HMM (Hidden Markov Model). Of course, with limited funds and only my personal hardware, I can only afford a second-order matrix. Even so, the model still needs training.
That is not to say that training on novels is bad. It is just hard to find novels that cover every field; the areas they cover are too narrow, so they do not really make a high-quality corpus. When it comes to high quality, nothing beats Wikipedia. So here we will take the whole of Chinese Wikipedia and export it as a corpus for model training.
Download Data
There is no need to write a crawler. Wikipedia is open and provides its own download links for dump packages, which is very considerate. The download link is: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
This is the XML data that Wikipedia dumps on a regular schedule. The download is about 1 GB (there really is not much Chinese content). Unpacked, it is an XML file of more than 6 GB, but do not decompress it by hand: there is a handy tool that exports the content straight from the compressed file.
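If you want to look inside the dump without unpacking 6 GB to disk, Python's standard bz2 module can stream it. The following is a minimal sketch; the helper name peek_dump is made up for illustration and is not part of the original workflow.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a .bz2 file, decompressing only as needed."""
    # mode="rt" makes bz2.open yield decoded text lines instead of raw bytes.
    with bz2.open(path, mode="rt", encoding="utf8") as dump:
        return [line.rstrip("\n") for line in islice(dump, n_lines)]
```

Calling peek_dump("./article/zhwiki-latest-pages-articles.xml.bz2") should return the opening lines of the XML, assuming the dump was saved at that path.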
Exporting the Data
Here we use Python to do the export. First, install gensim, the library we will use to read the dump:
    pip3 install gensim
Once it is installed, write a Python script along these lines.
⚠️ Note: this code only works with Python 3.
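    import gensim

    input_file = "./article/zhwiki-latest-pages-articles.xml.bz2"

    # dictionary={} skips the extra pass that would build a vocabulary.
    # lemmatize=False matches the gensim release used here; newer releases may not accept it.
    wiki = gensim.corpora.WikiCorpus(input_file, lemmatize=False, dictionary={})

    # Write one article per line, with tokens separated by spaces.
    with open('./article/zhwiki.txt', encoding='utf8', mode='w') as f:
        for text in wiki.get_texts():
            f.write(' '.join(text) + '\n')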
If you see the warning "UserWarning: Pattern library is not installed, lemmatization won't be available.", just ignore it; we do not use lemmatization here.
On my 2015 13-inch Retina MacBook Pro, the export is very slow. It originally finished in about ten minutes, but it now takes roughly half an hour. The exported text is about 1.09 GB (previously about 950 MB), with one article per line.
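The join step in the export script assumes that get_texts() yields str tokens. Older gensim releases yielded bytes instead, which is where errors such as "expected a bytes-like object, str found" come from when code written for one version meets the other. A small helper that tolerates both is sketched below; join_tokens is a hypothetical name, not a gensim function.

```python
def join_tokens(tokens, sep=" "):
    """Join one article's tokens into a single line.

    Tokens that arrive as bytes are decoded as UTF-8; str tokens pass through.
    """
    return sep.join(
        tok.decode("utf8") if isinstance(tok, bytes) else tok
        for tok in tokens
    )
```

In the export loop, the write would then become f.write(join_tokens(text) + '\n').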
Cleaning Up the Text
The exported text is already too large to open comfortably in an ordinary text editor. It also clearly mixes Simplified and Traditional Chinese, which still has to be dealt with. Here I convert Traditional to Simplified, although the other direction works too. We use OpenCC for the job.
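Since the file is too big for an editor, a quick sanity check can be done by streaming it. This sketch relies only on the one-article-per-line layout of the export; the function name corpus_stats is an assumption made for illustration.

```python
def corpus_stats(path):
    """Stream the corpus once and return (article_count, character_count)."""
    articles = 0
    characters = 0
    with open(path, encoding="utf8") as corpus:
        for line in corpus:  # one article per line
            articles += 1
            characters += len(line.rstrip("\n"))
    return articles, characters
```

Because the file is read line by line, memory use stays small regardless of the corpus size.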
Installing OpenCC
I am on macOS, so a single command installs it: brew install opencc
After installation, you also need to write a configuration file. Put it in the same directory as your corpus:
    {
      "name": "Traditional Chinese to Simplified Chinese",
      "segmentation": {
        "type": "mmseg",
        "dict": {
          "type": "ocd",
          "file": "TSPhrases.ocd"
        }
      },
      "conversion_chain": [{
        "dict": {
          "type": "group",
          "dicts": [{
            "type": "ocd",
            "file": "TSPhrases.ocd"
          }, {
            "type": "ocd",
            "file": "TSCharacters.ocd"
          }]
        }
      }]
    }
Save it as zht2zhs_config.json for later use.
Conversion
Then run this command in the current directory:
    opencc -i zhwiki.txt -o zhswiki.txt -c zht2zhs_config.json
And that is it.
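If you would rather drive the conversion from the same Python script that does the export, the command can be wrapped with subprocess. This is only a sketch: the function names are made up, and the flags are exactly the ones used in the command above.

```python
import shutil
import subprocess


def opencc_command(src, dst, config):
    """Build the opencc command line as an argument list."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def convert(src="zhwiki.txt", dst="zhswiki.txt", config="zht2zhs_config.json"):
    """Run opencc, failing early if the binary is not on PATH."""
    if shutil.which("opencc") is None:
        raise RuntimeError("opencc not found on PATH")
    # check=True raises CalledProcessError if opencc exits with an error.
    subprocess.run(opencc_command(src, dst, config), check=True)
```

Passing the arguments as a list avoids shell quoting problems if the corpus path contains spaces.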
Conclusion
We now have a Chinese Wikipedia corpus in Simplified Chinese, free of punctuation marks and digits. Go and feed it to your model for training.
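As a rough idea of what a first training pass over this corpus could look like, the sketch below counts how often one token follows another. It assumes that the "second-order matrix" mentioned at the start means a table of transitions between adjacent tokens; the actual model behind this article may be structured differently, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def count_transitions(lines):
    """Count, for each token, how often each other token follows it."""
    transitions = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev][nxt] += 1
    return transitions


def transition_probability(transitions, prev, nxt):
    """Estimate P(nxt | prev) from raw counts; 0.0 if prev was never seen."""
    row = transitions.get(prev, Counter())
    total = sum(row.values())
    return row[nxt] / total if total else 0.0
```

Passing an open file object for zhswiki.txt as lines would process one article at a time without loading the whole corpus into memory.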
Further reading
Analysis of Chinese Wikipedia text data, part one (data acquisition and preprocessing)
Word2Vec experiments on the English Wikipedia corpus
Original article by LogStudio: R0uter's Blog » Building a Chinese Wikipedia Corpus
When reposting, please keep the attribution and this link: https://www.logcg.com/archives/2240.html
Hello,
I tried your program, but on line 6:
str_line = bytes.join(b' ', text).decode()
the following error occurred:
sequence item 0: expected a bytes-like object, str found
It looks like a type problem.
I searched online but could not find a relevant answer,
and after a small change it still reports the same error.
How can I solve this?
str_line = text
Could you try changing it to that?
Now it returns this instead:
can only concatenate list (not "str") to list
Emmm, don't worry. I will download a fresh copy of the dump later to help test the code, then update this article and let you know.
Hello, I have corrected the code. They changed the API's behavior: it now yields the text content directly, which is more convenient, at the price of slower processing...
You should now be able to export the text with the code in the article.
Thank you, it really worked!
Thanks for your enthusiastic and patient guidance and help.