Building a Chinese Wikipedia corpus

Lately I have been working on the lexicon for my input method and implementing a new sentence-level input model (I will cover the word-based sentence input model another time). The new model is built on a hidden Markov model (HMM). Of course, with limited funds and only my personal hardware, I can afford nothing beyond a second-order matrix. Even so, the model still needs training.

Of course, training on novels would not be bad in itself. It is just hard to find suitable fiction, and the ground novels cover is too narrow, so they do not make a really high-quality corpus. When it comes to high quality, nothing beats Wikipedia. So here we will take all of Chinese Wikipedia and export it as a corpus for model training.

Download Data

There is no need to write a crawler. Wikipedia is open and provides its own download packages, which is really thoughtful. The download link is: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 .

This is an XML dump that Wikipedia regenerates regularly. The download is only about 1 GB; there really is too little Chinese content. Decompressed, it becomes an XML file of more than 6 GB, but do not bother decompressing it: a ready-made tool can export the content straight from the compressed dump.
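If you would rather fetch the dump from a script than from the browser, a minimal sketch using only the Python standard library could look like this. The function name and the local file name are my own choices for illustration; the download is streamed to disk in chunks and the archive is left compressed.

```python
import shutil
import urllib.request

# URL taken from the post; the local file name used below is only an example.
DUMP_URL = "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1 << 20):
    """Stream `url` to the file `dest` without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)


# Example call (downloads roughly 1 GB, so it is left commented out):
# download(DUMP_URL, "zhwiki-latest-pages-articles.xml.bz2")
```

Any download manager works just as well; the only point is to keep the .bz2 file as it is.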

Exporting the data

Here we use Python to do the export. First, install gensim, the library that provides the Wikipedia export tool.

Once it is installed, write a short Python script to do the export.

⚠️ Note that the code only works with Python 3.

If you see "UserWarning: Pattern library is not installed, lemmatization won't be available.", just ignore it; we do not use lemmatization.
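A minimal sketch of such an export script, assuming gensim's WikiCorpus class; the function names and file names are hypothetical, and this is not necessarily the exact code used for this post. The helper accepts both bytes and str tokens, because what get_texts() yields has changed between gensim releases.

```python
def article_to_line(tokens):
    """Join one article's tokens into a single space-separated line.

    Older gensim releases yield bytes tokens and newer ones yield str,
    so decode where needed instead of assuming one type.
    """
    words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
    return " ".join(words)


def export_corpus(dump_path, out_path):
    """Write every article in the dump to `out_path`, one article per line."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim.corpora import WikiCorpus

    # WikiCorpus reads the .bz2 dump directly. Passing dictionary={} skips
    # the vocabulary-building pass, which a plain text export does not need.
    wiki = WikiCorpus(dump_path, dictionary={})
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():
            out.write(article_to_line(tokens) + "\n")
            count += 1
    return count


# Example call (file names are assumptions, and the run takes a long time):
# export_corpus("zhwiki-latest-pages-articles.xml.bz2", "wiki_texts.txt")
```

The default tokenizer keeps only word characters, which is why the exported text ends up without punctuation or digits.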

On my 2015 13-inch Retina MacBook Pro the export is very slow and you have to wait a long time: it used to finish in about ten minutes, but now takes roughly half an hour. The exported data is 1.09 GB of text (previously 950 MB), one article per line.

The directly exported Chinese Wikipedia text

Cleaning up the text

OK, the exported text is already too large to open comfortably in an ordinary text editor. Still, it is clear that the so-called Chinese content mixes Simplified and Traditional characters, and that has to be dealt with. Here I convert Traditional to Simplified; the other direction works just as well. We use OpenCC to do the job.

Installing OpenCC

I am on macOS, so installation is a single command: brew install opencc

After installing, you also need to write a configuration file and put it in the same directory as your corpus.

Save it as zht2zhs_config.json for later.
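As a sketch of what such a configuration can contain, the Python snippet below writes a Traditional-to-Simplified config in the layout of OpenCC 1.0's bundled t2s.json. The dictionary file names (TSPhrases.ocd, TSCharacters.ocd) are an assumption that depends on your OpenCC version; newer releases, as far as I know, ship .ocd2 files instead, so check what your installation provides.

```python
import json


def build_config(dict_ext="ocd"):
    """Return a Traditional -> Simplified OpenCC config as a dict.

    `dict_ext` is the dictionary file extension; "ocd" is assumed here,
    adjust it to match the files shipped with your OpenCC version.
    """
    phrases = {"type": dict_ext, "file": "TSPhrases." + dict_ext}
    characters = {"type": dict_ext, "file": "TSCharacters." + dict_ext}
    return {
        "name": "Traditional Chinese to Simplified Chinese",
        # Segment by phrases first so multi-character words convert as units.
        "segmentation": {"type": "mmseg", "dict": phrases},
        # Then map phrases, falling back to single characters.
        "conversion_chain": [
            {"dict": {"type": "group", "dicts": [phrases, characters]}}
        ],
    }


def write_config(path="zht2zhs_config.json", dict_ext="ocd"):
    """Write the config as JSON next to the corpus."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_config(dict_ext), f, ensure_ascii=False, indent=2)


# Example call:
# write_config("zht2zhs_config.json")
```

Writing the same JSON by hand works equally well; the script only saves some typing.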

Converting

Then run the conversion command in the current directory.

That does it.
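For reference, the conversion step can be sketched as a small Python wrapper around the opencc command-line tool, which takes an input file (-i), an output file (-o) and a config file (-c). The corpus file names in the example call are assumptions; only zht2zhs_config.json comes from the steps above.

```python
import subprocess


def build_opencc_command(src, dst, config="zht2zhs_config.json"):
    """Return the opencc invocation as an argument list."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def convert(src, dst, config="zht2zhs_config.json"):
    """Run opencc and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_opencc_command(src, dst, config), check=True)


# Example call (file names are hypothetical):
# convert("wiki_texts.txt", "wiki_texts_zhs.txt")
```

Running the same command directly in a terminal is equivalent; the wrapper is only handy if the whole pipeline lives in one script.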

The text after conversion to Simplified Chinese

Conclusion

And so we end up with a Chinese Wikipedia corpus in Simplified Chinese, free of punctuation and digits. Now throw it at the machine for training.

Further reading

Analyzing Chinese Wikipedia text data, part 1 (data acquisition and preprocessing)

Word2Vec experiments on the English Wikipedia corpus

Original article from LogStudio: R0uter's Blog » Building a Chinese Wikipedia corpus

When reposting, please keep the source and this link: https://www.logcg.com/archives/2240.html

About the Author

R0uter

Unless stated otherwise, my articles are original. When reposting, please include a link to this page and my name.

Comments

  1. Hello,
    I tried your program, but line 6:
    str_line = bytes.join(b' ', text).decode()
    raises the following error:
    sequence item 0: expected a bytes-like object, str found

    It looks like a type problem.
    I searched online for related answers,
    but after some small changes it still reports the same error.
    How can I solve this?

        1. Hello, I have corrected the code. They changed the API's behavior: it now processes the content directly into text. That is more convenient, at the price of slower processing ...
          You should now be able to export the text with the code as it stands.
