Five years ago I wrote an article, "A sentence input method based on dynamic programming", and mentioned the problem of pinyin splitting at the end of it. Because the drop-off input method was mainly aimed at Shuangpin at the time, no real splitting was required: just cut the input every two letters. (This is another reason why I admire Shuangpin; after all, it means one less technical difficulty.)
Later, the drop-off input method added support for full pinyin, and once I started optimizing it I found that pinyin segmentation is even harder than Chinese word segmentation.
When pinyin segmentation comes up, many people first think of an analogy with English word segmentation. That is not really accurate. The two look similar in form (Latin letters, separated by spaces, single characters carrying no meaning, and so on), but English has a very large vocabulary, while Chinese pinyin has only a few hundred syllables... So the thing it really corresponds to is Chinese word segmentation: in both cases the set of units is not that large, and context is what carries the meaning.
Popular solutions online
If you search for "pinyin splitting algorithm" online, most of what you find is written for search engines. Nobody studies pinyin splitting for input methods... These splitting schemes all share a fatal problem: the split is lossy and cannot handle ambiguous splits.
For example:
- How to implement an efficient pinyin matching library? Solving polyphonic characters and initial-letter matching
- Pinyin splitting in Python
- Pinyin segmentation, or pinyin word segmentation
- Splitting strings into pinyin syllables with a Trie (1): building a pinyin syllable model
- ……
Where the problem lies
All of these solutions "insert spaces into the pinyin string" and produce exactly one split. But pinyin splitting is in fact ambiguous. For example, zhanan can be zha'nan ("scumbag") but can also be zhan'an ("standing press"); both are reasonable, legal pinyin. If you think the latter is not commonly used (it is not a word at all), then take another example: fangan can be fan'gan ("disgusted") and can also be fang'an ("plan"). There are far too many pinyin combinations like this... a disaster for Chinese input methods.
"Inserting spaces into the pinyin string" here describes the final output form, not the algorithm itself.
The other kind of ambiguity changes the number of syllables in the split: typically xian can also be xi'an, and lian can also be li'an...
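To make the two kinds of ambiguity concrete, here is a minimal sketch that enumerates every legal split of a string instead of returning just one. The `syllables` table is a tiny hypothetical subset chosen for this example (a real engine would load the full list of legal pinyin), and `allSplits` is an illustrative name, not part of any engine.

```swift
// Hypothetical, deliberately tiny syllable table (assumption for this example).
let syllables: Set<String> = ["fan", "fang", "an", "gan", "xi", "xian"]

/// Returns every way to cut `input` into syllables from the table.
func allSplits(_ input: String) -> [[String]] {
    let chars = Array(input)

    // Recursively try every legal syllable that starts at `start`.
    func go(from start: Int) -> [[String]] {
        if start == chars.count { return [[]] }  // consumed everything: one empty tail
        var results: [[String]] = []
        let longest = min(6, chars.count - start)  // no pinyin syllable exceeds 6 letters
        for length in 1...longest {
            let head = String(chars[start..<start + length])
            guard syllables.contains(head) else { continue }
            for tail in go(from: start + length) {
                results.append([head] + tail)
            }
        }
        return results
    }

    return go(from: 0)
}

print(allSplits("fangan"))  // two splits, both with 2 syllables
print(allSplits("xian"))    // two splits with different syllable counts
```

With this table, fangan yields fan'gan and fang'an (same length), while xian yields xi'an and xian (different lengths), which is exactly the distinction drawn above.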
A simple approach
Actually, if you look closely, you will find that the problem is very regular. Take fangan: splitting with the forward longest-match principle gives fang'an, and splitting in reverse gives fan'gan. Perfect. But the reverse split goes wrong easily. For the phrase susongan, the forward split is su'song'an, while the reverse split becomes su's'o'n'gan, a total failure.
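The forward and reverse splits described here can be sketched as follows. This is an illustration under assumptions, not the engine's code: the `syllables` table is a tiny hypothetical subset, and letters that match nothing are emitted one at a time.

```swift
// Hypothetical, deliberately tiny syllable table (assumption for this example).
let syllables: Set<String> = ["fan", "fang", "an", "gan", "su", "song"]
let maxLength = 6  // longest pinyin syllable, e.g. "zhuang"

/// Forward longest match: keep taking the longest legal syllable from the left.
func forwardSplit(_ input: String) -> [String] {
    let chars = Array(input)
    var result: [String] = []
    var start = 0
    while start < chars.count {
        var length = min(maxLength, chars.count - start)
        // Shrink until the piece is legal; fall back to a single letter.
        while length > 1 && !syllables.contains(String(chars[start..<start + length])) {
            length -= 1
        }
        result.append(String(chars[start..<start + length]))
        start += length
    }
    return result
}

/// Reverse longest match: keep taking the longest legal syllable from the right.
func backwardSplit(_ input: String) -> [String] {
    let chars = Array(input)
    var result: [String] = []
    var end = chars.count
    while end > 0 {
        var length = min(maxLength, end)
        while length > 1 && !syllables.contains(String(chars[end - length..<end])) {
            length -= 1
        }
        result.insert(String(chars[end - length..<end]), at: 0)
        end -= length
    }
    return result
}

print(forwardSplit("fangan"), backwardSplit("fangan"))
print(forwardSplit("susongan"), backwardSplit("susongan"))
```

The reverse pass greedily takes gan from the end of susongan, which leaves suson with no legal syllable at its tail, so the rest falls apart into single letters.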
Transition matrix
OK. Once the approach above turned out not to work, I thought of another way: since I already use a transition matrix to record transition probabilities and solve for Chinese sentences, why not do the same for pinyin splitting? The cost is that solving twice is a bit slow.
So I counted the pinyin transitions over the whole corpus... As mentioned above, pinyin is not English; it has too few distinct units, so first-order and second-order transitions made no difference. Overall it was still better than longest match, and combined with the pinyin word frequencies I already had, it occasionally produced the correct result. I used this scheme for over two years and improved it many times along the way, but most ambiguous splits ended up being handled manually with hard-coded rules.
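As a rough illustration of scoring candidate splits with first-order transition probabilities, here is a minimal sketch. The counts in `transitionCounts` are made up for this example, `vocabularySize` is an assumed round number, and add-one smoothing is my choice; none of this is taken from the actual engine or its corpus.

```swift
import Foundation

// Made-up bigram counts, for illustration only; real values would come from a corpus.
// "<s>" marks the start of the input.
let transitionCounts: [String: [String: Int]] = [
    "<s>": ["fang": 40, "fan": 60],
    "fang": ["an": 30],
    "fan": ["gan": 5],
]
let vocabularySize = 400  // assumed rough number of legal syllables, used for smoothing

/// Log-probability of one candidate split under a first-order model
/// with add-one smoothing.
func score(_ split: [String]) -> Double {
    var previous = "<s>"
    var total = 0.0
    for syllable in split {
        let row = transitionCounts[previous] ?? [:]
        let count = row[syllable] ?? 0
        let rowTotal = row.values.reduce(0, +)
        total += log(Double(count + 1) / Double(rowTotal + vocabularySize))
        previous = syllable
    }
    return total
}

/// Picks the candidate with the highest score.
func best(of candidates: [[String]]) -> [String]? {
    candidates.max { score($0) < score($1) }
}

print(best(of: [["fan", "gan"], ["fang", "an"]]) ?? [])
```

With these made-up counts the model prefers fang'an. It still returns a single winner, so the losing split is simply gone, which is the lossy behaviour criticized earlier.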
Back to the essence
I don't know whether you noticed, but my examples share another pattern: after a forward split, the pinyin at the end is always a bare final! Whether it is fang'an, zhan'an, or even gang'a (where the ideal split is gan'ga), the last pinyin is a final (strictly speaking, it "has no initial"). In these cases the initial has been taken away by the previous pinyin, where it joins that pinyin's final to form a different final. So we can simply run the longest match and check whether each pinyin has an initial. If it does not, take the previous pinyin, take its last letter, and attach that letter to the current pinyin. If the combination is a legal pinyin, and the previous pinyin is still legal with that letter removed, then we have found an ambiguous split. Just add all of them to the list and you're good to go!
No statistical language model is needed at all (I did try one; the result was not much better than pure transition probabilities, and it was slower). Rules are enough. This completely solves the fixed-length ambiguous splitting problem. In the subsequent whole-sentence query we can feed all of these pinyin straight into the model and let it find the candidate that best fits the context on its own.
Here is the actual Swift code used in my engine. Copying and pasting it directly will not work, because the related object declarations are missing, but it can be read as pseudocode:
```swift
class LESegment {
    private class func addPinyin2List(pinyin: String, py_list: inout [[String]]) {
        var handled = false
        if let p = PyString.pyCode(from: pinyin), PyString.isYunMu(p), let last_list = py_list.last {
            /*
             Because this is a prefix query, a pinyin that is a bare final may
             combine with the last letter of the previous pinyin to form a new
             pinyin. In fang'an, for example, the g could also go to the final
             that follows, giving fan'gan.
             We take the last letter of the previous pinyin and try to attach it
             to the following final, then check whether both sides are still
             legal pinyin. If so, both splits are added to the list.
             Later queries can then mix all 4 combinations (this also admits
             fang'gan and fan'an, but given transitions and word frequency the
             impact on whole-sentence results should be small).
             (For word queries the impact should also be acceptable.)
             */
            for py1 in last_list {
                let newPy = "\(py1.last!)\(pinyin)"
                guard PyString.isValidPinyin(for: newPy) else { continue }
                let oldPy = py1.subString(to: -1)
                guard PyString.isValidPinyin(for: oldPy) else { continue }
                py_list[py_list.count - 1].append(oldPy)
                py_list.append([newPy, pinyin])
                handled = true
                break
            }
        }
        if !handled {
            py_list.append([pinyin])
        }
    }

    private class func segment(_ py: String, smartCorrection: Bool) -> [[String]] {
        guard !py.isEmpty else { return [] }
        var py_list: [[String]] = []
        var last_index = 0

        while true {
            for i in (last_index...min(py.count, last_index + 6)).reversed() {
                let sub = py.subString(from: last_index, to: i + 1)
                if PyString.isValidPinyin(for: sub) {
                    addPinyin2List(pinyin: sub, py_list: &py_list)
                    last_index = i + 1
                    if i == py.count {
                        last_index = i
                    }
                    break
                }
                if i == last_index {
                    // Reaching this point means nothing that follows is legal
                    // pinyin, so append the remainder to the list and return.
                    py_list.append([py.subString(from: last_index)])
                    return py_list
                }
            }
            if last_index == py.count {
                break
            }
        }
        return py_list
    }
}
```
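Since the listing above depends on declarations that are not shown, here is a self-contained toy version of the same rule that can actually be run. It is a simplified sketch, not the engine's code: the `syllables` table is a tiny hypothetical subset, `hasInitial` assumes that a written pinyin syllable without an initial starts with a, o, or e, and only the first alternative of the previous slot is examined.

```swift
// Hypothetical, deliberately tiny syllable table (assumption for this example).
let syllables: Set<String> = [
    "fan", "fang", "an", "gan", "zha", "zhan", "nan", "ga", "gang", "a",
]

/// Assumption: a written syllable with no initial begins with a, o, or e.
func hasInitial(_ syllable: String) -> Bool {
    guard let first = syllable.first else { return false }
    return !"aoe".contains(first)
}

/// Forward longest match that records, per position, every alternative
/// produced by moving one letter from the previous syllable to a bare final.
func segmentWithAmbiguity(_ input: String) -> [[String]] {
    let chars = Array(input)
    var slots: [[String]] = []
    var start = 0
    while start < chars.count {
        var length = min(6, chars.count - start)
        while length > 1 && !syllables.contains(String(chars[start..<start + length])) {
            length -= 1
        }
        let current = String(chars[start..<start + length])
        start += length

        var alternatives = [current]
        if !hasInitial(current), let previous = slots.last?.first, let moved = previous.last {
            let shortened = String(previous.dropLast())  // e.g. fang -> fan
            let extended = String(moved) + current       // e.g. g + an -> gan
            if syllables.contains(shortened) && syllables.contains(extended) {
                slots[slots.count - 1].append(shortened)
                alternatives.append(extended)
            }
        }
        slots.append(alternatives)
    }
    return slots
}

print(segmentWithAmbiguity("fangan"))
print(segmentWithAmbiguity("ganga"))
```

Each inner array holds the alternatives for one position, so fangan comes back as two slots, [fang, fan] and [an, gan], and the number of slots stays fixed no matter which alternative a later query picks.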
Variable-length ambiguous splits
The discussion above focuses on fixed-length ambiguous splits. That is due to a limitation of the drop-off input method engine itself: it cannot mix and process pinyin strings of different lengths, which is in turn a limitation left over from its early Shuangpin-based development. So for variable-length ambiguity, my approach is to merge these potentially ambiguous words together when processing the lexicon. For example, a query for xian returns words like "Xian" but also words like "Xi'an"; they share the same code. In actual testing, however, the result is not ideal. For example, the user input xinlianwei is split into xin'lian'wei. The intended reading is the li ("reason") of "psychological comfort" (xin'li'an'wei), but that candidate can never rank above lian ("even")... Regrettably, the drop-off input method cannot query xin'lian'wei and xin'li'an'wei at the same time, because one has length 3 and the other has length 4. I have not found a better solution here yet; I will come back and add to this in the future.
Original article written by LogStudio: R0uter's Blog » How to split the full pinyin of drop-off input method
When reproducing, please keep the source and this link: https://www.logcg.com/archives/3556.html