在 Swift 中使用 cmph

最近更改：4月9日, 2022

CMPH 的全稱是 ç最小完美哈希庫 ，是一個很著名的用 C 寫成的最小完美哈希庫，什麼是完美哈希？

完美哈希

這裡我們不講原理，你只需要知道傳統的哈希有衝突，我們需要靠各種算法來處理衝突就可以了，對於哈希，總是需要一個表，這個表裡預留了很多位置，然後計算出來的值就是這些位置的坐標，你可以把對應的數據放到坐標裡。

但這時候有一個問題，如果這個預先的表不夠大，或者說你的算法不夠好，就會發生所謂的聚集，某一片區域值總是衝突，而某一片區域的值則是空的，所以，你的表長度總是要大於 key 的數量來容納這些空餘。

但如果你的算法足夠牛逼，那麼你就能做到讓表的長度n最少能夠等於key的數量！

當這種情況達成，那麼我們就稱這個哈希函數為最小完美哈希函數。

CMPH

那麼這種東西真的存在嗎？當然。CMPH 就是一個，當然想要使用 CMPH 還是有一點前提的，比如你的 key 得是不重複的，key 的數量得是固定的靜態的。

比如說我的隱馬爾可夫模型中的轉移矩陣，就滿足這麼一個需求。

不過，cmph 是用 C 語言實現的，我們需要把 C 橋接到 Swift 中。

橋接

和橋接 oc 差不多，你把下載好的cmph目錄中的 src 目錄裡的源文件都拖到 Swift 項目中即可（記得選擇拷貝而不是引用），如果是第一次拖入其他語言的內容，Xcode 會自動提示你創建橋接文件，然後你的項目中會有一個叫做 XXX-橋接-頭.H 的文件，其中的xxx就是你的項目名字。

在裡邊寫 #include "cmph.h" 就好了，這樣你就把 cmph 的函數暴露給了 Swift。

整理項目

直接導入的 cmph 源代碼不能直接在 Swift 項目裡使用，如果你空項目編譯，會遇到“多個 main 函數”的錯誤，去 cmph 源文件中，刪除主要.C bm_numbers.C bm_bumbers.H 現在再編譯試試，如果還報錯，那麼就刪除那個包含 main 函數的源碼即可，我們只用函數。

編譯

cmph 其實是庫和工具套裝——這也是為什麼源碼裡包含了 main 函數，我們進入下載的 cmph 目錄裡，執行如下命令進行編譯：

./configure

make

make install

這下你就可以在終端執行命令 CMPH 了：

$cmph -h

usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ] [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir] [-m file.mph] keysfile

Minimum perfect hashing tool

-h print this help message

-c c value determines:

* the number of vertices in the graph for the algorithms BMZ and CHM

* the number of bits per key required in the FCH algorithm

* the load factor in the CHD_PH algorithm

-a algorithm - valid values are

* bmz

* bmz8

* chm

* brz

* fch

* bdz

* bdz_ph

* chd_ph

* chd

-f hash function (may be used multiple times) - valid values are

* jenkins

-V print version number and exit

-v increase verbosity (may be used multiple times)

-k number of keys

-g generation mode

-s random seed

-m minimum perfect hash function file

-M main memory availability (in MB) used in BRZ algorithm

-d temporary directory used in BRZ algorithm

-b the meaning of this parameter depends on the algorithm selected in the -a option:

* For BRZ it is used to make the maximal number of keys in a bucket lower than 256.

In this case its value should be an integer in the range [64,175]. Default is 128.

* For BDZ it is used to determine the size of some precomputed rank

information and its value should be an integer in the range [3,10]. Default

is 7. The larger is this value, the more compact are the resulting functions

and the slower are them at evaluation time.

* For CHD and CHD_PH it is used to set the average number of keys per bucket

and its value should be an integer in the range [1,32]. Default is 4. The

larger is this value, the slower is the construction of the functions.

This parameter has no effect for other algorithms.

-t set the number of keys per bin for a t-perfect hashing function. A t-perfect

hash function allows at most t collisions in a given bin. This parameter applies

only to the CHD and CHD_PH algorithms. Its value should be an integer in the

range [1,128]. Defaul is 1

keysfile line separated file with keys

創建哈希表

使用工具創建哈希表而不是代碼，因為它需要key是靜態的，所以要事先準備key的文本列表，要求是純文本文檔，一行一個 key 即可，最大千萬也妥妥的。

編碼是utf8

根據命令提示，我們直接在目錄裡執行 CMPH -g 按鍵.TXT 即可，如果你想要指定算法（具體的算法種類和特性可以見官網）比如我的數據庫肯定是用在外存的，就用專門對外存做優化的快算法，那麼命令就變成了這樣 CMPH -g -一個快按鍵.TXT 。

執行完畢後會在當前目錄下生成按鍵.TXT.英里，為了方便，我把它改為按鍵.英里。

驗證

要看到對應的結果，我們可以再來一個文本文件，格式與 key 的列表的格式相同，裡邊寫入你想要查詢的 key，比如我命名為 keysq.TXT ，這時候我們用 debug 模式輸出查詢結果：

$cmph -v -m keys.mph keysq.txt

3237395114 -> 1025405

3237405114 -> 1287414

3237467114 -> 1613673

3237531114 -> 681367

3237926114 -> 1894601

3248668114 -> 221574

獲得結果

現在，我們可以用同樣的方法把所有的 key 查一遍，就能得到每一個 key 對應的坐標了！

1	cmph -v -m keys.mph keys.txt > result.txt

獲得了結果，我們就可以根據位置來生成一個對應的文件，我用自己的編碼方式編譯了一個之際的二進制數據庫，但位置根據 CMPH 的坐標來對應，這樣數據和位置就可以做到一一對應了，查詢的速度是o(1).

在 Swift 中查詢

現在，我們可以把生成的 mph 文件導入 Swift 項目備用了——當然還有對應次序的數據庫文件以便查找。

在 Swift 中使用 C 的函數，還是挺有難度的。

首先你要有一個類型為不可表達指針類型的變量來存哈希表對象：

1	fileprivate var hashTable:OpaquePointer

在初始化器裡，我們來用 c 函數加載哈希表：

1 2	guard let indexData = fopen(indexPath, "r")else { return nil } hashTable = cmph_load(indexData)

這樣哈希表就加載成功了。

假定我們要查的值是一個 UInt32的的數字編碼，那麼我們需要先把它轉為 cmph 函數所接受的燒焦* 指標，那麼可以藉助的NSString 來實現：

雖然 Swift 開發者文檔中說 Swift 的 String 自動映射了 NSString 的方法，但顯然，它並沒有。

let c = "\(code)" //转为字符串备用

let key = (c as NSString).utf8String //利用 NSString 转为 char* 的 UnsafePointer<Int8>指针

let id = cmph_search(self.hashTable, key , cmph_uint32(c.utf8.count)) //传参查询

這樣id變量就是一個Int類型的坐標了，返回值的類型 Swift 編譯器自動幫你映射了。

總結

cmph 的原理很複雜，但這並不影響我們使用它。值得一提的是，它使用 LGPL 授權，也就是說你可以自由地使用它。

使用 Swift 可以完美地橋接 C 語言，但由於 Swift 本身並不鼓勵開發者這麼做導致一堆“unsafe”看著有種驚心動魄的感覺。

本文由落格博客原創撰寫：落格博客 » 在 Swift 中使用 cmph

轉載請保留出處和原文鏈接：https://www.logcg.com/archives/2350.html

落格博客

在 Swift 中使用 cmph

完美哈希

CMPH

橋接

整理項目

編譯

創建哈希表

驗證

獲得結果

在 Swift 中查詢

總結

關於作者

R0uter

發表評論取消回复

在 Swift 中使用 cmph

完美哈希

CMPH

橋接

整理項目

編譯

創建哈希表

驗證

獲得結果

在 Swift 中查詢

總結

相關文章：

關於作者

R0uter

發表評論 取消回复

發表評論取消回复