在 Swift 中使用 cmph

最近更新：4th 9 月, 2022

cmph 的全称是 C Minimal Perfect Hashing Library ，是一个很著名的用 C 写成的最小完美哈希库，什么是完美哈希？

完美哈希

这里我们不讲原理，你只需要知道传统的哈希有冲突，我们需要靠各种算法来处理冲突就可以了，对于哈希，总是需要一个表，这个表里预留了很多位置，然后计算出来的值就是这些位置的坐标，你可以把对应的数据放到坐标里。

但这时候有一个问题，如果这个预先的表不够大，或者说你的算法不够好，就会发生所谓的聚集，某一片区域值总是冲突，而某一片区域的值则是空的，所以，你的表长度总是要大于 key 的数量来容纳这些空余。

但如果你的算法足够牛逼，那么你就能做到让表的长度n最少能够等于key的数量！

当这种情况达成，那么我们就称这个哈希函数为最小完美哈希函数。

cmph

那么这种东西真的存在吗？当然。cmph 就是一个，当然想要使用 cmph 还是有一点前提的，比如你的 key 得是不重复的，key 的数量得是固定的静态的。

比如说我的隐马尔可夫模型中的转移矩阵，就满足这么一个需求。

不过，cmph 是用 C 语言实现的，我们需要把 C 桥接到 Swift 中。

桥接

和桥接 oc 差不多，你把下载好的cmph目录中的 src 目录里的源文件都拖到 Swift 项目中即可（记得选择拷贝而不是引用），如果是第一次拖入其他语言的内容，Xcode 会自动提示你创建桥接文件，然后你的项目中会有一个叫做 xxx-Bridging-Header.h 的文件，其中的xxx就是你的项目名字。

在里边写 #include "cmph.h" 就好了，这样你就把 cmph 的函数暴露给了 Swift。

整理项目

直接导入的 cmph 源代码不能直接在 Swift 项目里使用，如果你空项目编译，会遇到“多个 main 函数”的错误，去 cmph 源文件中，删除 main.c bm_numbers.c bm_bumbers.h 现在再编译试试，如果还报错，那么就删除那个包含 main 函数的源码即可，我们只用函数。

编译

cmph 其实是库和工具套装——这也是为什么源码里包含了 main 函数，我们进入下载的 cmph 目录里，执行如下命令进行编译：

./configure

make

make install

这下你就可以在终端执行命令 cmph 了：

$cmph -h

usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ] [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir] [-m file.mph] keysfile

Minimum perfect hashing tool

-h print this help message

-c c value determines:

* the number of vertices in the graph for the algorithms BMZ and CHM

* the number of bits per key required in the FCH algorithm

* the load factor in the CHD_PH algorithm

-a algorithm - valid values are

* bmz

* bmz8

* chm

* brz

* fch

* bdz

* bdz_ph

* chd_ph

* chd

-f hash function (may be used multiple times) - valid values are

* jenkins

-V print version number and exit

-v increase verbosity (may be used multiple times)

-k number of keys

-g generation mode

-s random seed

-m minimum perfect hash function file

-M main memory availability (in MB) used in BRZ algorithm

-d temporary directory used in BRZ algorithm

-b the meaning of this parameter depends on the algorithm selected in the -a option:

* For BRZ it is used to make the maximal number of keys in a bucket lower than 256.

In this case its value should be an integer in the range [64,175]. Default is 128.

* For BDZ it is used to determine the size of some precomputed rank

information and its value should be an integer in the range [3,10]. Default

is 7. The larger is this value, the more compact are the resulting functions

and the slower are them at evaluation time.

* For CHD and CHD_PH it is used to set the average number of keys per bucket

and its value should be an integer in the range [1,32]. Default is 4. The

larger is this value, the slower is the construction of the functions.

This parameter has no effect for other algorithms.

-t set the number of keys per bin for a t-perfect hashing function. A t-perfect

hash function allows at most t collisions in a given bin. This parameter applies

only to the CHD and CHD_PH algorithms. Its value should be an integer in the

range [1,128]. Defaul is 1

keysfile line separated file with keys

创建哈希表

使用工具创建哈希表而不是代码，因为它需要key是静态的，所以要事先准备key的文本列表，要求是纯文本文档，一行一个 key 即可，最大千万也妥妥的。

编码是utf8

根据命令提示，我们直接在目录里执行 cmph -g keys.txt 即可，如果你想要指定算法（具体的算法种类和特性可以见官网）比如我的数据库肯定是用在外存的，就用专门对外存做优化的 brz 算法，那么命令就变成了这样 cmph -g -a brz keys.txt 。

执行完毕后会在当前目录下生成 keys.txt.mph ，为了方便，我把它改为 keys.mph 。

验证

要看到对应的结果，我们可以再来一个文本文件，格式与 key 的列表的格式相同，里边写入你想要查询的 key，比如我命名为 keysq.txt ，这时候我们用 debug 模式输出查询结果：

$cmph -v -m keys.mph keysq.txt

3237395114 -> 1025405

3237405114 -> 1287414

3237467114 -> 1613673

3237531114 -> 681367

3237926114 -> 1894601

3248668114 -> 221574

获得结果

现在，我们可以用同样的方法把所有的 key 查一遍，就能得到每一个 key 对应的坐标了！

1	cmph -v -m keys.mph keys.txt > result.txt

获得了结果，我们就可以根据位置来生成一个对应的文件，我用自己的编码方式编译了一个之际的二进制数据库，但位置根据 cmph 的坐标来对应，这样数据和位置就可以做到一一对应了，查询的速度是o(1).

在 Swift 中查询

现在，我们可以把生成的 mph 文件导入 Swift 项目备用了——当然还有对应次序的数据库文件以便查找。

在 Swift 中使用 C 的函数，还是挺有难度的。

首先你要有一个类型为不可表达指针类型的变量来存哈希表对象：

1	fileprivate var hashTable:OpaquePointer

在初始化器里，我们来用 c 函数加载哈希表：

1 2	guard let indexData = fopen(indexPath, "r")else { return nil } hashTable = cmph_load(indexData)

这样哈希表就加载成功了。

假定我们要查的值是一个 UInt32 的数字编码，那么我们需要先把它转为 cmph 函数所接受的 char* 指针，那么可以借助 NSString 来实现：

虽然 Swift 开发者文档中说 Swift 的 String 自动映射了 NSString 的方法，但显然，它并没有。

let c = "\(code)" //转为字符串备用

let key = (c as NSString).utf8String //利用 NSString 转为 char* 的 UnsafePointer<Int8>指针

let id = cmph_search(self.hashTable, key , cmph_uint32(c.utf8.count)) //传参查询

这样id变量就是一个Int类型的坐标了，返回值的类型 Swift 编译器自动帮你映射了。

总结

cmph 的原理很复杂，但这并不影响我们使用它。值得一提的是，它使用 LGPL 授权，也就是说你可以自由地使用它。

使用 Swift 可以完美地桥接 C 语言，但由于 Swift 本身并不鼓励开发者这么做导致一堆“unsafe”看着有种惊心动魄的感觉。

本文由落格博客原创撰写：落格博客 » 在 Swift 中使用 cmph

转载请保留出处和原文链接：https://www.logcg.com/archives/2350.html

落格博客

在 Swift 中使用 cmph

完美哈希

cmph

桥接

整理项目

编译

创建哈希表

验证

获得结果

在 Swift 中查询

总结

About the Author

R0uter

发表回复取消回复

在 Swift 中使用 cmph

完美哈希

cmph

桥接

整理项目

编译

创建哈希表

验证

获得结果

在 Swift 中查询

总结

相关文章：

About the Author

R0uter

发表回复 取消回复

发表回复取消回复