cmph It stands for C Minimal Perfect Hashing Library ,It is a very well-known, written in C minimal perfect hash library,What is a perfect hash?
Perfect hashing
Here we do not speak principle,You just need to know that the traditional hash conflict,We need to rely on a variety of algorithms to deal with conflict can be,For hash,Always need a table,This table reserved many locations,Then the calculated value is the coordinates of these locations,You can put data into the corresponding coordinate in。
But this time there is a problem,If the table is not big enough in advance,Or that your algorithm is not good enough,The so-called aggregation occurs,A value is always an area of conflict,The value of a certain area is empty,so,Your table is always greater than the number of key length to accommodate these spare。
But if your algorithm Niubi enough,Then you can do so that the table of length n can be equal to a minimum number of key!
When this reached,Then we call this hash functionMinimal perfect hash function。
cmph
So this kind of thing really exist? of course。cmph Is a,Of course we want to use cmph Or a little premise,For example, you do not have to repeat the key,The key number must be fixed static。
For example, I Hidden Markov Model transition matrix,To meet such a requirement。
but,cmph with C language,We need to put C bridgingTo Swift in。
bridging
And almost bridging oc, You downloaded cmph directory src directory of the source files onto the Swift project can (remember to choosecopyNot a reference),If this is the first drag content in other languages,Xcode will automatically prompt you to create a bridge file,Then you have a project called the Xxx.-Bridging-Header.h document,Where xxx is the name of your project。
Write in it #include "cmph.h" Enough,So you put cmph functions exposed to the Swift。
Consolidation Project
Direct import cmph source code can not be used directly in the Swift project in,If you empty project compiler,Encounter "Multiple main function" error,Go cmph source file,delete main.c bm_numbers.c bm_bumbers.h Now try to compile again,If you have an error,Then delete the source code to contain the main function,We only use the function。
Compile
cmph fact, libraries and tool kits - which is why the source code contains the main function,We go to the download directory of cmph,Run the following command to compile:
1 2 3 |
./configure make make install |
This is where you can execute commands in a terminal cmph Got it:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
$cmph -h usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ] [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir] [-m file.mph] keysfile Minimum perfect hashing tool -h print this help message -c c value determines: * the number of vertices in the graph for the algorithms BMZ and CHM * the number of bits per key required in the FCH algorithm * the load factor in the CHD_PH algorithm -a algorithm - valid values are * bmz * bmz8 * chm * brz * fch * bdz * bdz_ph * chd_ph * chd -f hash function (may be used multiple times) - valid values are * jenkins -V print version number and exit -v increase verbosity (may be used multiple times) -k number of keys -g generation mode -s random seed -m minimum perfect hash function file -M main memory availability (in MB) used in BRZ algorithm -d temporary directory used in BRZ algorithm -b the meaning of this parameter depends on the algorithm selected in the -a option: * For BRZ it is used to make the maximal number of keys in a bucket lower than 256. In this case its value should be an integer in the range [64,175]. Default is 128. * For BDZ it is used to determine the size of some precomputed rank information and its value should be an integer in the range [3,10]. Default is 7. The larger is this value, the more compact are the resulting functions and the slower are them at evaluation time. * For CHD and CHD_PH it is used to set the average number of keys per bucket and its value should be an integer in the range [1,32]. Default is 4. The larger is this value, the slower is the construction of the functions. This parameter has no effect for other algorithms. -t set the number of keys per bin for a t-perfect hashing function. A t-perfect hash function allows at most t collisions in a given bin. This parameter applies only to the CHD and CHD_PH algorithms. Its value should be an integer in the range [1,128]. Defaul is 1 keysfile line separated file with keys |
Create a hash table
Use tool to create a hash table rather than code,Because it requires key is static,So be prepared in advance a list of key texts,Requirements are plain text documents,Line to a key,Maximum ten million also properly the completed。
Encoding is utf8
According to the command prompt,We executed directly in the directory cmph -g keys.txt .,If you want to specify the algorithm (specific algorithm can see the type and characteristicsOfficial website) For example, I am sure the database is used in the external memory,Do it with a special external memory optimization fast Algorithm,The command becomes so cmph -g -a fast keys.txt 。
Will be generated under the current directory after the implementation keys.txt.mph ,For convenience,I change it to keys.mph 。
verification
To see the corresponding results,We can come to a text file,The same format as the key list format,Written inside the key you want to query,For example, I named Keyser.txt ,This time we export query results with debug mode:
1 2 3 4 5 6 7 |
$cmph -v -m keys.mph keysq.txt 3237395114 -> 1025405 3237405114 -> 1287414 3237467114 -> 1613673 3237531114 -> 681367 3237926114 -> 1894601 3248668114 -> 221574 |
Get results
Now,We can use the same method to all the key search again,Each key can be obtained corresponding to the coordinates!
1 |
cmph -v -m keys.mph keys.txt > result.txt |
Results obtained,We will be able to generate a corresponding file according to the position,I used to own encoding compiled on the occasion of a binary database,But according to location cmph The coordinates correspond,Such data and position can do a correspondence,Query speedo(1).
The query Swift
Now,We can generate mph Swift project files into a backup - and of course the order of the corresponding database file to find the。
Swift using a function in C,It is quite difficult。
First, you need to have a type of non-variable expression of type pointer to keep the hash table object:
1 |
fileprivate var hashTable:OpaquePointer |
In the initialization Lane,We used to load the hash table c function:
1 2 |
guard let indexData = fopen(indexPath, "r")else { return nil } hashTable = cmph_load(indexData) |
So the hash table is loaded successfully。
Suppose we want to check the value of a UInt32 Digital coding,So we need to put it into cmph function accepted char* pointer,You can make use of NSString to fulfill:
although Swift developer documentationSwift said the String method automatically maps the NSString,But apparently,It does not。
1 2 3 |
let c = "\(code)" //转为字符串备用 let key = (c as NSString).utf8String //利用 NSString 转为 char* 的 UnsafePointer<Int8>指针 let id = cmph_search(self.hashTable, key , cmph_uint32(c.utf8.count)) //传参查询 |
Such a Int id variable is the type of coordinates,Return Value Type Swift compiler automatically help you map。
Summarize
cmph principle is very complex,But this does not affect our use of it。It is worth mentioning that,It uses LGPL Authorize,This means that you are free to use it。
Use Swift perfectly bridged C language,Swift does not in itself but because developers are encouraged to do so leads to a bunch of "unsafe" looked kind of thrilling feeling。
Original article written by LogStudio:R0uter's Blog » Swift in the use cmph
Reproduced Please keep the source and description link:https://www.logcg.com/archives/2350.html