R0uter's Blog

Log everything.

Kotlin/Android detect text encoding

November 2, 2024

Last updated: November 2, 2024

Recently, while working on the Android version of the drop-box input method, I ran into a problem importing code tables. The input method's code tables support two encodings, utf8 and gb18030, and even my own built-in code tables use a mix of the two. In Swift or Python, decoding text with the wrong encoding raises an error. That makes it easy to detect both encodings: decode as utf8 first, and if that fails, try gb18030. Simple, convenient, and good enough for me.

But this approach doesn't work in Kotlin! In Java, if you decode text with the wrong encoding, it just hands you a pile of gibberish. In other languages you usually have to ask for this lenient decoding explicitly, but in Java it is the default, and the usual reader APIs offer no obvious way to make them report an error.


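    try {
        val assetName = ""
        val inputStream = context.assets.open(assetName)
        // Text in any encoding decodes here without triggering an error
        reader = InputStreamReader(inputStream, Charsets.UTF_8)
    } catch (e: IOException) {
        // The file is not in the assets, so we try loading it from the file system
    } catch (e: Exception) {
        // A wrong encoding never lands here.
        // Message: "Code table encoding error; make sure the encoding is utf-8 without a BOM!"
        return importFalse(context, "码表编码格式错误，请确定编码为 uft-8 不要有 BOM!")
    }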
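
To make the silent behavior concrete, here is a minimal sketch. The helper names `decodeLeniently` and `hasReplacementChar` are made up for illustration and are not part of the original code. It decodes bytes through `InputStreamReader` and then checks whether the result contains U+FFFD, the replacement character that the lenient decoder substitutes for malformed input.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStreamReader
import java.nio.charset.Charset

// Hypothetical helper: decode bytes the lenient way, through InputStreamReader.
// Malformed input does not throw; it is replaced with U+FFFD.
fun decodeLeniently(bytes: ByteArray, charset: Charset): String =
    InputStreamReader(ByteArrayInputStream(bytes), charset).use { it.readText() }

// Hypothetical helper: true if the decoded text contains the replacement character.
fun hasReplacementChar(text: String): Boolean = text.contains('\uFFFD')
```

Decoding gb18030 bytes of a Chinese string as UTF-8 this way returns a string instead of throwing, and `hasReplacementChar` reports `true` for it. Checking for U+FFFD is only a heuristic, since a file could legitimately contain that character.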

That made things difficult. After some searching, I found a third-party library that claims to detect a variety of encodings: https://github.com/albfernandez/juniversalchardet. Unfortunately, it did not successfully detect gb18030.

In the end, as expected, a Chinese problem has to be searched for in Chinese. I found this article: http://www.meilongkui.com/archives/473. It says that the calling method it describes can trigger an error. Following the article's approach, I first read a short chunk and then tried the conversion... and it succeeded! But there is a small catch: because the read buffer is cut off at a fixed length, even utf8 text may fail to decode as utf8, since the buffer may happen to end in the middle of a multi-byte character.

In the end, my solution was to read everything in one go, which avoids that problem.

Of course, the price is that everything has to be read into memory at once. That is not a big problem for me, because input method code tables are usually not very large. If necessary, you could also shift the size or position of the buffer slightly and try again.

The final code looks like this:
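
The truncation problem is easy to reproduce. The sketch below uses a hypothetical helper, `decodesStrictly`, which is not from the linked article. It relies on `CharsetDecoder.decode(ByteBuffer)`, which throws a `CharacterCodingException` on malformed input when the decoder is left at its default settings.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset

// Hypothetical helper: true if the bytes decode cleanly with the given charset.
// A fresh decoder reports malformed input by throwing instead of substituting U+FFFD.
fun decodesStrictly(bytes: ByteArray, charset: Charset): Boolean =
    try {
        charset.newDecoder().decode(ByteBuffer.wrap(bytes))
        true
    } catch (e: CharacterCodingException) {
        false
    }
```

For example, the character 中 is three bytes in UTF-8 (E4 B8 AD). `decodesStrictly` accepts all three bytes, but rejects just the first two, which is exactly what happens when a fixed-size buffer ends mid-character.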
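
If reading the whole file is not an option, one possible variation (a different technique from the retry idea above, not something I have taken from the original code) is to use the lower-level `CharsetDecoder.decode(in, out, endOfInput)` call with `endOfInput = false`. In that mode an incomplete character at the end of the chunk is reported as underflow rather than as an error. The function name `prefixLooksLike` and the scratch buffer size are assumptions for illustration.

```kotlin
import java.nio.ByteBuffer
import java.nio.CharBuffer
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Hypothetical helper: check whether a chunk read from the start of a file is
// consistent with the charset, tolerating a character cut off at the chunk's end.
fun prefixLooksLike(chunk: ByteArray, charset: Charset): Boolean {
    val decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
    val input = ByteBuffer.wrap(chunk)
    val output = CharBuffer.allocate(1024) // scratch space; the decoded text is discarded
    while (true) {
        // endOfInput = false: trailing bytes of an unfinished character are not an error
        val result = decoder.decode(input, output, false)
        when {
            result.isError -> return false       // malformed or unmappable bytes
            result.isOverflow -> output.clear()  // scratch buffer full; reuse it and continue
            else -> return true                  // underflow: all decodable input consumed
        }
    }
}
```

This only validates the chunk it is given, so a file whose first chunk is plain ASCII would pass for either encoding. Reading everything at once remains the simpler and safer choice for small files.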


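    import java.nio.ByteBuffer
    import java.nio.charset.CharacterCodingException
    import java.nio.charset.Charset

    // Read the whole file so a multi-byte character is never split across buffers.
    // readBytes() is the Kotlin stdlib extension; use {} closes the stream afterwards.
    val buffer: ByteArray = inputStream.use { it.readBytes() }
    var content: String? = null
    try {
        // A fresh decoder throws on malformed input instead of substituting U+FFFD
        content = Charsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(buffer)).toString()
    } catch (_: CharacterCodingException) {
        // not valid utf8, try the next one
    }
    // Only fall back when utf8 failed; otherwise a gb18030 decode that happens
    // to succeed would overwrite the correct utf8 result.
    if (content == null) {
        try {
            // Kotlin's Charsets object has no GB18030 constant, so look it up by name
            content = Charset.forName("GB18030").newDecoder().decode(ByteBuffer.wrap(buffer)).toString()
        } catch (_: CharacterCodingException) {
            // not valid gb18030 either
        }
    }
    if (content == null) {
        // Message: "Code table encoding error; make sure the encoding is utf-8 without a BOM!"
        return importFalse(context, "码表编码格式错误，请确定编码为 uft-8 不要有 BOM!")
    }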
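
The same idea can be packaged as a reusable function. This is only a sketch under my own assumptions: `DecodedText` and `decodeWithFallback` are hypothetical names, and the candidate list defaults to the two encodings discussed in this post. Order matters, because bytes that are valid utf8 can also happen to decode as gb18030, so utf8 is tried first.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset

// Hypothetical result type: the decoded text plus the charset that accepted it.
data class DecodedText(val text: String, val charset: Charset)

// Hypothetical helper: try each candidate charset strictly, in order,
// and return the first one that decodes the bytes without error.
fun decodeWithFallback(
    bytes: ByteArray,
    candidates: List<Charset> = listOf(Charsets.UTF_8, Charset.forName("GB18030"))
): DecodedText? {
    for (charset in candidates) {
        try {
            // wrap() creates a fresh buffer each time, so every attempt starts at position 0
            val text = charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString()
            return DecodedText(text, charset)
        } catch (e: CharacterCodingException) {
            // not this charset; fall through to the next candidate
        }
    }
    return null // none of the candidates accepted the bytes
}
```

A caller would pass the bytes of the whole file and treat a `null` result as the "wrong encoding" case.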

Original article written by LogStudio: R0uter's Blog » Kotlin/Android detect text encoding

When reposting, please keep the source and this link: https://www.logcg.com/archives/3868.html

Updated: November 2, 2024 at 4:54 PM

Tags: gb18030, utf8

About the Author

R0uter

Unless stated otherwise, the articles here are my own original work. When reposting, please include a link to this page and my name.
