Oct 25, 2018

Detecting and converting text file encodings with Python

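Back when I did embedded development at my previous company, I kept running into garbled comments caused by source files saved in the wrong encoding. Many people in the department liked GB2312, while I prefer UTF-8, so whenever I merged their code I would usually run a combined find and enca command over the project to convert the GB2312 .c and .h files to UTF-8. So can we write our own Python tool that detects and converts file encodings? Of course we can, with the help of the third-party library chardet!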

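Encoding operations

Before a file's encoding can be converted, we have to know which encoding the source file uses, so encoding detection comes first.

Encoding detection

Here the UniversalDetector class from the chardet library is used to detect a file's encoding. It has to be installed before use: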

$ pip3 install chardet

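For example, the encoding of the UTF-8 file README.md can be detected as follows. Along with the encoding, the detector reports a confidence value that indicates how reliable the result is: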

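from chardet.universaldetector import UniversalDetector

filepath = 'README.md'  # the example file mentioned above

detector = UniversalDetector()
detector.reset()
# Open in binary mode: the detector works on raw bytes, not decoded text
with open(filepath, 'rb') as f:
    for each in f:
        detector.feed(each)
        # Stop early once the detector has made up its mind
        if detector.done:
            break
# close() finalizes the guess and fills in detector.result
detector.close()
fileencoding = detector.result['encoding']
confidence = detector.result['confidence']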

Note that:

The file must be opened in binary read mode
fileencoding is the encoding and confidence is the confidence, in the range 0 to 1
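
Because the detector only returns a statistical guess, one inexpensive cross-check is to attempt a strict decode with the reported encoding. The sketch below is not part of the original tool: can_decode is a hypothetical helper name, and it uses only the standard library.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly with `encoding` in strict mode."""
    try:
        data.decode(encoding, errors='strict')
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are invalid in this encoding
        # LookupError: Python has no codec registered under this name
        return False
    return True


sample = '中文注释'.encode('gb2312')
print(can_decode(sample, 'gb2312'))  # the bytes were produced with this codec
print(can_decode(sample, 'utf-8'))   # these GB2312 bytes are not valid UTF-8
```

A successful strict decode does not prove the guess is right (single-byte codecs such as latin-1 accept any byte sequence), so this only helps to rule out clearly wrong candidates.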

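Encoding conversion

Converting a file's encoding only takes ordinary file reads and writes. Since open accepts an encoding, we can open the source file with its own encoding and read its content, then open the output file with the new encoding and write the content into it. That completes the conversion.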

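sourcefile = 'gb2312.txt'       # example paths, adjust as needed
targetfile = 'gb2312.utf8.txt'

# Read the whole source with its original encoding, then close it
f = open(sourcefile, 'r', encoding='gb2312', errors='replace')
text = f.read()
f.close()
# Write the text back out in the new encoding
f = open(targetfile, 'w', encoding='utf-8', errors='replace')
f.write(text)
f.close()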

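Where:

The errors parameter of open is set to replace, so a character that cannot be decoded or encoded is replaced with a marker (U+FFFD when reading, ? when writing) instead of raising an exception
Using with here can cause a small problem: when the output path is the same as the source path, the saved file came out wrong. This is presumably because opening a file for writing truncates it, so a source that is opened together with the target in one with statement is emptied before it has been read. That is why the source is read and closed first and with is not used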
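
As a sketch of how with could still be used for in-place conversion, the function below reads and closes the source in one with block before opening the target in a second one. The name convert_file, the demo file name, and the choice of newline='' (which keeps the original line endings) are my assumptions, not part of the original tool, and the whole file is assumed to fit in memory.

```python
import os
import tempfile


def convert_file(source, target, source_encoding, target_encoding='utf-8'):
    """Re-encode `source` into `target`; the two paths may be the same."""
    # Read everything and close the source before the target is opened,
    # so that opening the target for writing cannot truncate unread data.
    # newline='' leaves the original line endings untouched.
    with open(source, 'r', encoding=source_encoding,
              errors='replace', newline='') as f:
        text = f.read()
    with open(target, 'w', encoding=target_encoding,
              errors='replace', newline='') as f:
        f.write(text)


# Demo: convert a small GB2312 file in place inside a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.c')
    with open(path, 'wb') as f:
        f.write('// 中文注释\r\n'.encode('gb2312'))
    convert_file(path, path, 'gb2312')
    with open(path, 'rb') as f:
        converted = f.read()

print(converted.decode('utf-8'))

# errors='replace' inserts U+FFFD when decoding and writes '?' when encoding
print(repr(b'\xff'.decode('utf-8', errors='replace')))
print('中'.encode('ascii', errors='replace'))
```

Without newline='', text mode translates line endings on read and write, so a CRLF file converted on Linux would silently end up with LF line endings.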

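Implementing the tool

This Python script uses the standard-library modules argparse and os and the third-party library chardet.

Preparing the script

Create the file encoding: touch encoding
Make it executable: chmod +x encoding
Specify the script interpreter by adding this line to the file: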
```
#!/usr/bin/env python3
```

Import the required libraries by adding the following to the file:

import os
import argparse
from chardet.universaldetector import UniversalDetector

Adding the command-line argument parsing function

Wrap the command-line argument parsing in a function:

def getargs():
    """Parse the command-line arguments of the tool."""
    # description: "A tool for detecting or converting file encodings"
    parser = argparse.ArgumentParser(description='这是一个检测或转换文件编码的工具')
    # help: "paths of one or more files"
    parser.add_argument('files', nargs='+', help='指定一个或多个文件的路径')
    # nargs='?' with const: a bare -e means UTF-8; without -e the value is None
    # help: "target encoding, available encodings: ..."
    parser.add_argument('-e', '--encoding', nargs='?', const='UTF-8',
                        help='''指定目标编码，可选择的编码：
ASCII, (Default) UTF-8 (with or without a BOM), UTF-16 (with a BOM),
UTF-32 (with a BOM), Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312,
ISO-2022-CN, EUC-JP, SHIFT_JIS, ISO-2022-JP, ISO-2022-KR, KOI8-R,
MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, ISO-8859-2,
windows-1250, EUC-KR, ISO-8859-5, windows-1251, ISO-8859-1, windows-1252,
ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620''')
    # help: "output directory for the converted files"
    parser.add_argument('-o', '--outdir', help='指定转换文件的输出目录')
    return parser.parse_args()

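Argument	Value	Description
files	File paths	Paths of one or more files
-e, --encoding	Encoding name; `UTF-8` when the flag is given without a value	Target encoding; when the flag is omitted, the tool only detects
-o, --outdir	Directory path	Output directory for the converted files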
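
The combination of nargs='?' and const is easy to misread, so here is a small sketch of its three cases. build_parser is a reduced, hypothetical copy of the parser above that keeps only the two relevant arguments.

```python
import argparse


def build_parser():
    # Reduced copy of the tool's parser: only what the demo needs
    parser = argparse.ArgumentParser()
    parser.add_argument('files', nargs='+')
    parser.add_argument('-e', '--encoding', nargs='?', const='UTF-8')
    return parser


parser = build_parser()
print(parser.parse_args(['a.txt']).encoding)               # flag omitted
print(parser.parse_args(['a.txt', '-e']).encoding)         # flag without a value
print(parser.parse_args(['a.txt', '-e', 'gbk']).encoding)  # flag with a value
```

One consequence of this design: a bare -e placed before the file names makes argparse take the first file name as the encoding, so the flag is safer at the end of the command line.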

Wrapping the encoding detection function

For convenience, the code from the encoding detection section is wrapped in a function:

def detcect_encoding(filepath):
    """Detect the encoding of a file.
    Args:
        filepath: path of the file
    Return:
        fileencoding: encoding of the file, 'unknown' if it is not recognized
        confidence: confidence of the result, as a percentage
    """
    detector = UniversalDetector()
    detector.reset()
    # Binary mode: the detector needs the raw bytes
    with open(filepath, 'rb') as f:
        for each in f:
            detector.feed(each)
            if detector.done:
                break
    detector.close()
    fileencoding = detector.result['encoding']
    confidence = detector.result['confidence']
    if fileencoding is None:
        # Not recognized: return a placeholder name that the caller can test for
        fileencoding = 'unknown'
        confidence = 0.99
    return fileencoding, confidence * 100

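When a UniversalDetector instance finishes detection without recognizing the encoding, detector.result['encoding'] is None, so that case gets special handling here. The confidence returned at the end is expressed as a percentage.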
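
Iterating over a binary file splits it at newline bytes, so a file without any would be fed to the detector as one single line. A possible variation, sketched here under my own assumptions, reads fixed-size chunks instead. The name detect_in_chunks, the chunk_size default, and the optional detector parameter are hypothetical and not part of the original tool.

```python
try:
    from chardet.universaldetector import UniversalDetector
except ImportError:  # chardet is third-party and may not be installed
    UniversalDetector = None


def detect_in_chunks(filepath, chunk_size=4096, detector=None):
    """Feed a file to a detector in fixed-size chunks.

    `detector` may be any object with feed(), done, close() and result,
    which is the interface UniversalDetector exposes.
    """
    if detector is None:
        if UniversalDetector is None:
            raise RuntimeError('chardet is required: pip3 install chardet')
        detector = UniversalDetector()
    with open(filepath, 'rb') as f:
        # iter(callable, sentinel) stops when read() returns b'' at end of file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding'], detector.result['confidence']
```

Called as detect_in_chunks('README.md'), it returns the same encoding and confidence pair as the snippet in the encoding detection section, with the confidence still in the range 0 to 1.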

Final workflow

The whole feature is implemented under __main__ as follows:

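if __name__ == '__main__':
    args = getargs()
    if args.outdir:
        if not os.path.exists(args.outdir):
            # Prompt: "Invalid output path. Replace the source files with
            # the converted files directly? y or n"
            answer = input(f'[-] 无效的导出路径: {args.outdir} [-]\n要用转码后的文件直接替换源文件吗? y or n\n')
            if 'y' in answer:
                args.outdir = None
            else:
                raise SystemExit(1)
    for file in args.files:
        if not os.path.isfile(file):
            # Message: "Invalid file path"
            print(f'[-] 无效的文件路径: {file} [-]')
            continue
        encoding, confidence = detcect_encoding(file)
        # Message: "<file>: encoding -> <encoding> (confidence <n>%)"
        print(f'[+] {file}: 编码 -> {encoding} (置信度 {confidence}%) [+]')
        # confidence is a percentage here, so the threshold is 75, not 0.75
        if args.encoding and (encoding != 'unknown') and (confidence > 75):
            # Encoding names are case-insensitive ('UTF-8' vs 'utf-8')
            if args.encoding.lower() == encoding.lower():
                # Message: "<file> is already <encoding>, nothing to convert"
                print(f'[*] {file} 已经是 {encoding} 编码了，无需转换！[*]')
                continue
            f = open(file, 'r', encoding=encoding, errors='replace')
            text = f.read()
            f.close()
            # Keep only the file name so that the result lands inside outdir
            outpath = os.path.join(args.outdir, os.path.basename(file)) if args.outdir else file
            f = open(outpath, 'w', encoding=args.encoding, errors='replace')
            f.write(text)
            f.close()
            # Message: "Converted: <file>(<old>) -> <outpath>(<new>)"
            print(f'[+] 转码成功: {file}({encoding}) -> {outpath}({args.encoding}) [+]')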

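Get the command-line arguments
If an output directory was given on the command line, check whether it is valid
Iterate over the specified files
For each file, first check that the path is valid, then get the detection result from the encoding detection function defined above
If a target encoding was given, the detected encoding is not unknown, and the confidence is above 75%, convert the file
Before converting, check whether the conversion is needed at all
If it is, convert the encoding as described in the encoding conversion section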
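
The decision steps above can also be written as small pure functions, which makes them easy to test without touching any files. This is only a sketch: same_encoding, should_convert and output_path are hypothetical helper names, and comparing names through codecs.lookup is my substitution for a plain string comparison, so that aliases such as UTF-8 and utf8 count as equal.

```python
import codecs
import os


def same_encoding(a, b):
    """Compare two encoding names through Python's codec registry."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to Python: fall back to a case-insensitive comparison
        return a.lower() == b.lower()


def should_convert(target, detected, confidence_percent, threshold=75):
    """Mirror the conditions from the step list: target given, encoding
    known, confidence above the threshold, and conversion actually needed."""
    if not target or detected == 'unknown':
        return False
    if confidence_percent <= threshold:
        return False
    return not same_encoding(target, detected)


def output_path(outdir, file):
    """Put the result in outdir under the source's base name, else in place."""
    return os.path.join(outdir, os.path.basename(file)) if outdir else file


print(should_convert('utf-8', 'GB2312', 99.0))
print(should_convert('UTF-8', 'utf-8', 99.0))
print(output_path('out', 'src/main.c'))
```

Using only the base name means two source files with the same name in different directories would overwrite each other in the output directory, which is a trade-off to keep in mind.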

The complete tool code is available at: tools-with-script/encoding/encoding

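Testing

With the tool implemented, it has to be tested. The files gb2312.txt and big5.txt are used here. The tool prints its messages in Chinese: 编码 means encoding, 置信度 means confidence, and 转码成功 means the conversion succeeded: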

$ ./encoding -h
usage: encoding [-h] [-e [ENCODING]] [-o OUTDIR] files [files ...]

这是一个检测或转换文件编码的工具

positional arguments:
  files                 指定一个或多个文件的路径

optional arguments:
  -h, --help            show this help message and exit
  -e [ENCODING], --encoding [ENCODING]
                        指定目标编码，可选择的编码： ASCII, (Default) UTF-8 (with or without
                        a BOM), UTF-16 (with a BOM), UTF-32 (with a BOM),
                        Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, ISO-2022-CN,
                        EUC-JP, SHIFT_JIS, ISO-2022-JP, ISO-2022-KR, KOI8-R,
                        MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251,
                        ISO-8859-2, windows-1250, EUC-KR, ISO-8859-5,
                        windows-1251, ISO-8859-1, windows-1252, ISO-8859-7,
                        windows-1253, ISO-8859-8, windows-1255, TIS-620
  -o OUTDIR, --outdir OUTDIR
                        指定转换文件的输出目录
$ ./encoding big5.txt gb2312.txt
[+] big5.txt: 编码 -> Big5 (置信度 99.0%) [+]
[+] gb2312.txt: 编码 -> GB2312 (置信度 99.0%) [+]
$ ./encoding big5.txt gb2312.txt -e utf-8
[+] big5.txt: 编码 -> Big5 (置信度 99.0%) [+]
[+] 转码成功: big5.txt(Big5) -> big5.txt(utf-8) [+]
[+] gb2312.txt: 编码 -> GB2312 (置信度 99.0%) [+]
[+] 转码成功: gb2312.txt(GB2312) -> gb2312.txt(utf-8) [+]
$ ./encoding big5.txt gb2312.txt
[+] big5.txt: 编码 -> utf-8 (置信度 99.0%) [+]
[+] gb2312.txt: 编码 -> utf-8 (置信度 99.0%) [+]
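
The two test files are not included in the post, so to repeat the test something similar has to be created first. The sketch below writes two stand-in files. Their content is made up for illustration, and with text this short chardet may well report different encodings or confidence values than the output above.

```python
import os
import tempfile

# File name -> (made-up sample text, encoding to save it in)
SAMPLES = {
    'gb2312.txt': ('中文编码测试\n', 'gb2312'),  # Simplified Chinese
    'big5.txt': ('中文編碼測試\n', 'big5'),      # Traditional Chinese
}


def make_samples(directory):
    """Write the sample files into `directory` and return their paths."""
    paths = []
    for name, (text, encoding) in SAMPLES.items():
        path = os.path.join(directory, name)
        with open(path, 'w', encoding=encoding, newline='') as f:
            f.write(text)
        paths.append(path)
    return paths


# Demo: create the files in a temporary directory and show their raw bytes
with tempfile.TemporaryDirectory() as tmpdir:
    for path in make_samples(tmpdir):
        with open(path, 'rb') as f:
            print(os.path.basename(path), f.read())
```

Calling make_samples('.') instead writes both files into the current directory, next to the encoding script.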