用Python处理xml文件中的非法字符

用xml.dom.minidom.parse()解析xml文件时遇到非法字符直接报错的问题
最后的方案是把纯文本方式读入文件，然后用字符串来处理
可以得到将非法字符全部剔除的结果

[python]
#!/usr/bin/python
# -*- coding:utf-8 -*-

import string
import xml.dom.minidom

def parse_xml(file_path):
“””
Handle xml file with invalid character
[input] : path of the xml file
[output] : xml.dom.minidom.Document instance
“””
try:
xmldoc = xml.dom.minidom.parse(file_path)
except:
f = file(file_path)
s = f.read()
f.close()

ss = s.translate(None, string.printable)
s = s.translate(None, ss)

xmldoc = xml.dom.minidom.parseString(s)
return xmldoc

if __name__ == ‘__main__’:
pass
[/python]

P.S. 如果有更好的解决方案，欢迎交流

October 13, 2011

Xuan Hu

Just about Code

Python, XML

5 responses to “用Python处理xml文件中的非法字符”

lyxint says:

10/13/2011 at 19:28

我以前写过一个中文数字,全角数字转换为阿拉伯数字的脚本, 用法类似

http://lyxint.com/archives/58

Reply
crifan says:

12/19/2011 at 19:40

之前用过str_xml_safe = saxutils.escape(str_to_process)，处理后，就可以输出为xml了。也许你用得上这个。

Reply
Shane.Hu says:

12/20/2011 at 11:56

@crifan 我试了一下并且看了官方的doc，发现escape只能排除特定的字符，但是我遇到的情况是从PDF文档中解析内容时遇到的非法字符，是不可见的字符，所以貌似不太好派上用场，不过还是很感谢你提供的信息

Reply
Yachen Liu says:

12/27/2011 at 17:58

XML的非法字符定义不是这样的吧，官方定义是：
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

http://www.w3.org/TR/REC-xml/#charsets

Reply

Classical · Cycling · Code

用Python处理xml文件中的非法字符

5 responses to “用Python处理xml文件中的非法字符”

Leave a Reply to Yachen Liu Cancel reply