Python Exercise 3: Parsing XML

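This exercise parses XML with Python. The parsing is driven by the start-tag and end-tag methods of the SAX ContentHandler: in each start (or end) event a dispatcher decides which handler applies and forwards the event to that method. The processing flow is as follows: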

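On initialization, create a directory list
On a page tag, create a new html file in the current directory and mark the tags that follow as ones to be handled by the default handlers and written into the html page
On tags inside the page, use the default handlers
On the page end tag, the html is complete: write the closing tags and close the stream
On a directory tag, append the name of the next-level directory to the directory list
On another page, repeat the three page steps above (open the file, pass the inner tags through, close the file)
Parsing is done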
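The flow above can be sketched without touching the file system. This is only a rough illustration: the XML in SAMPLE is made up for the example (it is not the real website.xml), and FlowRecorder and record_flow are hypothetical names that just record what would happen at each step.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Hypothetical input, invented for this sketch (not the post's website.xml).
SAMPLE = b"""<website>
  <page name="index" title="Home"><h1>Welcome</h1></page>
  <directory name="shouting">
    <page name="index" title="Shouting"><p>Hi</p></page>
  </directory>
</website>"""


class FlowRecorder(ContentHandler):
    """Records the flow as strings instead of writing HTML files."""

    def __init__(self, root):
        ContentHandler.__init__(self)
        self.directory = [root]   # the directory list created at initialization
        self.in_page = False      # plays the role of the passthrough flag
        self.events = []

    def startElement(self, name, attrs):
        if name == 'page':
            # A page opens a new html file under the current directory
            path = '/'.join(self.directory + [attrs['name'] + '.html'])
            self.events.append('open ' + path)
            self.in_page = True
        elif name == 'directory':
            # A directory pushes the next-level name onto the list
            self.directory.append(attrs['name'])
        elif self.in_page:
            # Any other tag inside a page is passed through
            self.events.append('pass <%s>' % name)

    def endElement(self, name):
        if name == 'page':
            self.in_page = False
            self.events.append('close')
        elif name == 'directory':
            self.directory.pop()
        elif self.in_page:
            self.events.append('pass </%s>' % name)


def record_flow(xml_bytes, root='public_html'):
    handler = FlowRecorder(root)
    parseString(xml_bytes, handler)
    return handler.events
```

Calling record_flow(SAMPLE) gives one "open" entry per page, with the second page landing under public_html/shouting because the directory tag was seen first.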

The code is as follows (Python 2):

#! /usr/bin/env python
# -*- coding=utf-8 -*-

import os
import sys
from xml.sax.handler import ContentHandler
from xml.sax import parse

class Dispatcher:
    '''
    Dispatch each XML tag to the handler method written for it.
    '''

    def dispatch(self, prefix, name, attrs=None):
        mname = prefix + name.capitalize()
        dname = 'default' + prefix.capitalize()
        method = getattr(self, mname, None)
        if callable(method):
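            # A handler exists for this tag: use it and start with an empty argument tuple
            args = ()
        else:
            # No handler for this tag: fall back to the default handler, which treats the tag as body content
            method = getattr(self, dname, None)
            # The default handler needs the tag name
            args = name,
        # A start handler also needs the tag's attributes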
        if prefix == 'start':
            args += attrs,
        if callable(method):
            method(*args)

    # Overrides ContentHandler's startElement method
    def startElement(self, name, attrs):
        self.dispatch('start', name, attrs)

    # Overrides ContentHandler's endElement method
    def endElement(self, name):
        self.dispatch('end', name)

class WebsiteConstructor(Dispatcher, ContentHandler):
    '''
    Parse website.xml and build the html pages.
    '''

    # Whether the current tag is body content, i.e. wrapped in a page and to be passed through
    passthrough = False

    def __init__(self, directory):
        self.directory = [directory]
        self.ensureDirectory()

    def ensureDirectory(self):
        path = os.path.join(*self.directory)
        if not os.path.isdir(path):
            os.makedirs(path)

    def characters(self, chars):
        if self.passthrough:
            self.out.write(chars)

    def defaultStart(self, name, attrs):
        if self.passthrough:
            self.out.write('<' + name)
            # Debug output
            print '-----'
            print attrs
            for key, val in attrs.items():
                # Quote the value so the generated attribute is well-formed HTML
                self.out.write(' %s="%s"' % (key, val))
                print key,val
            self.out.write('>')

    def defaultEnd(self, name):
        if self.passthrough:
            self.out.write('</%s>' % name)

    def startDirectory(self, attrs):
        self.directory.append(attrs['name'])
        self.ensureDirectory()

    def endDirectory(self):
        self.directory.pop()


    def startPage(self, attrs):
        filename = os.path.join(*self.directory + [attrs['name'] + '.html'])
        self.out = open(filename, 'w')
        print os.path
        print filename
        print self.directory
        print self.directory + [attrs['name'] + '.html']
        self.writeHeader(attrs['title'])
        self.passthrough = True

    def endPage(self):
        self.passthrough = False
        self.writeFooter()
        self.out.close()


    def writeHeader(self, title):
        self.out.write('<html>\n  <head>\n    <title>')
        self.out.write(title)
        self.out.write('</title>\n  </head>\n  <body>\n')

    def writeFooter(self):
        self.out.write('\n  </body>\n</html>\n')



# Run the program
parse('website.xml', WebsiteConstructor('public_html'))
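To see which method the dispatcher picks for a given tag, here is a small sketch that uses the same lookup rule as dispatch but returns the choice instead of calling it. NameDispatcher and resolve are hypothetical names for this illustration only, and plain dicts stand in for the attrs object.

```python
class NameDispatcher(object):
    """Same lookup rule as Dispatcher.dispatch, but returns the choice instead of calling it."""

    def resolve(self, prefix, name, attrs=None):
        mname = prefix + name.capitalize()        # e.g. 'startPage'
        dname = 'default' + prefix.capitalize()   # e.g. 'defaultStart'
        method = getattr(self, mname, None)
        if callable(method):
            # A tag-specific handler exists: it takes no tag name
            chosen, args = mname, ()
        else:
            # Fall back to the default handler, which needs the tag name
            chosen, args = dname, (name,)
        if prefix == 'start':
            # Start handlers also receive the attributes
            args += (attrs,)
        return chosen, args

    # Only the existence of this method matters for the lookup.
    def startPage(self, attrs):
        pass


d = NameDispatcher()
page_choice = d.resolve('start', 'page', {'name': 'index'})   # ('startPage', ({'name': 'index'},))
h1_start = d.resolve('start', 'h1', {})                       # ('defaultStart', ('h1', {}))
h1_end = d.resolve('end', 'h1')                               # ('defaultEnd', ('h1',))
```

Since this sketch defines only startPage, a page start is routed to it, while every other tag falls through to the default names.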

After typing in the code from the book, the line "args += attrs" kept raising an error. While debugging I saw:

TypeError: coercing to Unicode: need string or buffer, instance found

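The "+=" operator needed a string or buffer on its right-hand side, but it actually got an instance. In this statement attrs is an AttributesImpl object, and the earlier "args = name" had bound args to name, so args was a string and attrs was the wrong type to add to it. At first I suspected the ContentHandler API had changed (the book is fairly old), but the help documentation in the ipython interactive shell showed nothing wrong. After a long debugging session I compared my code with the book's source and found that two "," characters were missing. By then I was going crazy...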

args = name,
args += attrs,

Those two commas were what I had left out. Crushing. It shows there are still gaps in my Python basics:

args = name # args is just the object itself, not a tuple
args = name, # args is a one-element tuple

A bloodbath caused by a single comma...
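A small sketch of why the comma matters once the arguments are unpacked with method(*args). The show helper is a made-up stand-in for the handler method, not part of the original code.

```python
name = 'page'

not_a_tuple = name      # just another reference to the string
one_tuple = name,       # the trailing comma builds a one-element tuple
also_tuple = (name,)    # same thing, the parentheses are only for readability
still_str = (name)      # parentheses alone do not make a tuple


def show(*args):
    """Hypothetical stand-in for a handler: returns the arguments it received."""
    return args


unpacked_str = show(*not_a_tuple)    # the string is unpacked character by character
unpacked_tuple = show(*one_tuple)    # the handler receives the tag name as one argument
```

Without the comma the handler would get 'p', 'a', 'g', 'e' as four separate arguments instead of the single tag name.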

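The other lesson is about "+=": for strings and tuples the operands on both sides must be of the same type (a list is more lenient and accepts any iterable on the right).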
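The "+=" behaviour can be checked with a short sketch. try_iadd is a hypothetical helper written for this example, and AttributesImpl is built by hand here rather than coming from the parser. The exact wording of the TypeError differs between Python versions, so the sketch only checks that one is raised.

```python
from xml.sax.xmlreader import AttributesImpl

# Built by hand for the sketch; during parsing the SAX parser supplies this object.
attrs = AttributesImpl({'name': 'index'})


def try_iadd(left, right):
    """Hypothetical helper: apply += and report a TypeError instead of raising it."""
    try:
        left += right
    except TypeError:
        return 'TypeError'
    return left


str_plus_attrs = try_iadd('page', attrs)          # my bug: args = name, then args += attrs
tuple_plus_attrs = try_iadd(('page',), attrs)     # only one of the two commas restored
tuple_plus_tuple = try_iadd(('page',), (attrs,))  # the book's version: both commas present
list_plus_tuple = try_iadd(['page'], (attrs,))    # a list accepts any iterable on the right
```

Only the last two succeed, which matches the rule above: a tuple can be extended only by another tuple, so both commas are needed.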

Full code:

http://pan.baidu.com/s/1pLqxLKv