What Is XML

This post draws mainly on Wikipedia, with some of my own experience mixed in.

"XML (Extensible Markup Language) is a set of rules for encoding documents electronically." Or: "XML is a markup language for documents containing structured information."

Here the word "document" refers not only to traditional documents, like this one, but also to the myriad of other XML "data formats". These include vector graphics, e-commerce transactions, mathematical equations, object meta-data, server APIs, and a thousand other kinds of structured information. Basically any structured document (in the broad sense used here) can be described in XML. That is a big deal, because it gives everyone one unified document format, much as Unicode did for characters.

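In practice, the character encoding of XML is a real headache, especially when you handle XML documents from different sources: you have to determine each document's encoding carefully, or you end up with garbled text. We usually process XML with a parser, which is covered in more detail later.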
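
To make the encoding point concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The two sample documents are made up for illustration. The idea is to hand the parser raw bytes so that it can honor the encoding named in the XML declaration, rather than decoding the bytes yourself with a guessed codec.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the same document serialized in two encodings,
# each naming its own encoding in the XML declaration.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><name>café</name>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><name>café</name>'.encode("utf-8")

# Passing bytes lets the parser apply the declared encoding itself.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Decoding by hand with the wrong codec is how garbled text appears.
garbled = utf8_doc.decode("iso-8859-1")

print(latin1_text == utf8_text)
print(garbled)
```

This only helps when the declaration is truthful. A document that declares one encoding but was saved in another still has to be detected and fixed by hand.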

Some basic concepts:

Markup and Content
The characters which make up an XML document are divided into markup and content. Markup and content may be distinguished by the application of simple syntactic rules. All strings which constitute markup either begin with the character "<" and end with a ">", or begin with the character "&" and end with a ";". Strings of characters which are not markup are content.

Tag
A markup construct that begins with "<" and ends with ">". Tags come in three flavors: start-tags, for example <section>, end-tags, for example </section>, and empty-element tags, for example <line-break/>.

Element
A logical component of a document which either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting>. Another is <line-break/>.

Attribute
A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In this example, the name of the attribute is "number" and the value is "3": <step number="3">Connect A to B.</step> This element has two attributes, src and alt: <img src="madonna.jpg" alt='by Raphael'/> An element must not have two attributes with the same name.
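
The concepts above can be seen from the parser's side with a short sketch. It uses Python's standard xml.etree.ElementTree and reuses the sample elements from the definitions. The entity example and the duplicate-attribute example are my own additions for illustration.

```python
import xml.etree.ElementTree as ET

# An element with a start-tag, content, and an end-tag.
step = ET.fromstring('<step number="3">Connect A to B.</step>')
print(step.tag)     # the element name taken from the tags
print(step.attrib)  # the attributes as name/value pairs
print(step.text)    # the content between the start- and end-tags

# An empty-element tag with two attributes; either quote style is allowed.
img = ET.fromstring("<img src=\"madonna.jpg\" alt='by Raphael'/>")

# Markup that begins with "&" and ends with ";" is resolved into content.
escaped = ET.fromstring("<p>A &lt; B &amp; C</p>").text

# Two attributes with the same name are rejected by the parser.
try:
    ET.fromstring('<step number="3" number="4"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
```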

An XML document must first of all be well-formed, which means it satisfies at least the following:

It contains only properly-encoded legal Unicode characters.
None of the special syntax characters such as "<" and "&" appear except when performing their markup-delineation roles.
The begin, end, and empty-element tags which delimit the elements are correctly nested, with none missing and none overlapping.
The element tags are case-sensitive; the beginning and end tags must match exactly.
There is a single "root" element which contains all the other elements.
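
A rough way to test these rules is to let a parser try the document and see whether it complains. The sketch below assumes Python's standard xml.etree.ElementTree. The helper name is_well_formed and the sample strings are hypothetical, chosen to break one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples, each violating one of the rules listed above.
samples = {
    "ok": "<doc><a>1 &amp; 2</a><b/></doc>",
    "unescaped ampersand": "<doc>1 & 2</doc>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<Greeting>Hello</greeting>",
    "two roots": "<a/><b/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

This checks well-formedness only; it says nothing about whether the document conforms to any schema.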

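Beyond being well-formed, a document can also be valid: if the XML document references a schema, it must additionally be checked against that schema's definitions. Put simply, a schema is the definition of an XML document: which elements may appear, what types they have, how they relate to one another, and so on. DTD (Document Type Definition) used to be the main schema language; today it is mostly XSD (XML Schema). This topic deserves an article of its own, so search for it if you are interested.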

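Finally, a word about parsers, which are the key to processing XML documents. You can think of a parser as an API for working with XML. The main kinds are: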

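Simple API for XML (SAX): stream-oriented and event-driven, so it is fast and uses little memory. The drawback is that you cannot access XML elements randomly.
Pull Parser: treats the document as a series of items which are read in sequence using the Iterator design pattern. If you use PHP you certainly know SimpleXML, although strictly speaking SimpleXML builds the whole document in memory; PHP's pull parser is XMLReader.
Document Object Model (DOM): loads the entire XML document into memory as a tree, which makes random access to any element very convenient. The drawback, of course, is that a large document needs a lot of memory. (Don't underestimate this: I once processed an XML document of over 1 GB on a machine with only 2 GB of RAM.)
Data binding: XML data is made available as a hierarchy of custom, strongly typed classes, in contrast to the generic objects created by a Document Object Model parser. I have not used this approach, so I cannot say much about it.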
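
To compare the first three styles side by side, here is a small sketch that reads the same document with each of them, using only Python's standard library. The sample document and the StepCollector class are hypothetical, and ET.iterparse stands in for a pull-style parser here since it hands out events through an iterator.

```python
import io
import xml.dom.minidom
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical sample document.
DOC = (b"<steps><step number='1'>Open the box.</step>"
       b"<step number='2'>Connect A to B.</step></steps>")


# SAX: the parser pushes events to callbacks; no tree is kept in memory.
class StepCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.numbers = []

    def startElement(self, name, attrs):
        if name == "step":
            self.numbers.append(attrs["number"])


collector = StepCollector()
xml.sax.parseString(DOC, collector)
sax_numbers = collector.numbers

# Pull style: our own loop asks for the next item, iterator fashion.
pull_numbers = []
for event, elem in ET.iterparse(io.BytesIO(DOC), events=("end",)):
    if elem.tag == "step":
        pull_numbers.append(elem.get("number"))
        elem.clear()  # drop the element's content once it has been read

# DOM: the whole tree is loaded, so any node can be reached at any time.
dom = xml.dom.minidom.parseString(DOC)
steps = dom.getElementsByTagName("step")
second_text = steps[1].firstChild.data
dom_numbers = [s.getAttribute("number") for s in steps]

print(sax_numbers, pull_numbers, dom_numbers, second_text)
```

For a very large file, the SAX and pull versions keep memory use low, while the DOM version has to hold the whole tree at once.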

The current XML version is 1.0 (Fifth Edition); see reference 【5】.

References:

【1】A Technical Introduction to XML: http://www.xml.com/pub/a/98/10/guide0.html

【2】XML Namespaces by Example: http://www.xml.com/pub/a/1999/01/namespaces.html

【3】Wikipedia, XML: http://en.wikipedia.org/wiki/XML

【4】The Annotated XML Specification: http://www.xml.com/pub/a/axml/axmlintro.html

【5】Extensible Markup Language (XML) 1.0 (Fifth Edition): http://www.w3.org/TR/REC-xml/