SimpleXML processing in PHP

PHP version 5 introduced SimpleXML, a new application programming interface (API) for reading and writing XML. In SimpleXML, an expression such as this:

$rss->channel->item->title

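selects elements from the document. Such expressions are easy to write as long as you know the document's structure. But when you don't know exactly where the elements you want appear (as in DocBook, HTML, and similar narrative documents), SimpleXML can use XPath expressions to find them.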

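Getting started with SimpleXML

Suppose you need a PHP page that converts an RSS feed into HTML. RSS is a simple XML format for publishing syndicated content. The document's root element is rss, which contains a single channel element. The channel element contains metadata about the feed, such as its title, language, and URL. It also contains the stories, each wrapped in an item element. Each item has a link element containing a URL, plus a title or a description (usually both) containing plain text. Namespaces are not used. There is more to RSS than this, but this is all you need to know for this article. Listing 1 shows a typical example with two news items.

Listing 1. An RSS feed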

 
<?xml version="1.0" encoding="UTF-8"?>
<rss version="0.92">
<channel>
  <title>Mokka mit Schlag</title>
  <link>http://www.elharo.com/blog</link>
  <language>en</language>
  <item>
    <title>Penn Station: Gone but not Forgotten</title>
    <description>
     The old Penn Station in New York was torn down before I was born. 
     Looking at these pictures, that feels like a mistake.  The current site is 
     functional, but no more; really just some office towers and underground 
     corridors of no particular interest or beauty. The new Madison Square...
    </description>
    <link>http://www.elharo.com/blog/new-york/2006/07/31/penn-station</link>
  </item>
  <item>
    <title>Personal for Elliotte Harold</title>
    <description>Some people use very obnoxious spam filters that require you 
     to type some random string in your subject such as E37T to get through. 
     Needless to say neither I nor most other people bother to communicate with 
     these paranoids. They are grossly overreacting to the spam problem. 
     Personally I won't ...</description>

    <link>http://www.elharo.com/blog/tech/2006/07/28/personal-for-elliotte-harold/</link>
  </item>
</channel>
</rss>

Let's develop a PHP page that formats an RSS feed as HTML. Listing 2 shows the basic structure of the page.

Listing 2. Static structure of the PHP code

 
<?php // Load and parse the XML document ?>
<html xml:lang="en" lang="en">
<head>
  <title><?php // The title will be read from the RSS ?></title>
</head>
<body>

<h1><?php // The title will be read from the RSS again ?></h1>

<?php
// Here we'll put a loop to include each item's title and description
?>

</body>
</html>

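Parsing the XML document

The first step is to parse the XML document and store it in a variable. This takes one line of code, which passes a URL to the simplexml_load_file() function: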

$rss =  simplexml_load_file('http://partners.userland.com/nytRss/nytHomepage.xml');

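Warning

The approach chosen here is by no means the best one. You really shouldn't load and parse the RSS feed on every page hit. That is too slow for the page's readers, and it can amount to a denial-of-service attack on the server hosting the feed; most RSS feeds specify a reasonable maximum number of refreshes per hour. A real solution would cache the generated HTML page, the RSS feed, or both. However, the focus here is on the SimpleXML library, so caching is not considered further.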
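
As a rough illustration of the caching idea, the following sketch keeps a copy of the feed in a local file and refetches it only when the copy is older than a given age. The helper name cached_feed(), its parameters, the cache file location, and the example.com URL are all assumptions made for this sketch, not part of SimpleXML. The fetcher is passed in as a callable so that the example runs without a network connection.

```php
<?php
// Hypothetical helper: return the feed body from a local cache file,
// refetching only when the cached copy is older than $maxAge seconds.
function cached_feed(string $url, string $cacheFile, int $maxAge, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);   // cache hit: no network request
    }
    $body = $fetch($url);                       // cache miss: fetch once
    file_put_contents($cacheFile, $body, LOCK_EX);
    return $body;
}

// Stand-in fetcher so the sketch runs offline; it counts how often it is called.
$calls = 0;
$fetch = function (string $url) use (&$calls): string {
    $calls++;
    return '<rss version="0.92"><channel><title>Demo</title></channel></rss>';
};
$cacheFile = sys_get_temp_dir() . '/feed-cache-' . uniqid() . '.xml';

$first  = cached_feed('http://example.com/feed.xml', $cacheFile, 3600, $fetch);
$second = cached_feed('http://example.com/feed.xml', $cacheFile, 3600, $fetch);

$rss = simplexml_load_string($second);
echo $rss->channel->title, " (fetched $calls time(s))\n";
unlink($cacheFile);                             // clean up the demo cache file
```

On a real page the callable would probably wrap file_get_contents() or a similar HTTP call, and the maximum age would follow whatever refresh limit the feed publishes.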

For this example, I populated the page from Userland's New York Times feed at http://partners.userland.com/nytRss/nytHomepage.xml. Of course, you can use the URL of any other RSS feed instead.

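Note that although the function is named simplexml_load_file(), here it actually parses an XML document at a remote HTTP URL. That is not the only surprising thing about this function. The return value (stored here in the $rss variable) does not point to the whole document, as you might expect if you have used other APIs such as the Document Object Model (DOM). Instead, it points to the document's root element. Content in the document's prolog and epilog is not accessible from SimpleXML.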
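
The following sketch shows what "points to the root element" means in practice. It parses an inline document with simplexml_load_string(), used here instead of simplexml_load_file() only so the example runs without a network connection, and then inspects the returned object.

```php
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!-- a comment in the prolog -->
<rss version="0.92">
  <channel>
    <title>Mokka mit Schlag</title>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);   // parse from a string instead of a URL

echo $rss->getName(), "\n";           // name of the element $rss points to
echo $rss['version'], "\n";           // attributes of that element use array syntax
echo $rss->channel->title, "\n";      // paths therefore start below rss, not at it
```

Because the object already represents rss, a path such as $rss->channel begins with the root's children.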

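Finding the feed title

The title of the whole feed (as opposed to the titles of the individual stories in the feed) is in the title child of the channel child of the rss root element. You can find this title as though the XML document were the serialized form of an object of class rss whose channel field itself has a title field. Using regular PHP object-reference syntax, this statement finds the title: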

$title =  $rss->channel->title;

Once you've found it, you can add it to the output HTML. This is easy: just echo the $title variable:

<title><?php echo $title; ?></title>

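This line outputs the string value of the element, not the whole element. That is, the text content is written but the tags are not.

You can even skip the intermediate $title variable entirely: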

<title><?php echo $rss->channel->title; ?></title>

Because the page reuses this value in several places, I find it more convenient to store it in a meaningfully named variable.
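
One detail worth seeing in code is that the stored value is still a SimpleXMLElement object; echo merely converts it to its string value. The sketch below makes the conversion explicit with a (string) cast. The sample title and the use of htmlspecialchars() are additions for illustration: the parser decodes entity references such as &amp;, so re-escaping before writing HTML is a sensible precaution.

```php
<?php
$rss = simplexml_load_string(
    '<rss version="0.92"><channel><title>Q&amp;A: Tom &lt;Jr.&gt;</title></channel></rss>'
);

$element = $rss->channel->title;          // still a SimpleXMLElement object
$title   = (string) $element;             // plain PHP string: the decoded text content

var_dump($element instanceof SimpleXMLElement);  // bool(true)
var_dump($title === 'Q&A: Tom <Jr.>');           // bool(true)

// The parser has already decoded &amp; and &lt;, so escape again for HTML output.
echo '<title>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</title>\n";
```

The cast also matters when you compare values or use them as array keys, where an object would not behave like a string.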

Iterating through the items

Next, you need to find the items in the feed. The expression that does this is simple:

$rss->channel->item

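However, a feed usually contains more than one item, and it might contain none at all. Consequently, this expression returns an array-like list that you can traverse with a foreach loop: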

foreach ($rss->channel->item as $item) {
  echo "<h2>" . $item->title . "</h2>";
  echo "<p>" . $item->description . "</p>";
}
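
To check how the loop behaves at the extremes, this sketch runs the same iteration over a feed with no items and a feed with two. The function name list_titles() and the sample documents are made up for the illustration.

```php
<?php
// Hypothetical helper: collect the item titles of an RSS document as plain strings.
function list_titles(string $xml): array
{
    $rss = simplexml_load_string($xml);
    $titles = [];
    foreach ($rss->channel->item as $item) {   // zero iterations if there are no items
        $titles[] = (string) $item->title;
    }
    return $titles;
}

$empty = '<rss><channel><title>Empty</title></channel></rss>';
$two   = '<rss><channel><item><title>A</title></item>'
       . '<item><title>B</title></item></channel></rss>';

print_r(list_titles($empty));                  // an empty array
print_r(list_titles($two));                    // the titles A and B
echo count(simplexml_load_string($two)->channel->item), "\n";  // number of item elements
```

The same loop covers the empty, single-item, and multi-item cases, so no special-casing is needed.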

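Adding links by reading the link element's value from the RSS feed is just as easy. Simply output an a element from PHP and use $item->link to retrieve the URL. Listing 3 adds this element and fills in the skeleton from Listing 2.

Listing 3. A simple but complete PHP RSS reader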

 
<?php // Load and parse the XML document 
$rss =  simplexml_load_file('http://partners.userland.com/nytRss/nytHomepage.xml');
$title =  $rss->channel->title;
?>
<html xml:lang="en" lang="en">
<head>
  <title><?php echo $title; ?></title>
</head>
<body>

<h1><?php echo $title; ?></h1>

<?php
// Here we'll put a loop to include each item's title and description
foreach ($rss->channel->item as $item) {
  echo "<h2><a href='" . $item->link . "'>" . $item->title . "</a></h2>";
  echo "<p>" . $item->description . "</p>";
}
?>

</body>
</html>

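That completes a simple RSS reader in PHP: a few lines of HTML and a few lines of PHP. Not counting white space, it comes to only about 20 lines. Of course, this implementation is not very feature-rich, optimized, or robust. Let's see what else can be done.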

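Error handling

Not all RSS feeds are as well-formed as you might hope. The XML specification requires processors to stop processing a document as soon as they detect a well-formedness error, and SimpleXML is a conforming XML processor. However, it doesn't give you much help when it finds an error. Generally, it logs the error in the php-errors file (without a detailed error message), and the simplexml_load_file() function returns FALSE. If you aren't certain that the file you're parsing is well-formed, check for errors before using the file's data, as shown in Listing 4.

Listing 4. Guarding against malformed input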

 
<?php
$rss =  simplexml_load_file('http://www.cafeaulait.org/today.rss');
// Compare with false explicitly: an element with no content can also evaluate to false.
if ($rss !== false) {
  foreach ($rss->xpath('//title') as $title) {
    echo "<h2>" . $title . "</h2>";
  }
}
else {
  echo "Oops! The input is malformed!";
}
?>
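
If you want the detailed messages that the default behavior leaves out, PHP's libxml functions can collect them. The sketch below is one way to do that, using a deliberately malformed inline document; the message format built with sprintf() is an arbitrary choice for the example.

```php
<?php
libxml_use_internal_errors(true);          // collect parse errors instead of emitting warnings

$bad = '<rss><channel><title>Unclosed</channel></rss>';   // title is never closed
$rss = simplexml_load_string($bad);

$messages = [];
if ($rss === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError object carries the line number and the parser's message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();                 // empty the buffer before the next parse
}
print_r($messages);
```

The collected messages can then be logged or shown to whoever maintains the feed, rather than the generic "malformed" notice.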

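The other common error is that the document is indeed well-formed but doesn't contain the elements you expect in the places you expect them. What happens to an expression such as $rss->channel->item->title if an item has no title (as in a top-100 RSS feed, for instance)? The simplest approach is to always treat the return value as an array and loop over it. That way, you cope with both more and fewer elements than you expected. However, if you're sure you want only the first such element in the document, even if there are more, you can ask for it by index, counting from zero. For example, to request the title of the first item, you could use:

$rss->channel->item[0]->title[0]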

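If there is no first item, or if the first item has no title, this is treated like an ordinary out-of-bounds index into a PHP array: the result is empty, and it is converted to an empty string when you insert it into the output HTML. Recent PHP versions may also emit a warning when the chain passes through a missing element.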
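
When a missing element should produce a fallback rather than an empty string, isset() can test the whole chain first. This is a small sketch under that assumption; the "(untitled)" fallback text and the sample feed are invented for the example.

```php
<?php
$rss = simplexml_load_string(
    '<rss><channel><item><link>http://example.com/1</link></item></channel></rss>'
);

// isset() walks the chain without raising errors when a step is missing.
if (isset($rss->channel->item[0]->title)) {
    $headline = (string) $rss->channel->item[0]->title;
} else {
    $headline = '(untitled)';              // hypothetical fallback text
}
echo $headline, "\n";

var_dump(isset($rss->channel->item[0]->link));   // bool(true): this element exists
```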

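Recognizing and rejecting unexpected formats that you aren't prepared to handle is normally the province of a validating XML parser. However, SimpleXML cannot validate against a document type definition (DTD) or a schema. It checks only for well-formedness.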

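Handling namespaces

Many sites are now moving from RSS to Atom. Listing 5 shows an example of an Atom document. It is similar to the RSS example in most respects. However, there is more metadata, and the root element is feed instead of rss. The feed element contains entry elements instead of items, and the content element replaces the description element. Most significantly, the Atom document uses a namespace, whereas the RSS document does not. This lets the Atom document embed real, unescaped Extensible HTML (XHTML) in its content.

Listing 5. An Atom document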

 
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US" 
      xml:base="http://www.cafeconleche.org/today.atom">
  <updated>2006-08-04T16:00:04-04:00</updated>
  <id>http://www.cafeconleche.org/</id>
  <title>Cafe con Leche XML News and Resources</title>
  <link rel="self" type="application/atom+xml" href="/today.atom"/>
  <rights>Copyright 2006 Elliotte Rusty Harold</rights>
  <entry>
    <title>Steve Palmer has posted a beta of Vienna 2.1, an open source 
           RSS/Atom client for Mac OS X. 
          </title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml" 
          id="August_1_2006_25279" class="2006-08-01T07:01:19Z">

<p>
 Steve Palmer has posted a beta of <a shape="rect"
 href="http://www.opencommunity.co.uk/vienna21.php">Vienna
 2.1</a>, an open source RSS/Atom client for Mac OS X. Vienna
 is the first reader I've found acceptable for daily use; not
 great but good enough. (Of course my standards for "good
 enough" are pretty high.) 2.1 focuses on improving the user
 interface with a unified layout that lets you scroll through
 several articles, article filtering (e.g. read all articles
 since the last refresh), manual folder reordering, a new get
 info window, and an improved condensed layout.
</p>

</div>
    </content>
    <link href="/#August_1_2006_25279"/>
    <id>http://www.cafeconleche.org/#August_1_2006_25279</id>
    <updated>2006-08-01T07:01:19Z</updated>
  </entry>
  <entry>
    <title>Matt Mullenweg has released Wordpress 2.0.4, 
           a blog engine based on PHP and MySQL.
          </title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml" 
           id="August_1_2006_21750" class="2006-08-01T06:02:30Z">

<p>
 Matt Mullenweg has released <a shape="rect"
 href="http://wordpress.org/development/2006/07/wordpress-204
 /">Wordpress 2.0.4</a>, a blog engine based on PHP and
 MySQL. 2.0.4 plugs various security holes, mostly involving
 plugins.
</p>
</div>
    </content>
    <link href="/#August_1_2006_21750"/>
    <id>http://www.cafeconleche.org/#August_1_2006_21750</id>
    <updated>2006-08-01T06:02:30Z</updated>
  </entry>

</feed>

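Although the element names have changed, the basic approach to processing an Atom document with SimpleXML is the same as for RSS. One difference is that you must now specify the namespace Uniform Resource Identifier (URI) as well as the local name when you request a named element. This is a two-step process: first, request the child elements in a given namespace by passing the namespace URI to the children() function; then request the elements with the right local name in that namespace. Suppose you have first loaded the Atom feed into the variable $feed, like so: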

$feed = simplexml_load_file('http://www.cafeconleche.org/today.atom');

These two lines then find the title element:

$children =  $feed->children('http://www.w3.org/2005/Atom');
$title = $children->title;

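You can condense this into a single statement if you prefer, though the line gets a bit long. All other elements in the namespace must be handled the same way. Listing 6 shows a complete PHP page that displays the titles from a namespaced Atom feed.

Listing 6. A simple PHP Atom headline reader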

 
<?php $feed =  simplexml_load_file('http://www.cafeconleche.org/today.atom');
$children =  $feed->children('http://www.w3.org/2005/Atom');
$title = $children->title;
?>
<html xml:lang="en" lang="en">
<head>
  <title><?php echo $title; ?></title>
</head>
<body>

<h1><?php echo $title; ?></h1>

<?php

$entries = $children->entry;
foreach ($entries as $entry) {

  $details = $entry->children('http://www.w3.org/2005/Atom');
  echo "<h2>" . $details->title . "</h2>";
}
?>

</body>
</html>
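
The condensed one-statement form mentioned earlier can be tried offline with the sketch below. The inline feed deliberately uses an a: prefix, which is an assumption of this example rather than something the Cafe con Leche feed does, to show that the lookup depends on the namespace URI and not on how the document happens to spell its prefixes.

```php
<?php
$atomNs = 'http://www.w3.org/2005/Atom';

$feed = simplexml_load_string(
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom">'
  . '<a:title>Cafe con Leche</a:title>'
  . '<a:entry><a:title>First</a:title></a:entry>'
  . '<a:entry><a:title>Second</a:title></a:entry>'
  . '</a:feed>'
);

// Both steps in one statement: select the Atom-namespace children, then the title.
$title = (string) $feed->children($atomNs)->title;

$headlines = [];
foreach ($feed->children($atomNs)->entry as $entry) {
    $headlines[] = (string) $entry->children($atomNs)->title;
}

echo $title, "\n";
print_r($headlines);
```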

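Mixed content

Why does this example display only the headlines? Because in Atom, the content of an entry can contain the full text of the story: not just plain text, but markup too. This is narrative structure: words in a row, meant to be read by people. Like most such data, it has a lot of mixed content. Here the XML is no longer so simple, and the SimpleXML approach begins to show its limits. It cannot handle mixed content reasonably, and this omission rules it out for many applications.

There is one thing you can do, but it is not a complete solution, and it works only when the content element contains real XHTML. You can use the asXML() function to copy that XHTML straight to the output as unparsed source code, like so: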

echo "<p>" . $details->content->asXML() . "</p>";

This produces output like that shown in Listing 7.

Listing 7. Output from asXML

 
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml" 
        id="August_7_2006_31098" class="2006-08-07T09:38:18Z">
    <p>
 Nikolai Grigoriev has released <a shape="rect"
 href="http://www.grigoriev.ru/svgmath">SVGMath 0.3</a>, a
 presentation MathML formatter that produces SVG written in
 pure Python and published under an MIT license. According to
 Grigoriev, "The new version can work with multiple-namespace
 documents (e.g. replace all MathML subtrees with SVG in an
 XSL-FO or XHTML document); configuration is made more
 flexible, and several bugs are fixed. There is also a
 stylesheet to adjust the vertical position of the resulting
 SVG image in XSL-FO."
    </p>
    </div>
  </content>

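This isn't pure XHTML. The content element has sneaked in from the Atom document, and you really don't want it. Worse, it is in the wrong namespace, so it can't be recognized. Fortunately, the extra element does little harm in practice, because Web browsers ignore any tags they don't recognize. The finished document is invalid, but that doesn't matter much. If it still bothers you, strip it out with string operations, like so: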

  $description = $details->content->asXML();
  $tags = array('<content type="xhtml">', "</content>");
  $notags  = array("", "");
  $description = str_replace($tags, $notags, $description);

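To make the code more robust, use a regular expression rather than assuming the start-tag looks exactly like the one above. In particular, allow for whatever attributes might be present: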

  // end-tag is fixed in form so it's easy to replace
  $description = str_replace("</content>", "", $description);
  // remove start-tag, possibly including attributes and white space
  // preg_replace() replaces ereg_replace(), which was removed in PHP 7
  $description = preg_replace('/<content[^>]*>/', '', $description);

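Even with this improvement, the code still fails on comments, processing instructions, and CDATA sections. However you slice it, I'm afraid this is no longer simple. Mixed content really is beyond what SimpleXML can handle.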
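
A different way to avoid the stray content element, offered here only as a sketch and not as the approach this article takes, is to serialize the XHTML children of content instead of content itself. It relies on the Atom convention that content of type xhtml wraps its markup in a single XHTML div; the sample entry is invented for the example.

```php
<?php
$atomNs  = 'http://www.w3.org/2005/Atom';
$xhtmlNs = 'http://www.w3.org/1999/xhtml';

$entry = simplexml_load_string(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
  . '<content type="xhtml">'
  . '<div xmlns="http://www.w3.org/1999/xhtml"><p>Hello <a href="/x">world</a>.</p></div>'
  . '</content>'
  . '</entry>'
);

$content = $entry->children($atomNs)->content;

$html = '';
// Serialize each XHTML child element of content rather than content itself.
foreach ($content->children($xhtmlNs) as $child) {
    $html .= $child->asXML();
}
echo $html, "\n";
```

This sidesteps the string replacement, but it is no more general: text placed directly inside content, outside any child element, would be skipped.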

XPath

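Expressions such as $rss->channel->item->title are convenient as long as you know what elements are in the document and where they are. However, you don't always know that. In XHTML, for example, heading elements (h1, h2, h3, and so on) can be children of body, div, table, or several other elements. Furthermore, div, table, blockquote, and other elements can nest inside one another many times. In many such less determinate cases, it's easier to use XPath expressions such as //h1 or //h1[contains(., 'Ben')]. SimpleXML supports this through the xpath() function.

Listing 8 shows a PHP page that lists all the titles in an RSS document, both the title of the feed itself and the title of each item.

Listing 8. Using XPath to find title elements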

 
<html xml:lang="en" lang="en">
<head>
  <title>XPath Example</title>
</head>
<body>

<?php
$rss =  simplexml_load_file('http://partners.userland.com/nytRss/nytHomepage.xml');
foreach ($rss->xpath('//title') as $title) {
  echo "<h2>" . $title . "</h2>";
}
?>

</body>
</html>

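SimpleXML supports only XPath location paths and unions of location paths. It does not support XPath expressions that don't return node-sets, such as count(//para) or contains(title, 'Ben').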
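
Functions such as contains() remain usable inside a predicate, because the expression as a whole is still a location path that returns a node-set. The sketch below, which reuses the two headlines from Listing 1 in an inline document, filters titles that way and does the counting in PHP.

```php
<?php
$rss = simplexml_load_string(
    '<rss version="0.92"><channel><title>Mokka mit Schlag</title>'
  . '<item><title>Penn Station: Gone but not Forgotten</title></item>'
  . '<item><title>Personal for Elliotte Harold</title></item>'
  . '</channel></rss>'
);

// A location path with a predicate still returns a node-set, so xpath() accepts it.
$matches = $rss->xpath("//item[contains(title, 'Penn')]/title");

foreach ($matches as $title) {
    echo $title, "\n";
}

// Counting is done in PHP on the returned array rather than with XPath's count().
echo count($rss->xpath('//title')), "\n";
```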

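As of PHP 5.1, SimpleXML can run XPath queries directly against namespaced documents. As always, XPath location paths must use namespace prefixes, even if the document being searched uses the default namespace. The registerXPathNamespace() function associates a prefix with the namespace URI for use in subsequent queries. For example, to find all the title elements in an Atom document, use code like that in Listing 9.

Listing 9. Using XPath with namespaces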

 
$atom =  simplexml_load_file('http://www.cafeconleche.org/today.atom');
$atom->registerXPathNamespace('atm', 'http://www.w3.org/2005/Atom');
$titles = $atom->xpath('//atm:title');
foreach ($titles as $title) {
  echo "<h2>" . $title . "</h2>";
}
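
To see why the prefix is needed even with a default namespace, the following sketch runs the same query with and without a prefix against a small inline Atom document. The inline feed is a stand-in so the example runs offline.

```php
<?php
$atom = simplexml_load_string(
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Feed</title>'
  . '<entry><title>One</title></entry></feed>'
);

// Without a prefix, XPath looks for title elements in no namespace at all.
$unprefixed = $atom->xpath('//title');

// With a registered prefix, the same query finds the Atom titles.
$atom->registerXPathNamespace('atm', 'http://www.w3.org/2005/Atom');
$prefixed = $atom->xpath('//atm:title');

var_dump(empty($unprefixed));              // bool(true): nothing matched
foreach ($prefixed as $title) {
    echo $title, "\n";
}
```

The prefix atm is arbitrary; only the URI it is bound to has to match the document.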

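One final caution: XPath in PHP is quite slow. When I switched to XPath expressions, page loads went from imperceptible to several seconds, even on a lightly loaded local server. If you adopt these techniques, you'll need some form of caching to get reasonable performance. Generating every page dynamically is not an option.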

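Conclusion

Provided you don't need to handle mixed content, SimpleXML is a nice addition to the PHP programmer's toolbox. That covers a lot of use cases. In particular, it works well with simple, record-like data. As long as the document isn't too deep or too complex and has no mixed content, SimpleXML is much easier to use than the DOM. It helps most when you know the document's structure in advance, though XPath goes a long way toward relaxing that requirement. The lack of support for validation and mixed content is inconvenient, but not always a showstopper. Many simple formats have no mixed content, and many applications involve only predictable data formats. If that describes your needs, give SimpleXML a try. With a little attention to error handling, and with caching to address performance, SimpleXML can be a reliable, robust way to process XML in PHP.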