java xpath

Vipan Singla e-mail: vipan@vipan.com
XML and XPath Usage
Most common DOM interfaces:
Node: The base datatype of the DOM.
Element: The vast majority of the objects you'll deal with are "Elements".
Attr: Represents an attribute of an "Element".
Text: The actual content of an "Element" or "Attr".
Document: Represents the entire XML document. A "Document" object is often referred to as a DOM tree.
Common DOM methods

Document.getDocumentElement()
Returns the root "element" of the document, i.e. the top-level tag in the document. It is different from the "root" itself, which is just a "/": the root element resides below the "/". Other nodes, such as comments and processing instructions, can also sit directly below the "/".
Node.getFirstChild() and Node.getLastChild()
Returns the first or last child of a given Node.
Node.getNextSibling() and Node.getPreviousSibling()
Return the next or previous node of any type (element, text, comment and so on) at the same level as the node itself in the document tree.
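Element.getAttribute(attrName) and Element.getAttributeNode(attrName)
For a given Element, getAttribute returns the value of the attribute with the requested name as a String (an empty string if the attribute is absent). If you want the Attr object for the attribute named id, use getAttributeNode("id").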
getElementsByTagName("tag_name")
Retrieve all of the <tag_name> elements in the document. This method saves the trouble of writing code to traverse the entire tree. Or, you can use XPath. See below.
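The DOM calls listed above can be tried on a tiny in-memory document. The following is only a sketch: the class name DomMethodsSketch and the sample XML are made up for illustration, and the XML is parsed from a string rather than from a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomMethodsSketch {

    // Hypothetical sample document, made up for this illustration.
    static final String XML =
        "<!-- prolog comment -->"
        + "<library id=\"main\"><book>A</book><book>B</book></library>";

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);

        // The prolog comment sits directly below the root, before the root element.
        Node first = doc.getFirstChild();
        System.out.println("first child is a comment: "
                + (first.getNodeType() == Node.COMMENT_NODE));

        Element root = doc.getDocumentElement();
        System.out.println("root element: " + root.getTagName());

        // getAttribute gives the value; getAttributeNode gives the Attr object.
        String idValue = root.getAttribute("id");
        Attr idAttr = root.getAttributeNode("id");
        System.out.println("id = " + idValue + ", Attr name = " + idAttr.getName());

        // Sibling navigation between the two <book> elements.
        Node firstBook = root.getFirstChild();
        Node secondBook = firstBook.getNextSibling();
        System.out.println("second book: " + secondBook.getFirstChild().getNodeValue());

        // Collect every <book> without walking the tree by hand.
        NodeList books = doc.getElementsByTagName("book");
        System.out.println("number of books: " + books.getLength());
    }
}
```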
All Seven Kinds of Nodes
The root
Elements
Text
Attributes
Namespaces
Processing instructions
Comments
XPath Abbreviated Syntax Examples
In all cases below, the "context node" is the node you want to start searching from in a pre-parsed "document" object. You must be holding a reference to the context node. Remember, a Document object is a type of Node. For example, in:
NodeIterator nl = XPathAPI.selectNodeIterator(node, "para");
, the argument node is the context node you want to start searching from. You may obtain the "Document" object using:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File("C:\\some_dir\\some_file.xml"));
The parse method can also take an "InputStream", a URI string or an XML "InputSource" object. After you get the "Document" object, you should merge adjacent "Text" nodes into a single "Text" node and drop empty ones using:
doc.getDocumentElement().normalize();
Otherwise, the text of an element may be split across several adjacent "Text" nodes (some of them empty), and you are going to have a tough time reaching the useful textual content within an element.
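What normalize() does and does not do can be checked with a small experiment. This is a sketch with made-up names (NormalizeSketch and its two helper methods). It builds adjacent Text nodes by hand, since a parser will not necessarily produce them for such a short input, and it also shows that whitespace-only Text nodes between elements are still present after the call.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NormalizeSketch {

    // Builds <note> with three adjacent Text children and returns the
    // number of children before and after normalize().
    static int[] adjacentTextCounts() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element note = doc.createElement("note");
            doc.appendChild(note);
            note.appendChild(doc.createTextNode("Here is "));
            note.appendChild(doc.createTextNode("to "));
            note.appendChild(doc.createTextNode("xpath!"));
            int before = note.getChildNodes().getLength();
            note.normalize();   // merges the three Text nodes into one
            int after = note.getChildNodes().getLength();
            return new int[] {before, after};
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample DOM", e);
        }
    }

    // Parses indented XML and returns the child count of the root element
    // after normalize(): the whitespace-only Text nodes are not removed.
    static int indentedChildCount() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<a>\n  <b/>\n</a>")));
            doc.getDocumentElement().normalize();
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        int[] counts = adjacentTextCounts();
        System.out.println("text children before/after: " + counts[0] + "/" + counts[1]);
        System.out.println("children of <a> after normalize: " + indentedChildCount());
    }
}
```

If the whitespace-only nodes get in the way, selecting elements with XPath (for example *) sidesteps them, because such expressions never match text nodes.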
para selects the "para" element children of the context node
* selects all element children of the context node
text() selects all text node children of the context node
@name selects the "name" attribute of the context node
@* selects all the attributes of the context node
para[1] selects the first "para" child of the context node
para[last()] selects the last "para" child of the context node
*/para selects all para grandchildren of the context node
/doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
chapter//para selects the "para" element descendants of the "chapter" element children of the context node
//para selects all the para descendants of the "document root" and thus selects all "para" elements in the same document as the context node
//olist/item selects all the "item" elements in the same document as the context node that have an "olist" parent
. selects the context node itself
.//para selects the "para" element descendants of the context node
.. selects the parent of the context node
../@lang selects the "lang" attribute of the parent of the context node
para[@type="warning"] selects all "para" children of the context node that have a "type" attribute with value "warning"
para[@type="warning"][5] selects the fifth "para" child of the context node that has a "type" attribute with value "warning"
para[5][@type="warning"] selects the fifth "para" child of the context node if that child has a "type" attribute with value "warning"
chapter[title="Introduction"] selects the "chapter" children of the context node that have one or more "title" children with string-value equal to "Introduction" (Use this to match to a particular element which contains the text value you desire)
chapter[title] selects the "chapter" children of the context node that have one or more "title" children
employee[@secretary and @assistant] selects all the "employee" children of the context node that have both a "secretary" attribute and an "assistant" attribute
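A few of these expressions can be tried against a small document. The sketch below is an assumption-laden illustration: it uses the javax.xml.xpath package that ships with the JDK instead of Xalan's XPathAPI class used elsewhere on this page, and the class name AbbreviatedSyntaxSketch and the sample XML are invented. The context node is the "doc" element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AbbreviatedSyntaxSketch {

    // Hypothetical sample: three "para" children, two of them warnings.
    static final String XML =
        "<doc lang=\"en\">"
        + "<para type=\"warning\">w1</para>"
        + "<para>p2</para>"
        + "<para type=\"warning\">w3</para>"
        + "</doc>";

    // Evaluates the expression with <doc> as context node and returns the
    // result converted to a string (string-value of the first node selected).
    static String eval(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Node context = doc.getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, context);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "para[1]",
            "para[last()]",
            "@lang",
            "count(*)",
            "para[@type=\"warning\"][2]",   // second of the warning paras
            "para[2][@type=\"warning\"]"    // second para, only if it is a warning
        };
        for (String e : expressions) {
            System.out.println(e + " => '" + eval(e) + "'");
        }
    }
}
```

The last two expressions show why predicate order matters: the first filters by type and then counts, the second counts and then filters, so it selects nothing here.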
The default axis is "child". For example, a location path div/para is short for child::div/child::para.
The abbreviation for attribute:: is @. For example, a location path para[@type="warning"] is short for child::para[attribute::type="warning"].
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para. Here, even a "para" element that is a document element will be selected since the document element node is a child of the root node.
The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.
A location step of . is short for self::node(). This is particularly useful in conjunction with //. For example, the location path .//para is short for self::node()/descendant-or-self::node()/child::para and so will select all para descendant elements of the context node.
Similarly, a location step of .. is short for parent::node(). For example, ../title is short for parent::node()/child::title and so will select the title children of the parent of the context node.
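The equivalences above, and the difference between //para[1] and /descendant::para[1], can be verified by counting the selected nodes. This is a sketch only: it relies on the JDK's javax.xml.xpath package rather than Xalan's XPathAPI, and the class name DescendantAxisSketch and the sample XML are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DescendantAxisSketch {

    // Hypothetical sample: two chapters holding two and one "para" elements.
    static final String XML =
        "<doc>"
        + "<chapter><para>a</para><para>b</para></chapter>"
        + "<chapter><para>c</para></chapter>"
        + "</doc>";

    // Returns how many nodes the given location path selects.
    static int count(String locationPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double n = (Double) xpath.evaluate(
                    "count(" + locationPath + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + locationPath, e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated and unabbreviated forms select the same nodes.
        System.out.println(count("//para"));
        System.out.println(count("/descendant-or-self::node()/child::para"));
        System.out.println(count("//para/.."));
        System.out.println(count("//para/parent::node()"));

        // First para child of each parent versus first para in the document.
        System.out.println(count("//para[1]"));
        System.out.println(count("/descendant::para[1]"));
    }
}
```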
Demonstration Example of Using XML Xpath in a Java program
Save this code in XPathDemo.java file:
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.apache.xpath.*;


/**
 * This class demonstrates how to use Java to parse an XML file and get
 * any element's content or attribute's value WITHOUT "walking the tree".
 * It uses XPath to achieve this goal.  Also shown is a trivial usage of
 * an XML transform to print the parsed XML file to console.
 *
 * Some of the program snippets are by http://xml.apache.org.
 *
 */
public class XPathDemo {

  public static void main(String[] args) {

    if (args.length < 2) {
      System.out.println("Usage: ");
      System.out.println(
        "java -classpath xerces.jar;.;xalan.jar "
        + " XPathDemo your-file.xml your-xpath-string");
      return;
    }

    try {

      /****************************************************************
       * How to turn an XML file into a document object in Java
       ****************************************************************/

      System.out.println("Parsing XML file " + args[0] + " ...");

      DocumentBuilderFactory docBuilderFactory =
                    DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
      // Parse the XML file and build the Document object in RAM
      Document doc = docBuilder.parse(new File(args[0]));

      // Normalize text representation.
      // Collapses adjacent text nodes into one node.
      doc.getDocumentElement().normalize();
      /****************************************************************/


      /****************************************************************
       * How to use xpath to extract info from document object in Java
       ****************************************************************/
      String xpath = args[1];
      System.out.println("\nQuerying DOM using xpath string:" + xpath);

      // Catches the first node that meets the criteria of xpath string
      String str = XPathAPI.eval(doc, xpath).toString();
      System.out.println("=>" + str + "<=\n");
      /****************************************************************/


      /****************************************************************
       * How to get root node of the document object
       ****************************************************************/
      Node root = doc.getDocumentElement();
      System.out.println("\nRoot element of the doc is =>"
                + root.getNodeName() + "<=");
      /****************************************************************/


      /****************************************************************
       * How to print the parsed xml file right back to system out
       ****************************************************************/
      String xpathString = args[1];
      // Set up an identity transformer to use as serializer.
      // This one can write input to output stream
      Transformer serializer =
          TransformerFactory.newInstance().newTransformer();
      serializer.setOutputProperty(
            OutputKeys.OMIT_XML_DECLARATION, "yes");

      // Use the simple XPath API to select a nodeIterator.
      System.out.println("\nPrinting subtree under xpath =>"
                        + xpathString + "<=");
      NodeIterator nl = XPathAPI.selectNodeIterator(doc, xpathString);

      Node n;
      while ((n = nl.nextNode()) != null) {
        // Serialize the found nodes to System.out
        serializer.transform(
            new DOMSource(n),
            new StreamResult(System.out));
      }
      /****************************************************************/

    }
    catch (SAXParseException err) {
      String msg =
        "** SAXParseException"
          + ", line "
          + err.getLineNumber()
          + ", uri "
          + err.getSystemId()
          + "\n"
          + "   "
          + err.getMessage();
      System.out.println(msg);
      // print stack trace
      Exception x = err.getException();
      ((x == null) ? err : x).printStackTrace();
    }
    catch (SAXException e) {
      String msg = "SAXException";
      System.out.println(msg);
      Exception x = e.getException();
      ((x == null) ? e : x).printStackTrace();
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    catch (Throwable t) {
      t.printStackTrace();
      String msg = "Some other exception while getting XML";
      System.out.println(msg);
    }
  }
}
Download Xalan from http://xml.apache.org, extract/unzip the downloaded file, find xerces.jar and xalan.jar files and copy these files in the same directory where you saved the above code in XPathDemo.java file (just to make the demonstration easier).
The download is about 7MB although the two files you need are about 2MB combined. The rest is documentation and the full Java source of Xalan!
Compile XPathDemo.java using:
javac -classpath xerces.jar;.;xalan.jar XPathDemo.java
Get or create any XML file. Here is a simple example. Save it as, say, example.xml file in the same directory as the above files (just to make the demonstration easier).
<demo-xpath>
  <database-access db-name="db1">
    Here is to xpath!
    <username>scott</username>
    <password>tiger</password>
    May be some text here.
    Some more text here.
  </database-access>
  Last text line!
</demo-xpath>
Now, you have Apache's XML parser in xerces.jar, XPath API in xalan.jar, your Java program in XPathDemo.class and a sample XML file example.xml. You can try to run your Java program, pass it the XML file name and any XPath string, and see what the program gives you! Some generic XPath strings to try are . for current node (in this Java program, same as the root node) and / for root node.
Run XPathDemo using these commands one by one as examples:
java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml /
java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml .
java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml /demo-xpath
java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml //@db-name
java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml //username
These runs will demonstrate different ways to use XPath to get the content of an element or the value of an attribute.
If you specify a non-existent element or attribute, the toString() method of the XObject obtained from the XPathAPI.eval(...) method returns an empty string instead of throwing a NullPointerException, by design. Actually, a subclass of XObject, XNull, is returned whose toString() method has been programmed to return an empty string. See Xalan's javadoc.
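The same "no match" situation can be explored with the javax.xml.xpath package bundled with the JDK. Note that this is a different API from the Xalan XPathAPI class discussed above, so the sketch below says nothing about XObject or XNull. The class name MissingNodeSketch is made up, and the XML is a shortened copy of example.xml.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MissingNodeSketch {

    // Shortened version of example.xml.
    static final String XML =
        "<demo-xpath><database-access db-name=\"db1\">"
        + "<username>scott</username><password>tiger</password>"
        + "</database-access></demo-xpath>";

    // Evaluates the expression against the document root and returns the
    // result as the requested type (one of the XPathConstants values).
    static Object eval(String expression, QName returnType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, returnType);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // An existing element yields its string-value.
        System.out.println("'" + eval("//username", XPathConstants.STRING) + "'");

        // A missing element or attribute yields an empty string ...
        System.out.println("'" + eval("//no-such-element", XPathConstants.STRING) + "'");
        System.out.println("'" + eval("//@no-such-attr", XPathConstants.STRING) + "'");

        // ... a null single node, and an empty node list.
        System.out.println(eval("//no-such-element", XPathConstants.NODE));
        NodeList list = (NodeList) eval("//no-such-element", XPathConstants.NODESET);
        System.out.println(list.getLength());
    }
}
```

Because an empty string is also a legitimate element content, asking for a node or a node list is the safer way to tell "missing" apart from "present but empty".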
Core Functions
Each function in the function library is specified using a function prototype, which gives the return type, the name of the function, and the type of the arguments. If an argument type is followed by a question mark, then the argument is optional; otherwise, the argument is required.
Node-Set Functions
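number last(): The context size, i.e. the number of nodes in the node-set currently being evaluated. Used in a predicate such as para[last()], it picks the last node.
number position(): The context position, i.e. the 1-based index of the current node within that node-set.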
number count(node-set): Number of nodes in the node-set.
node-set id(object): id("foo") selects the element with unique ID "foo" and id("foo")/child::para[position()=5] selects the fifth "para" child of the element with unique ID "foo".
string local-name(node-set?): Local part of the expanded-name of the node in the argument node-set that is first in document order. If the argument node-set is empty or the first node has no expanded-name, an empty string is returned. If the argument is omitted, it defaults to a node-set with the context node as its only member.
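string namespace-uri(node-set?): Namespace URI of the expanded-name of the node in the argument node-set that is first in document order. Returns an empty string if the node-set is empty, or if that node has no expanded-name or no namespace.
string name(node-set?): Qualified name (QName) of the node in the argument node-set that is first in document order, including any namespace prefix, for example "xsl:template".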
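The node-set functions are easiest to tell apart on a document that uses a namespace prefix. The sketch below makes several assumptions: it uses the JDK's javax.xml.xpath package with a namespace-aware parser, the class name NodeSetFunctionsSketch, the prefix "lib" and the URI "urn:example:lib" are invented, and wildcard steps are used so that no namespace mapping has to be registered with the XPath object.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NodeSetFunctionsSketch {

    // Hypothetical sample with a made-up namespace.
    static final String XML =
        "<lib:library xmlns:lib=\"urn:example:lib\">"
        + "<lib:book n=\"1\"/><lib:book n=\"2\"/><lib:book n=\"3\"/>"
        + "</lib:library>";

    // Evaluates the expression against the document root and returns the
    // result converted to a string.
    static String eval(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // needed for local-name() and namespace-uri()
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "name(/*)",                        // qualified name, with prefix
            "local-name(/*)",                  // name without prefix
            "namespace-uri(/*)",               // URI bound to the prefix
            "count(/*/*)",                     // number of book elements
            "/*/*[last()]/@n",                 // last book
            "/*/*[position() = 2]/@n",         // second book
            "count(/*/*[position() < last()])" // every book except the last
        };
        for (String e : expressions) {
            System.out.println(e + " => " + eval(e));
        }
    }
}
```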
String Functions
string string(object?): Converts an object to a string as follows:
A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
A number is converted to a string as follows:
NaN is converted to the string NaN
positive zero is converted to the string 0
negative zero is converted to the string 0
positive infinity is converted to the string Infinity
negative infinity is converted to the string -Infinity
if the number is an integer, the number is represented in decimal form as a Number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
otherwise, the number is represented in decimal form as a Number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative.
The boolean false value is converted to the string false. The boolean true value is converted to the string true.
An object of a type other than the four basic types is converted to a string in a way that is dependent on that type.
If the argument is omitted, it defaults to a node-set with the context node as its only member.
NOTE: The string function is not intended for converting numbers into strings for presentation to users. The format-number function and xsl:number element in [XSLT] provide this functionality.
string concat(string, string, string*): Concatenates its arguments.
boolean starts-with("string1", "string2"): Checks if "string1" starts with "string2".
boolean contains("string1", "string2"): Checks if "string1" contains "string2".
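string substring-before("string1", "string2"): Returns the part of "string1" that precedes the first occurrence of "string2", or an empty string if "string1" does not contain "string2".
string substring-after("string1", "string2"): Returns the part of "string1" that follows the first occurrence of "string2", or an empty string if "string1" does not contain "string2".
string substring(string, number1, number2?): Substring starting at position number1. number2, if present, is the length of the substring (not an end position); otherwise the substring runs to the end of the string.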
More precisely, each character in the string (see [3.6 Strings]) is considered to have a numeric position: the position of the first character is 1, the position of the second character is 2 and so on. This differs from Java and ECMAScript, in which the String.substring method treats the position of the first character as 0.
The returned substring contains those characters for which the position of the character is greater than or equal to the rounded value of the second argument and, if the third argument is specified, less than the sum of the rounded value of the second argument and the rounded value of the third argument; the comparisons and addition used for the above follow the standard IEEE 754 rules; rounding is done as if by a call to the round function. The following examples illustrate various unusual cases:
substring("12345", 1.5, 2.6) returns "234"

substring("12345", 0, 3) returns "12"

substring("12345", 0 div 0, 3) returns ""

substring("12345", 1, 0 div 0) returns ""

substring("12345", -42, 1 div 0) returns "12345"

substring("12345", -1 div 0, 1 div 0) returns ""
number string-length(string?): Number of characters in the string. If no argument, returns length of string-value of context node.
string normalize-space(string?): Removes leading and trailing whitespace and replaces sequences of whitespace characters with a single space. If no argument, operates on the string-value of the context node.
string translate(string, string1, string2): In "string", replaces occurrences of characters in "string1" with character at the corresponding position in "string2". For example, translate("bar","abc","ABC") returns the string BAr. If there is a character in the second argument string with no character at a corresponding position in the third argument string (because the second argument string is longer than the third argument string), then occurrences of that character in the first argument string are removed. For example, translate("--aaa--","abc-","ABC") returns "AAA". If a character occurs more than once in the second argument string, then the first occurrence determines the replacement character. If the third argument string is longer than the second argument string, then excess characters are ignored. Generally used for case-conversion.
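The string functions need no input document, so they can be tried directly from Java. The sketch below assumes the JDK's javax.xml.xpath package (not Xalan's XPathAPI) and uses an empty Document as the context; the class name StringFunctionsSketch is made up. Keep in mind that XPath positions start at 1 and that the third argument of substring is a length.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class StringFunctionsSketch {

    // Evaluates an expression that does not depend on any document content
    // and returns the result converted to a string.
    static String eval(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return XPathFactory.newInstance().newXPath().evaluate(expression, empty);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "concat('x', 'path', '!')",
            "starts-with('xpath', 'xp')",
            "contains('xpath', 'pat')",
            "substring-before('1999/04/01', '/')",
            "substring-after('1999/04/01', '/')",
            "substring('12345', 2)",          // from position 2 to the end
            "substring('12345', 1.5, 2.6)",   // arguments are rounded to 2 and 3
            "substring('12345', 0, 3)",       // positions 0, 1, 2: only 1 and 2 exist
            "string-length('hello')",
            "normalize-space('  a   b  ')",
            "translate('bar', 'abc', 'ABC')",
            "translate('--aaa--', 'abc-', 'ABC')"
        };
        for (String e : expressions) {
            System.out.println(e + " => '" + eval(e) + "'");
        }
    }
}
```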
Boolean Functions
boolean boolean(object): Converts object to a boolean as follows:
a number is true if and only if it is neither positive or negative zero nor NaN
a node-set is true if and only if it is non-empty
a string is true if and only if its length is non-zero
an object of a type other than the four basic types is converted to a boolean in a way that is dependent on that type
boolean not(boolean): Reverses the argument.
boolean true(): Returns true.
boolean false(): Returns false.
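boolean lang(string): Returns true if the language of the context node, as given by the nearest xml:lang attribute on the node or its ancestors, is the same as or a sublanguage of the argument (ignoring case). For example, lang("en") is true for xml:lang="en-GB".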
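The boolean conversion rules are short but easy to misremember, in particular that the string 'false' is true because it is non-empty. A minimal sketch, assuming the JDK's javax.xml.xpath package and an empty Document as context (the class name BooleanFunctionsSketch is made up; lang() is left out because it needs a document with xml:lang attributes):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class BooleanFunctionsSketch {

    // Evaluates the expression against an empty document and returns the
    // result converted to a boolean by the XPath rules.
    static boolean eval(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, empty, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "boolean('')",        // empty string
            "boolean('false')",   // non-empty string, whatever it says
            "boolean(0)",         // zero
            "boolean(-3)",        // any other number
            "boolean(0 div 0)",   // NaN
            "boolean(/*)",        // empty node-set: the document has no element
            "not(false())"
        };
        for (String e : expressions) {
            System.out.println(e + " => " + eval(e));
        }
    }
}
```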
Number Functions
number number(object?): Converts object to a number as follows:
a string that consists of optional whitespace followed by an optional minus sign followed by a Number followed by whitespace is converted to a number that is nearest to the mathematical value represented by the string; any other string is converted to NaN
boolean true is converted to 1; boolean false is converted to 0
a node-set is first converted to a string as if by a call to the string function and then converted in the same way as a string argument
an object of a type other than the four basic types is converted to a number in a way that is dependent on that type
If the argument is omitted, it defaults to a node-set with the context node as its only member.
number sum(node-set): Sum total of all nodes in node-set after converting their string-values to numbers.
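number floor(number): The largest integer that is not greater than the argument, e.g. floor(-1.5) is -2.
number ceiling(number): The smallest integer that is not less than the argument, e.g. ceiling(-1.5) is -1.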
number round(number): The round function returns the number that is closest to the argument and that is an integer. If there are two such numbers, then the one that is closest to positive infinity is returned. If the argument is NaN, then NaN is returned. If the argument is positive infinity, then positive infinity is returned. If the argument is negative infinity, then negative infinity is returned. If the argument is positive zero, then positive zero is returned. If the argument is negative zero, then negative zero is returned. If the argument is less than zero, but greater than or equal to -0.5, then negative zero is returned.
NOTE: For these last two cases, the result of calling the round function is not the same as the result of adding 0.5 and then calling the floor function.
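The number functions can be checked the same way. The sketch below assumes the JDK's javax.xml.xpath package; the class name NumberFunctionsSketch and the "order" document with two prices are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NumberFunctionsSketch {

    // Hypothetical sample document.
    static final String XML =
        "<order><price>1.5</price><price>2.5</price></order>";

    // Evaluates the expression against the document root and returns the
    // result converted to a number by the XPath rules.
    static double eval(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "number(' 12.5 ')",       // surrounding whitespace is allowed
            "number('abc')",          // not a number: NaN
            "number(true())",
            "number(/order/price)",   // string-value of the first price
            "sum(/order/price)",
            "floor(-1.5)",
            "ceiling(-1.5)",
            "round(2.5)",             // ties go towards positive infinity
            "round(-2.5)"
        };
        for (String e : expressions) {
            System.out.println(e + " => " + eval(e));
        }
    }
}
```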
Data Model
XPath operates on an XML document as a tree. For all seven types of node, there is a way of determining a string-value for a node of that type. For some types of node, the string-value is part of the node; for other types of node, the string-value is computed from the string-value of descendant nodes.
NOTE: For element nodes and root nodes, the string-value of a node is not the same as the string returned by the DOM nodeValue method (see [DOM]).
There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities). The attribute nodes and namespace nodes of an element occur before the children of the element. The namespace nodes are defined to occur before the attribute nodes. The relative order of namespace nodes is implementation-dependent. The relative order of attribute nodes is implementation-dependent. Reverse document order is the reverse of document order.
Root nodes and element nodes have an ordered list of child nodes.
Nodes never share children
Every node other than the root node has exactly one parent, which is either an element node or the root node. A root node or an element node is the parent of each of its child nodes. The descendants of a node are the children of the node and the descendants of the children of the node.
Root Node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node. The root node also has as children processing instruction and comment nodes for processing instructions and comments that occur in the prolog and after the end of the document element.
The string-value of the root node is the concatenation of the string-values of all text node descendants of the root node in document order.
The children of an element node are the element nodes, comment nodes, processing instruction nodes and text nodes for its content. Entity references to both internal and external entities are expanded. Character references are resolved.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
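The difference between the XPath string-value of an element and the DOM node value can be seen side by side. A minimal sketch, assuming the JDK's javax.xml.xpath package; the class name StringValueSketch is made up and the XML is a fragment modelled on example.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StringValueSketch {

    // Fragment modelled on example.xml, without indentation.
    static final String XML =
        "<database-access db-name=\"db1\">Here is to xpath!"
        + "<username>scott</username><password>tiger</password>"
        + "</database-access>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // XPath string-value: all text node descendants, concatenated in
    // document order. Attribute values are not part of it.
    static String xpathStringValue(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate("string(/database-access)", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println("XPath string-value: " + xpathStringValue(doc));
        // The DOM node value of an element is null.
        System.out.println("DOM getNodeValue(): " + doc.getDocumentElement().getNodeValue());
    }
}
```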
Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.
NOTE: This is different from the DOM, which does not treat the element bearing an attribute as the parent of the attribute.
Elements never share attribute nodes.
The = operator tests whether two nodes have the same value, not whether they are the same node. Thus attributes of two different elements may compare as equal using =, even though they are not the same node.
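The attribute rules above (an element is the parent of its attributes, attributes are not children, and = compares values rather than identity) can be checked with a few expressions. This is a sketch under assumptions: it uses the JDK's javax.xml.xpath package, and the class name AttributeModelSketch and the "team" document are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeModelSketch {

    // Hypothetical sample: two employees that carry the same id value.
    static final String XML =
        "<team><employee id=\"7\" boss=\"7\"/><employee id=\"7\"/></team>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // Evaluates the expression against the document root, result as string.
    static String eval(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, parse());
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XPath: the element is the parent of the attribute ...
        System.out.println(eval("name(/team/employee[1]/@id/..)"));
        // ... but attributes are not among its children.
        System.out.println(eval("count(/team/employee[1]/node())"));
        System.out.println(eval("count(/team/employee[1]/@*)"));

        // "=" compares string-values; the union shows they are two distinct nodes.
        System.out.println(eval("/team/employee[1]/@id = /team/employee[2]/@id"));
        System.out.println(eval("count(/team/employee[1]/@id | /team/employee[2]/@id)"));

        // DOM: an Attr has no parent node, only an owner element.
        Element first = (Element) parse().getDocumentElement().getFirstChild();
        Attr id = first.getAttributeNode("id");
        System.out.println(id.getParentNode() + " / " + id.getOwnerElement().getTagName());
    }
}
```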
An attribute node has a string-value, which is the normalized attribute value. An attribute whose normalized value is an empty string is not treated specially: it results in an attribute node whose string-value is a zero-length string.
There is a comment node for every comment, except for any comment that occurs within the document type declaration.
The string-value of a comment is the content of the comment, not including the opening <!-- or the closing -->.
A text node never has an immediately following or preceding sibling that is another text node. The string-value of a text node is the character data. A text node always has at least one character of data.
A CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of < and & were replaced by &lt; and &amp; respectively.
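In practice this means that CDATA sections and entity references are invisible to XPath: the string-value contains the plain characters. A minimal sketch, assuming the JDK's javax.xml.xpath package; the class name CdataSketch and both sample strings are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataSketch {

    // Parses the XML and returns the result of the expression as a string.
    static String eval(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Markup characters inside CDATA arrive as ordinary text.
        System.out.println(eval("<a>x<![CDATA[<b> & ]]>y</a>", "string(/a)"));

        // Entity references are expanded before XPath sees the text.
        System.out.println(eval("<a>1 &lt; 2 &amp; 3</a>", "string(/a)"));
    }
}
```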
© Vipan Singla 2000