XML SAX or DOM

Answers to Questions

Why do we need an XML parser?

We need an XML parser because we do not want to do everything in our application from scratch; we need "helper" programs or libraries to do the low-level but necessary work. This work includes checking well-formedness, validating the document against its DTD or schema (validating parsers only), resolving character references, handling CDATA sections, and so on. XML parsers are exactly such "helper" programs, and they do all of these jobs. With an XML parser we are shielded from these complexities and can concentrate on programming at a high level through the APIs implemented by the parser, and thus gain programming efficiency.
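
To make the "helper" idea concrete, here is a minimal sketch. It assumes the standard JAXP API that ships with the JDK (not the Xerces classes used in the programs at the end of these notes); the class name ParserHelpDemo and the method rootText are made up for illustration. The parser resolves a character reference, unwraps a CDATA section and rejects a document that is not well-formed, without any such logic in our own code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParserHelpDemo {

    // Parse an XML string and return the text of the root element,
    // or null if the parser rejects the input as not well-formed.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;                       // the parser reported an error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // the character reference &#65; and the CDATA section are resolved for us
        System.out.println(rootText("<msg>&#65; <![CDATA[<b> & ]]></msg>"));
        // the mismatched tags are detected by the parser, not by our code
        System.out.println(rootText("<msg><b></msg>"));
    }
}
```

The parser may also print its own error message for the second document; that message comes from the parser's default error handling, not from this code.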

Is there an XML-parser debugger?

I have not seen any.

Which one is better, SAX or DOM?

Both SAX and DOM parsers have their advantages and disadvantages. Which one is better depends on the characteristics of your application (please refer to some of the questions below).

Which parser is faster, a DOM parser or a SAX parser?

A SAX parser is generally faster, because it does not have to build a tree structure in memory.

What's the difference between a tree-based API and an event-based API?

A tree-based API is centered around a tree structure and therefore provides interfaces to the components of a tree (which is a DOM document), such as the Document interface, Node interface, NodeList interface, Element interface, Attr interface and so on. An event-based API, by contrast, provides interfaces for handlers. There are four handler interfaces: the ContentHandler interface, DTDHandler interface, EntityResolver interface and ErrorHandler interface.
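
As a rough illustration of the tree-based side, the sketch below touches the Document, NodeList, Element and Attr interfaces in turn. It assumes the JAXP API from the JDK rather than the Xerces classes, and a cut-down version of the shapes document; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TreeApiDemo {

    // Walk the tree through the Document, NodeList, Element and Attr
    // interfaces and return one "color@x" string per circle.
    // Assumes every circle has a color attribute and an <x> child.
    static List<String> circles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList list = doc.getElementsByTagName("circle");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < list.getLength(); i++) {
                Element circle = (Element) list.item(i);
                Attr color = circle.getAttributeNode("color");
                Element x = (Element) circle.getElementsByTagName("x").item(0);
                result.add(color.getValue() + "@" + x.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<shapes><circle color=\"BLUE\"><x>20</x></circle>"
                   + "<circle color=\"RED\"><x>40</x></circle></shapes>";
        System.out.println(circles(xml));   // prints [BLUE@20, RED@40]
    }
}
```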

What is the difference between a DOMParser and a SAXParser?

DOM parsers and SAX parsers work in different ways.

A DOM parser creates a tree structure in memory from the input document and then waits for requests from the client. A SAX parser does not create any internal structure. Instead, it treats the occurrences of the components of an input document as events, and tells the client what it reads as it reads through the input document.
A DOM parser always serves the client application with the entire document, no matter how much of it the client actually needs. A SAX parser serves the client application with only one piece of the document at any given time.
With a DOM parser, method calls in the client application have to be explicit and form a kind of chain. With SAX, certain methods (usually overridden by the client) are invoked automatically (implicitly) when certain events occur; this mechanism is called "callback". The client does not have to call these methods explicitly, though it could.
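
The callback mechanism can be sketched as follows. This is only an illustration under the assumption that the JAXP SAXParserFactory from the JDK is used; the class name CallbackTraceDemo and the method trace are made up. Note that the client never calls startElement() or endElement() itself; the parser does, in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CallbackTraceDemo {

    // Parse the input and return the sequence of callbacks the parser made.
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() { events.add("startDocument"); }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void endDocument() { events.add("endDocument"); }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // one line per callback, in the order the parser invoked them
        trace("<shapes><circle><x>20</x></circle></shapes>")
                .forEach(System.out::println);
    }
}
```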

There are a lot of XML parsers available now. What makes a parser a good parser?
How do we decide which parser is good?

Ideally a good parser should be fast (time efficient), space efficient, rich in functionality and easy to use. But in reality, none of the main parsers has all of these features at the same time. For example, a DOMParser is rich in functionality (it creates a DOM tree in memory, lets you access any part of the document repeatedly, and lets you modify the tree), but it is space inefficient when the document is huge, and it takes somewhat longer to learn. A SAXParser, however, is much more space efficient for big input documents (because it creates no internal structure). What's more, it runs faster and is easier to learn than a DOMParser because its API is really simple. But it provides less functionality, which means the users themselves have to take care of more, such as creating their own data structures. So what is a good parser? I think the answer really depends on the characteristics of your application.

In what cases do we prefer a DOMParser to a SAXParser? In what cases do we prefer a SAXParser to a DOMParser?
What are some real-world applications where using a SAX parser is more advantageous than using a DOM parser, and vice versa?
What are the usual applications for a DOM parser and for a SAX parser?

In the following cases, using a SAX parser is more advantageous than using a DOM parser.

The input document is too big for available memory (in this case SAX is actually your only choice of the two).
You can process the document in small contiguous chunks of input. You do not need the entire document before you can do useful work.
You just want to use the parser to extract the information of interest, and all your computation will be based entirely on data structures you create yourself. In most of our applications we do create data structures of our own, and they are usually not as complicated as the DOM tree. In this sense, I think, a DOM parser is needed less often than a SAX parser.

In the following cases, using a DOM parser is more advantageous than using a SAX parser.

Your application needs to access widely separated parts of the document at the same time.
Your application would probably use an internal data structure that is almost as complicated as the document itself.
Your application has to modify the document repeatedly.
Your application has to store the document for a significant amount of time through many method calls.

Example (Use a DOM parser or a SAX parser?):

Assume that an instructor has an XML document containing all the personal information of the students as well as the points the students earned in his class, and he is now assigning final grades to the students using an application. What he wants to produce is a list with the SSNs and the grades. We also assume that in his application the instructor uses no data structure such as arrays to store the students' personal information and points.

If the instructor decides to give A's to those who earned the class average or above, and B's to the others, then he had better use a DOM parser in his application. The reason is that he has no way of knowing the class average before the entire document has been processed. What he probably needs to do in his application is first to look through all the students' points and compute the average, and then to look through the document again and assign the final grade to each student by comparing the points the student earned to the class average.
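
The two passes over the tree might look like the sketch below. The element names (student, ssn, points) are my assumption, since the notes do not give the instructor's document format, and the JAXP API from the JDK is used instead of the Xerces classes. The result list exists only so that the grades can be returned; the points themselves are read straight from the DOM tree on both passes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AverageGradesDom {

    // Pass 1 computes the class average; pass 2 grades each student against it.
    static List<String> grade(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList students = doc.getElementsByTagName("student");
            int n = students.getLength();

            double total = 0;                       // first pass over the tree
            for (int i = 0; i < n; i++) {
                total += points((Element) students.item(i));
            }
            double average = (n == 0) ? 0 : total / n;

            List<String> result = new ArrayList<>();
            for (int i = 0; i < n; i++) {           // second pass over the same tree
                Element student = (Element) students.item(i);
                String grade = points(student) >= average ? "A" : "B";
                result.add(text(student, "ssn") + " " + grade);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Text of the first child element with the given tag name.
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    static int points(Element student) {
        return Integer.parseInt(text(student, "points"));
    }

    public static void main(String[] args) {
        String xml = "<students>"
            + "<student><ssn>111</ssn><points>95</points></student>"
            + "<student><ssn>222</ssn><points>70</points></student>"
            + "<student><ssn>333</ssn><points>84</points></student>"
            + "</students>";
        System.out.println(grade(xml));   // average is 83, so: [111 A, 222 B, 333 A]
    }
}
```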

If, however, the instructor adopts a grading policy in which students who got 90 points or more are assigned A's and the others are assigned B's, then he had probably better use a SAX parser. The reason is that, to assign each student a final grade, he does not need to wait for the entire document to be processed. He can assign a grade to a student as soon as the SAX parser has read that student's points.
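
A sketch of the fixed-threshold policy with SAX follows. It makes the same assumptions as before about element names (student, ssn, points, all hypothetical), and it further assumes that each student's ssn appears before the points. The grade is decided the moment the points element ends; the list is only there so the lines can be returned, and the handler could just as well print each line immediately.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ThresholdGradesSax extends DefaultHandler {

    private final List<String> lines = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String ssn;                       // SSN of the student being read

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        text.setLength(0);                    // start collecting text afresh
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);       // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (qName.equals("ssn")) {
            ssn = value;
        } else if (qName.equals("points")) {
            // the grade is known right away; no need to see the rest of the document
            lines.add(ssn + " " + (Integer.parseInt(value) >= 90 ? "A" : "B"));
        }
        text.setLength(0);
    }

    static List<String> grade(String xml) {
        ThresholdGradesSax handler = new ThresholdGradesSax();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        String xml = "<students>"
            + "<student><ssn>111</ssn><points>95</points></student>"
            + "<student><ssn>222</ssn><points>70</points></student>"
            + "</students>";
        System.out.println(grade(xml));   // prints [111 A, 222 B]
    }
}
```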

In the above analysis, we assumed that the instructor created no data structure of his own. What if he creates his own data structures, such as an array of strings to store the SSNs and an array of integers to store the points? In this case, I think SAX is the better choice, because it could save both memory and time and still get the job done.

One more consideration on this example. What if the instructor wants not to print a list, but to save the original document back with the grade of each student updated? In this case, a DOM parser should be the better choice no matter what grading policy he adopts. He does not need to create any data structure of his own. What he needs to do is first to modify the DOM tree (i.e., set the value of the 'grade' node) and then to save the whole modified tree. If he chooses to use a SAX parser instead of a DOM parser, then he has to create a data structure almost as complicated as a DOM tree before he can get the job done.
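
Modifying the tree and saving it back could be sketched as below. As before, the element names (student, points, grade) are assumptions, and the 90-point policy is used only to keep the sketch short. Serialization goes through the identity Transformer from javax.xml.transform in the JDK, which is one common way to write a DOM tree out; the notes themselves do not say how the tree is saved.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UpdateGradesDom {

    // Set the text of every <grade> element in place, then serialize the tree.
    static String regrade(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList students = doc.getElementsByTagName("student");
            for (int i = 0; i < students.getLength(); i++) {
                Element student = (Element) students.item(i);
                int points = Integer.parseInt(student
                        .getElementsByTagName("points").item(0)
                        .getTextContent().trim());
                Element grade =
                        (Element) student.getElementsByTagName("grade").item(0);
                grade.setTextContent(points >= 90 ? "A" : "B");   // modify the tree
            }

            // write the whole modified tree back out as text
            Transformer writer = TransformerFactory.newInstance().newTransformer();
            writer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            writer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<students>"
            + "<student><ssn>111</ssn><points>95</points><grade/></student>"
            + "<student><ssn>222</ssn><points>70</points><grade/></student>"
            + "</students>";
        System.out.println(regrade(xml));
    }
}
```

Everything the instructor did not touch (the SSNs, the points, any other personal information) is carried along by the tree and written back unchanged, which is exactly what would have to be rebuilt by hand with SAX.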

Is having two completely different ways (tree-based, event-based) to parse XML data a problem?

No. There are two completely different ways of parsing an XML document so that you can choose between them according to the characteristics of your application.

Do SAX and DOM support namespaces? If yes, how?

I am not sure about other parsers, but I am sure that both Xerces-J's SAXParser and DOMParser fully support namespaces. The following callback methods are provided in both of them (note that although callback methods are typically associated with SAX parsers, as I mentioned before, Xerces-J's DOMParser actually provides most of these callback methods as well).

void startNamespaceDeclScope(int prefix, int uri), which is a callback for the start of the scope of a namespace declaration
void endNamespaceDeclScope(int prefix), which is a callback for the end of the scope of a namespace declaration
protected boolean getNamespaces(), which returns true if the parser preprocesses namespaces
protected void setNamespaces(), which specifies whether the parser preprocesses namespaces

What's more, Xerces-J's DOM parser also has the following methods for namespaces in the Node interface:

java.lang.String getNamespaceURI(), which gets the namespace URI of this node
java.lang.String getLocalName(), which gets the local name of this node
java.lang.String getPrefix(), which gets the namespace prefix of this node
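
The three Node methods above can be tried out with the sketch below. It assumes the JAXP factories from the JDK instead of the Xerces-J classes; with JAXP, namespace processing has to be switched on with setNamespaceAware(true). For the SAX side the sketch uses startPrefixMapping() from the standard SAX2 ContentHandler interface, not the Xerces-specific callbacks listed above. The namespace URI http://example.org/shapes and the class name are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceDemo {

    // DOM: return "prefix|localName|namespaceURI" for the root element.
    static String describeRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);      // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getPrefix() + "|" + root.getLocalName()
                    + "|" + root.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: collect every namespace declaration reported by the parser.
    static List<String> prefixMappings(String xml) {
        List<String> mappings = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startPrefixMapping(String prefix, String uri) {
                        mappings.add(prefix + "=" + uri);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return mappings;
    }

    public static void main(String[] args) {
        String xml = "<s:shapes xmlns:s=\"http://example.org/shapes\"/>";
        System.out.println(describeRoot(xml));    // s|shapes|http://example.org/shapes
        System.out.println(prefixMappings(xml));  // [s=http://example.org/shapes]
    }
}
```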

What is the difference between a system ID and a public ID? Must both of them be unique?

??????

With an event-based API we do not build an internal tree for the whole XML document. How, then, does the document get parsed, and what does the data structure in memory look like while an XML document is being parsed?

The document gets parsed by the SAX parser reading the document and telling the client what it reads. A SAX parser itself does not create or leave anything in memory; it invokes the "callback" methods from time to time, depending on what it sees. What the data structure in memory looks like when parsing an XML document with an event-based parser depends entirely on the client. If the client creates no data structure, then no data structure is created or left in memory, either during or after the parsing.

What do you think about the NetBeans XML modules?

NetBeans is an open source IDE written in Java. Its XML modules provide generic XML support and infrastructure (please refer to http://xml.netbeans.org/user/features.html for more details). I have no experience working with NetBeans.

Can SAX and DOM parsers be used at the same time?

Yes, of course, because a DOM parser and a SAX parser are used independently of each other. For example, if your application needs to work on two XML documents and does different things with each document, you could use a DOM parser on one document and a SAX parser on the other, and then combine the results or make the two processing steps cooperate with each other.

How does the event-based parser notice that an event is happening, since these events are not like clicking a button or moving the mouse?

Clicking a button or moving the mouse can be thought of as events, but events can be thought of in a more general way. For example, in a switch statement in C, if the switched variable takes some value, the matching 'case' is taken and executed. At that moment we may also say that an event has occurred. A SAX parser reads the document character by character or token by token. Once certain patterns (such as a start tag or an end tag) are met, it treats the occurrences of these patterns as events and invokes the corresponding methods overridden by the client.
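
To show the idea only, here is a toy scanner of my own; it is not how Xerces or any real SAX parser is implemented, and it understands nothing beyond plain start tags, end tags and text (no attributes, comments, entities or error checking). It reads the input one character at a time and invokes a handler method whenever a complete pattern has been seen. The names ToyEventScanner, Handler, scan and events are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ToyEventScanner {

    // The "handler": the scanner calls these methods, the client implements them.
    interface Handler {
        void startTag(String name);
        void endTag(String name);
        void text(String data);
    }

    // Read the input one character at a time and fire a callback whenever
    // a complete pattern (start tag, end tag, run of text) has been seen.
    static void scan(String input, Handler handler) {
        StringBuilder buffer = new StringBuilder();
        boolean inTag = false;
        for (char c : input.toCharArray()) {
            if (c == '<') {                        // a text run ends, a tag begins
                if (buffer.length() > 0) {
                    handler.text(buffer.toString());
                }
                buffer.setLength(0);
                inTag = true;
            } else if (c == '>' && inTag) {        // the tag is complete: fire an event
                String tag = buffer.toString();
                if (tag.startsWith("/")) {
                    handler.endTag(tag.substring(1));
                } else {
                    handler.startTag(tag);
                }
                buffer.setLength(0);
                inTag = false;
            } else {
                buffer.append(c);                  // still inside a tag or a text run
            }
        }
    }

    // A client that simply records the events it is told about.
    static List<String> events(String input) {
        List<String> seen = new ArrayList<>();
        scan(input, new Handler() {
            public void startTag(String name) { seen.add("start:" + name); }
            public void endTag(String name)   { seen.add("end:" + name); }
            public void text(String data)     { seen.add("text:" + data); }
        });
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(events("<x>20</x>"));   // prints [start:x, text:20, end:x]
    }
}
```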

Can an XHTML parser (e.g., validator.w3.org) be used as an XML parser?

Yes, it can. But it will not offer as much functionality as an XML parser, because XHTML parsers like validator.w3.org mainly check well-formedness and validity.

Sample document for the example
<?xml version="1.0"?> 
<!DOCTYPE shapes [
<!ELEMENT shapes (circle)*>
<!ELEMENT circle (x,y,radius)>
<!ELEMENT x (#PCDATA)>
<!ELEMENT y (#PCDATA)>
<!ELEMENT radius (#PCDATA)>
<!ATTLIST circle color CDATA #IMPLIED>
]>

<shapes> 
          <circle color="BLUE"> 
                <x>20</x>
                <y>20</y>
                <radius>20</radius> 
          </circle>
          <circle color="RED" >
                <x>40</x>
                <y>40</y>
                <radius>20</radius> 
          </circle>
</shapes> 




Programs for the Example

Program with DOMParser
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;


public class shapes_DOM {
   static int numberOfCircles = 0;   // total number of circles seen
   static int x[] = new int[1000];   // X-coordinates of the centers
   static int y[] = new int[1000];   // Y-coordinates of the centers  
   static int r[] = new int[1000];   // radius of the circle
   static String color[] = new String[1000];  // colors of the circles 

   public static void main(String[] args) {   

      try{
         // create a DOMParser
         DOMParser parser=new DOMParser();
         parser.parse(args[0]);

         // get the DOM Document object
         Document doc=parser.getDocument();

         // get all the circle nodes
         NodeList nodelist = doc.getElementsByTagName("circle");
         numberOfCircles =  nodelist.getLength();

         // retrieve all info about the circles
         for(int i=0; i<nodelist.getLength(); i++) {

            // get one circle node
            Node node = nodelist.item(i);
  
            // get the color attribute 
            NamedNodeMap attrs = node.getAttributes();
            if(attrs.getLength() > 0)
               color[i]=(String)attrs.getNamedItem("color").getNodeValue();

            // get the child nodes of a circle node 
            NodeList childnodelist = node.getChildNodes();

            // get the x and y value 
            for(int j=0; j<childnodelist.getLength(); j++) {
               Node childnode = childnodelist.item(j);
               Node textnode = childnode.getFirstChild();//the only text node
               String childnodename=childnode.getNodeName(); 
               if(childnodename.equals("x")) 
                  x[i]= Integer.parseInt(textnode.getNodeValue().trim());
               else if(childnodename.equals("y")) 
                  y[i]= Integer.parseInt(textnode.getNodeValue().trim());
               else if(childnodename.equals("radius")) 
                  r[i]= Integer.parseInt(textnode.getNodeValue().trim());
            }

         }
         
         // print the result
         System.out.println("circles="+numberOfCircles);
         for(int i=0;i<numberOfCircles;i++) {
             String line="";
             line=line+"(x="+x[i]+",y="+y[i]+",r="+r[i]+",color="+color[i]+")";
             System.out.println(line);
         }

      }  catch (Exception e) {e.printStackTrace(System.err);}
   
    }

}

Program with SAXParser
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.apache.xerces.parsers.SAXParser;


public class shapes_SAX extends DefaultHandler {

   static int numberOfCircles = 0;   // total number of circles seen
   static int x[] = new int[1000];   // X-coordinates of the centers
   static int y[] = new int[1000];   // Y-coordinates of the centers
   static int r[] = new int[1000];   // radius of the circle
   static String color[] = new String[1000];  // colors of the circles

   static int flagX=0;    // 1 while the text of an <x> element is expected
   static int flagY=0;    // 1 while the text of a <y> element is expected
   static int flagR=0;    // 1 while the text of a <radius> element is expected

   // main method 
   public static void main(String[] args) {   
      try{
         shapes_SAX SAXHandler = new shapes_SAX (); // an instance of this class
         SAXParser parser=new SAXParser();          // create a SAXParser object 
         parser.setContentHandler(SAXHandler);      // register with the ContentHandler 
         parser.parse(args[0]);
      }  catch (Exception e) {e.printStackTrace(System.err);}  // catch exceptions
   }

   // override the startElement() method
   public void startElement(String uri, String localName, 
                       String rawName, Attributes attributes) {
         if(rawName.equals("circle"))                      // if a circle element is seen
            color[numberOfCircles]=attributes.getValue("color");  // get the color attribute 
         
         else if(rawName.equals("x"))      // if an x element is seen, set flagX to 1
            flagX=1;
         else if(rawName.equals("y"))      // if a y element is seen, set flagY to 1
            flagY=1;
         else if(rawName.equals("radius")) // if a radius element is seen, set flagR to 1
            flagR=1;
   }

   // override the endElement() method
   public void endElement(String uri, String localName, String rawName) {
         // in this example we do not need to do anything else here
         if(rawName.equals("circle"))                       // if a circle element is ended 
            numberOfCircles +=  1;                          // increment the counter 
   }

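   // override the characters() method
   // note: a SAX parser is allowed to deliver the text of one element in
   // several chunks; this simple example assumes each number arrives in one call
   public void characters(char characters[], int start, int length) {
         String characterData = 
             (new String(characters,start,length)).trim(); // get the text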
         
         if(flagX==1) {        // indicate this text is for <x> element 
             x[numberOfCircles] = Integer.parseInt(characterData);
             flagX=0;
         }
         else if(flagY==1) {  // indicate this text is for <y> element 
             y[numberOfCircles] = Integer.parseInt(characterData);
             flagY=0;
         }
         else if(flagR==1) {  // indicate this text is for <radius> element 
             r[numberOfCircles] = Integer.parseInt(characterData);
             flagR=0;
         }
   }

   // override the endDocument() method
   public void endDocument() {
         // when the end of document is seen, just print the circle info 
         System.out.println("circles="+numberOfCircles);
         for(int i=0;i<numberOfCircles;i++) {
             String line="";
             line=line+"(x="+x[i]+",y="+y[i]+",r="+r[i]+",color="+color[i]+")";
             System.out.println(line);
         }
   }
   

}