
org.htmlparser
Interface Node

All Known Implementing Classes:
AbstractNode, AbstractNodeDecorator

public interface Node


Method Summary
 void accept(Object visitor)
          Apply the visitor object (of type NodeVisitor) to this node.
 void collectInto(NodeList collectionList, NodeFilter filter)
          Collect this node and its child nodes (if applicable) into the collectionList parameter, provided the node satisfies the filtering criteria.
 void doSemanticAction()
          Perform the meaning of this tag.
 int elementBegin()
          Returns the beginning position of the tag.
          Deprecated. Use getStartPosition().
 int elementEnd()
          Returns the ending position of the tag.
          Deprecated. Use getEndPosition().
 NodeList getChildren()
          Get the children of this node.
 int getEndPosition()
          Gets the ending position of the node.
 Node getParent()
          Get the parent of this node.
 int getStartPosition()
          Gets the starting position of the node.
 String getText()
          Returns the text of the node.
 void setChildren(NodeList children)
          Set the children of this node.
 void setEndPosition(int position)
          Sets the ending position of the node.
 void setParent(Node node)
          Sets the parent of this node.
 void setStartPosition(int position)
          Sets the starting position of the node.
 void setText(String text)
          Sets the string contents of the node.
 String toHtml()
          Makes it easier to reproduce HTML pages (with or without modifications) when using the HTML parser. Applications that reproduce HTML can call this method on nodes that are to be used or transferred as they were received, with the original HTML.
 String toPlainTextString()
          Returns a string representation of the node.
 String toString()
          Return the string representation of the node.
 

Method Detail

toPlainTextString

public String toPlainTextString()
Returns a string representation of the node. This is an important method: it allows a simple string transformation of a web page, regardless of the type of node.
Typical application code (for extracting only the text from a web page) would then be simplified to:
 Node node;
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
    node = e.nextNode();
    // Or do whatever processing you wish with the plain text string
    System.out.println(node.toPlainTextString());
 }
 


toHtml

public String toHtml()
Makes it easier to reproduce HTML pages (with or without modifications) when using the HTML parser. Applications that reproduce HTML can call this method on nodes that are to be used or transferred as they were received, with the original HTML.


toString

public String toString()
Return the string representation of the node. Subclasses must define this method. It is typically used in the following manner:
System.out.println(node);

Returns:
java.lang.String

collectInto

public void collectInto(NodeList collectionList,
                        NodeFilter filter)
Collect this node and its child nodes (if applicable) into the collectionList parameter, provided the node satisfies the filtering criteria.

This mechanism allows powerful filtering code to be written very easily, without having to collect embedded tags separately. For example, when we try to get all the links on a page, it is not possible to get them all at the top level, because many tags (like form tags) can contain links embedded in them. We could get the links out by checking whether the current node is a CompositeTag and going through its children; this method provides a convenient way to do that.

Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look like:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagNameFilter ("A");
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);
 
Thus, collectionList will hold all the link nodes, irrespective of how deep the links are embedded.

Another way to accomplish the same objective is:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagClassFilter (LinkTag.class);
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);
 
This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.
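
The recursion that collectInto() is described as performing can be illustrated without the library. The following is only a minimal, self-contained sketch: CollectIntoSketch, SketchNode and SketchFilter are hypothetical stand-ins invented for this example, not the org.htmlparser classes, and the real implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the org.htmlparser types.
class CollectIntoSketch {

    /** Stand-in for NodeFilter: decides whether a node should be collected. */
    interface SketchFilter {
        boolean accept(SketchNode node);
    }

    /** Stand-in for a node that may contain children. */
    static class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode kid : kids) {
                children.add(kid);
            }
        }

        /** Adds this node if it matches, then recurses into every child. */
        void collectInto(List<SketchNode> collectionList, SketchFilter filter) {
            if (filter.accept(this)) {
                collectionList.add(this);
            }
            for (SketchNode child : children) {
                child.collectInto(collectionList, filter);
            }
        }
    }

    /** Builds a small tree with one top-level link and one link nested in a form. */
    static List<SketchNode> collectLinks() {
        SketchNode page = new SketchNode("BODY",
                new SketchNode("A"),
                new SketchNode("FORM", new SketchNode("INPUT"), new SketchNode("A")));
        List<SketchNode> found = new ArrayList<>();
        page.collectInto(found, node -> node.name.equals("A"));
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collectLinks().size() + " links collected");
    }
}
```

Because each node forwards the call to its children, the caller never needs to know how deeply a matching node is nested.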


elementBegin

public int elementBegin()
Returns the beginning position of the tag.
Deprecated. Use getStartPosition().


elementEnd

public int elementEnd()
Returns the ending position of the tag.
Deprecated. Use getEndPosition().


getStartPosition

public int getStartPosition()
Gets the starting position of the node.

Returns:
The start position.

setStartPosition

public void setStartPosition(int position)
Sets the starting position of the node.

Parameters:
position - The new start position.

getEndPosition

public int getEndPosition()
Gets the ending position of the node.

Returns:
The end position.

setEndPosition

public void setEndPosition(int position)
Sets the ending position of the node.

Parameters:
position - The new end position.

accept

public void accept(Object visitor)
Apply the visitor object (of type NodeVisitor) to this node.
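
As a rough illustration of the visitor idea, the sketch below shows how a node can hand itself to the matching callback of a visitor passed in as an Object. All names here (AcceptSketch, SketchVisitor, VisitableSketch, TextSketch, TagSketch) are hypothetical stand-ins; the callbacks of the real NodeVisitor are not shown on this page and are not assumed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the org.htmlparser API.
class AcceptSketch {

    /** Stand-in visitor with one callback per node kind. */
    interface SketchVisitor {
        void visitText(TextSketch text);
        void visitTag(TagSketch tag);
    }

    /** Stand-in for Node: accept takes Object, mirroring the signature above. */
    interface VisitableSketch {
        void accept(Object visitor);
    }

    static class TextSketch implements VisitableSketch {
        final String text;

        TextSketch(String text) {
            this.text = text;
        }

        @Override
        public void accept(Object visitor) {
            // The cast throws ClassCastException if a non-visitor is passed.
            ((SketchVisitor) visitor).visitText(this);
        }
    }

    static class TagSketch implements VisitableSketch {
        final String name;

        TagSketch(String name) {
            this.name = name;
        }

        @Override
        public void accept(Object visitor) {
            ((SketchVisitor) visitor).visitTag(this);
        }
    }

    /** Visits a fixed list of nodes and records which callback each one triggered. */
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        SketchVisitor visitor = new SketchVisitor() {
            @Override
            public void visitText(TextSketch t) {
                log.add("text:" + t.text);
            }

            @Override
            public void visitTag(TagSketch t) {
                log.add("tag:" + t.name);
            }
        };
        List<VisitableSketch> nodes = new ArrayList<>();
        nodes.add(new TagSketch("B"));
        nodes.add(new TextSketch("hello"));
        for (VisitableSketch node : nodes) {
            node.accept(visitor);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```

Since the parameter is declared as Object, a wrong argument type is only detected at run time, when the cast inside accept fails.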


getParent

public Node getParent()
Get the parent of this node. This will always return null when parsing without scanners, i.e. if semantic parsing was not performed. The object returned from this method can be safely cast to a CompositeTag.

Returns:
The parent of this node, if it's been set, null otherwise.
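
Because getParent() may return null, code that walks up the tree should treat null as the end of the chain. The following is a minimal sketch using a hypothetical SketchNode class in place of the real Node interface; it shows one way such a walk might look.

```java
// Hypothetical stand-in for illustration only; not the org.htmlparser Node interface.
class ParentChainSketch {

    static class SketchNode {
        final String name;
        private SketchNode parent; // stays null until setParent is called

        SketchNode(String name) {
            this.name = name;
        }

        SketchNode getParent() {
            return parent;
        }

        void setParent(SketchNode node) {
            this.parent = node;
        }
    }

    /** Counts ancestors by following getParent() until it returns null. */
    static int depth(SketchNode node) {
        int count = 0;
        for (SketchNode p = node.getParent(); p != null; p = p.getParent()) {
            count++;
        }
        return count;
    }

    /** Builds the chain FORM > TD > A and returns the depth of the A node. */
    static int depthOfNestedLink() {
        SketchNode form = new SketchNode("FORM");
        SketchNode cell = new SketchNode("TD");
        SketchNode link = new SketchNode("A");
        cell.setParent(form);
        link.setParent(cell);
        return depth(link);
    }

    public static void main(String[] args) {
        System.out.println("depth of A: " + depthOfNestedLink());
    }
}
```

A node whose parent was never set simply reports a depth of zero, which matches the case described above where no semantic parsing was performed.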

setParent

public void setParent(Node node)
Sets the parent of this node.

Parameters:
node - The node that contains this node. Must be a CompositeTag.

getChildren

public NodeList getChildren()
Get the children of this node.

Returns:
The list of children contained by this node, if it's been set, null otherwise.

setChildren

public void setChildren(NodeList children)
Set the children of this node.

Parameters:
children - The new list of children this node contains.

getText

public String getText()
Returns the text of the node.


setText

public void setText(String text)
Sets the string contents of the node.

Parameters:
text - The new text for the node.

doSemanticAction

public void doSemanticAction()
                      throws ParserException
Perform the meaning of this tag. This is defined by the tag; for example, the bold tag <B> may switch bold text on and off. Only a few tags have semantic meaning to the parser. These have to do with the character set to use (<META>) and the base URL to use (<BASE>). Other than that, the semantic meaning is up to the application and its custom nodes.

Throws:
ParserException
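
To make the idea of a semantic action concrete, here is a minimal sketch in which a stand-in <BASE> tag records the base URL while an ordinary tag does nothing. Everything in it is hypothetical: SemanticActionSketch, SketchException, PageState, PlainTag and BaseTag are invented for this example, and the PageState parameter is an assumption made to keep the sketch self-contained (the real doSemanticAction() takes no arguments). The URL is a placeholder.

```java
// Hypothetical stand-ins for illustration only; not the org.htmlparser classes.
class SemanticActionSketch {

    /** Stand-in for ParserException. */
    static class SketchException extends Exception {
        SketchException(String message) {
            super(message);
        }
    }

    /** Stand-in for the parser state that a semantic action may update. */
    static class PageState {
        String baseUrl = "";
    }

    /** Stand-in for an ordinary tag: no meaning to the parser, so nothing happens. */
    static class PlainTag {
        void doSemanticAction(PageState page) throws SketchException {
            // Intentionally empty.
        }
    }

    /** Stand-in for a <BASE> tag: its semantic action records the base URL. */
    static class BaseTag extends PlainTag {
        final String href;

        BaseTag(String href) {
            this.href = href;
        }

        @Override
        void doSemanticAction(PageState page) throws SketchException {
            if (href == null || href.isEmpty()) {
                throw new SketchException("BASE tag without an href");
            }
            page.baseUrl = href;
        }
    }

    /** Runs the actions of a plain tag and a BASE tag, then reports the base URL. */
    static String baseUrlAfter(String href) {
        PageState page = new PageState();
        try {
            new PlainTag().doSemanticAction(page);
            new BaseTag(href).doSemanticAction(page);
            return page.baseUrl;
        } catch (SketchException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(baseUrlAfter("http://example.com/docs/"));
    }
}
```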

© 2004 Somik Raha
Mar 14, 2004

HTML Parser is an open source library released under LGPL.