org.htmlparser
Class AbstractNode

java.lang.Object
  extended by org.htmlparser.AbstractNode
All Implemented Interfaces:
Node, Serializable
Direct Known Subclasses:
RemarkNode, StringNode, TagNode

public abstract class AbstractNode
extends Object
implements Node, Serializable

AbstractNode, which implements the Node interface, is the base class for all types of nodes, including tags, string elements, and remarks.

See Also:
Serialized Form

Field Summary
protected  NodeList children
          The children of this node.
protected  Page mPage
          The page this node came from.
protected  int nodeBegin
          The beginning position of the tag in the line
protected  int nodeEnd
          The ending position of the tag in the line
protected  Node parent
          The parent of this node.
 
Constructor Summary
AbstractNode(Page page, int start, int end)
          Create an abstract node with the page positions given.
 
Method Summary
abstract  void accept(Object visitor)
          Apply the visitor object (of type NodeVisitor) to this node.
 void collectInto(NodeList list, NodeFilter filter)
          Collect this node and its child nodes (if applicable) into the list parameter, provided the node satisfies the filtering criteria.
 void doSemanticAction()
          Perform the meaning of this tag.
 int elementBegin()
          Deprecated. Use getStartPosition().
 int elementEnd()
          Deprecated. Use getEndPosition().
 NodeList getChildren()
          Get the children of this node.
 int getEndPosition()
          Gets the ending position of the node.
 Page getPage()
          Get the page this node came from.
 Node getParent()
          Get the parent of this node.
 int getStartPosition()
          Gets the starting position of the node.
 String getText()
          Returns the text of the node.
 void setChildren(NodeList children)
          Set the children of this node.
 void setEndPosition(int position)
          Sets the ending position of the node.
 void setPage(Page page)
          Set the page this node came from.
 void setParent(Node node)
          Sets the parent of this node.
 void setStartPosition(int position)
          Sets the starting position of the node.
 void setText(String text)
          Sets the string contents of the node.
abstract  String toHtml()
          Makes it easier to use the HTML parser to reproduce HTML pages (with or without modifications). Applications reproducing HTML can call this method on nodes that are to be used or transferred as they were received, with the original HTML.
 String toHTML()
          Deprecated. Use toHtml() instead.
abstract  String toPlainTextString()
          Returns a string representation of the node.
abstract  String toString()
          Return the string representation of the node.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

mPage

protected Page mPage
The page this node came from.


nodeBegin

protected int nodeBegin
The beginning position of the tag in the line


nodeEnd

protected int nodeEnd
The ending position of the tag in the line


parent

protected Node parent
The parent of this node.


children

protected NodeList children
The children of this node.

Constructor Detail

AbstractNode

public AbstractNode(Page page,
                    int start,
                    int end)
Create an abstract node with the page positions given. Remember the page and start & end cursor positions.

Parameters:
page - The page this tag was read from.
start - The starting offset of this node within the page.
end - The ending offset of this node within the page.
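
As an illustration of what remembering the page and cursor positions makes possible, the following is a minimal sketch in which a node recovers its own text from the offsets it was constructed with. PageSketch and OffsetNodeSketch are hypothetical stand-ins rather than the library's Page and AbstractNode, and the exclusive end offset is an assumption made only for this example.

```java
/** Hypothetical stand-in for Page: it only wraps the source text. */
class PageSketch {
    private final String source;

    PageSketch(String source) {
        this.source = source;
    }

    /** Returns the characters in [start, end); the exclusive end is an assumption. */
    String getText(int start, int end) {
        return source.substring(start, end);
    }
}

/** Hypothetical node that stores the page and the two offsets it was given. */
class OffsetNodeSketch {
    protected PageSketch mPage;
    protected int nodeBegin;
    protected int nodeEnd;

    OffsetNodeSketch(PageSketch page, int start, int end) {
        mPage = page;
        nodeBegin = start;
        nodeEnd = end;
    }

    int getStartPosition() {
        return nodeBegin;
    }

    int getEndPosition() {
        return nodeEnd;
    }

    /** Recovers this node's text from the page using the stored offsets. */
    String getText() {
        return mPage.getText(nodeBegin, nodeEnd);
    }
}

class OffsetSketchDemo {
    public static void main(String[] args) {
        PageSketch page = new PageSketch("<p>Hello</p>");
        // Offsets 3..8 cover the characters between the two tags.
        OffsetNodeSketch node = new OffsetNodeSketch(page, 3, 8);
        System.out.println(node.getText());
    }
}
```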
Method Detail

toPlainTextString

public abstract String toPlainTextString()
Returns a string representation of the node. This is an important method: it allows a simple string transformation of a web page, regardless of the type of node.
Typical application code (for extracting only the text from a web page) would then be simplified to:
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
    Node node = e.nextNode();
    // Or do whatever processing you wish with the plain text string
    System.out.println(node.toPlainTextString());
 }
 

Specified by:
toPlainTextString in interface Node

toHtml

public abstract String toHtml()
Makes it easier to use the HTML parser to reproduce HTML pages (with or without modifications). Applications reproducing HTML can call this method on nodes that are to be used or transferred as they were received, with the original HTML.
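
The contrast with toPlainTextString() can be sketched as follows. This is a minimal illustration using hypothetical stand-in classes (MarkupSketch, TagSketch, StringSketch), not the library's own node types. How the real subclasses render themselves is assumed here rather than taken from their source.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a node with both rendering methods. */
abstract class MarkupSketch {
    abstract String toHtml();
    abstract String toPlainTextString();
}

/** Hypothetical tag: reproduces its markup, contributes no plain text. */
class TagSketch extends MarkupSketch {
    private final String contents; // for example "b" or "/b"

    TagSketch(String contents) {
        this.contents = contents;
    }

    @Override
    String toHtml() {
        return "<" + contents + ">";
    }

    @Override
    String toPlainTextString() {
        return "";
    }
}

/** Hypothetical string node: the same text in both renderings. */
class StringSketch extends MarkupSketch {
    private final String text;

    StringSketch(String text) {
        this.text = text;
    }

    @Override
    String toHtml() {
        return text;
    }

    @Override
    String toPlainTextString() {
        return text;
    }
}

class ToHtmlSketch {
    /** A three-node sample standing in for the markup <b>bold</b>. */
    static List<MarkupSketch> sample() {
        return Arrays.<MarkupSketch>asList(
                new TagSketch("b"), new StringSketch("bold"), new TagSketch("/b"));
    }

    /** Concatenates toHtml() over all nodes to reproduce the markup. */
    static String html(List<MarkupSketch> nodes) {
        StringBuilder out = new StringBuilder();
        for (MarkupSketch node : nodes) {
            out.append(node.toHtml());
        }
        return out.toString();
    }

    /** Concatenates toPlainTextString() over all nodes to keep only the text. */
    static String plain(List<MarkupSketch> nodes) {
        StringBuilder out = new StringBuilder();
        for (MarkupSketch node : nodes) {
            out.append(node.toPlainTextString());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(html(sample()));
        System.out.println(plain(sample()));
    }
}
```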

Specified by:
toHtml in interface Node

toString

public abstract String toString()
Return the string representation of the node. Subclasses must define this method, which is typically used in the manner
System.out.println(node)

Specified by:
toString in interface Node
Returns:
java.lang.String

collectInto

public void collectInto(NodeList list,
                        NodeFilter filter)
Collect this node and its child nodes (if applicable) into the list parameter, provided the node satisfies the filtering criteria.

This mechanism allows powerful filtering code to be written very easily, without having to collect embedded tags separately. For example, when we try to get all the links on a page, it is not possible to get them all at the top level, as many tags (like form tags) can contain links embedded in them. We could get the links out by checking whether the current node is a CompositeTag and going through its children; this method provides a convenient way to do that.

Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look like:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagNameFilter ("A");
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);
 
Thus, collectionList will hold all the link nodes, irrespective of how deep the links are embedded.

Another way to accomplish the same objective is:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagClassFilter (LinkTag.class);
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);
 
This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.
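
To make the recursion concrete, here is a minimal, self-contained sketch of how a collectInto-style traversal could work. SketchNode and SketchFilter are hypothetical stand-ins, not the library's Node and NodeFilter, and the actual implementations in the node subclasses may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for NodeFilter: decides whether a node is kept. */
interface SketchFilter {
    boolean accept(SketchNode node);
}

/** Hypothetical stand-in for a node that may hold children. */
class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, SketchNode... kids) {
        this.name = name;
        for (SketchNode kid : kids) {
            children.add(kid);
        }
    }

    /** Adds this node if it passes the filter, then recurses into its children. */
    void collectInto(List<SketchNode> list, SketchFilter filter) {
        if (filter.accept(this)) {
            list.add(this);
        }
        for (SketchNode child : children) {
            child.collectInto(list, filter);
        }
    }
}

class CollectIntoSketch {
    /** Counts the "A" nodes in a small tree where one link sits inside a form. */
    static int countLinks() {
        SketchNode root = new SketchNode("BODY",
                new SketchNode("A"),
                new SketchNode("FORM", new SketchNode("A")));
        List<SketchNode> found = new ArrayList<>();
        root.collectInto(found, node -> node.name.equals("A"));
        return found.size();
    }

    public static void main(String[] args) {
        // Both links are found, including the one nested in the form.
        System.out.println(countLinks());
    }
}
```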

Specified by:
collectInto in interface Node

elementBegin

public int elementBegin()
Deprecated. Use getStartPosition().

Returns the beginning position of the tag.

Specified by:
elementBegin in interface Node

elementEnd

public int elementEnd()
Deprecated. Use getEndPosition().

Returns the ending position of the tag.

Specified by:
elementEnd in interface Node

getPage

public Page getPage()
Get the page this node came from.

Returns:
The page that supplied this node.

setPage

public void setPage(Page page)
Set the page this node came from.

Parameters:
page - The page that supplied this node.

getStartPosition

public int getStartPosition()
Gets the starting position of the node.

Specified by:
getStartPosition in interface Node
Returns:
The start position.

setStartPosition

public void setStartPosition(int position)
Sets the starting position of the node.

Specified by:
setStartPosition in interface Node
Parameters:
position - The new start position.

getEndPosition

public int getEndPosition()
Gets the ending position of the node.

Specified by:
getEndPosition in interface Node
Returns:
The end position.

setEndPosition

public void setEndPosition(int position)
Sets the ending position of the node.

Specified by:
setEndPosition in interface Node
Parameters:
position - The new end position.

accept

public abstract void accept(Object visitor)
Description copied from interface: Node
Apply the visitor object (of type NodeVisitor) to this node.
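
Because the parameter is declared as Object, an implementation has to cast before it can dispatch. The following is a minimal sketch of that double-dispatch idea. SketchVisitor, VisitableSketch, TextSketch and RemarkSketch are hypothetical names introduced for illustration, and the callbacks shown are not claimed to match NodeVisitor's actual methods.

```java
/** Hypothetical visitor with one callback per node kind. */
interface SketchVisitor {
    void visitText(TextSketch text);
    void visitRemark(RemarkSketch remark);
}

/** Hypothetical base class mirroring the accept(Object) signature. */
abstract class VisitableSketch {
    abstract void accept(Object visitor);
}

class TextSketch extends VisitableSketch {
    final String text;

    TextSketch(String text) {
        this.text = text;
    }

    @Override
    void accept(Object visitor) {
        // The parameter is an Object, so cast before dispatching.
        ((SketchVisitor) visitor).visitText(this);
    }
}

class RemarkSketch extends VisitableSketch {
    final String comment;

    RemarkSketch(String comment) {
        this.comment = comment;
    }

    @Override
    void accept(Object visitor) {
        ((SketchVisitor) visitor).visitRemark(this);
    }
}

class AcceptSketch {
    /** Visits each node in order and records which callback it triggered. */
    static String describe(VisitableSketch[] nodes) {
        final StringBuilder out = new StringBuilder();
        SketchVisitor visitor = new SketchVisitor() {
            @Override
            public void visitText(TextSketch text) {
                out.append("text:").append(text.text).append(';');
            }

            @Override
            public void visitRemark(RemarkSketch remark) {
                out.append("remark:").append(remark.comment).append(';');
            }
        };
        for (VisitableSketch node : nodes) {
            node.accept(visitor);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new VisitableSketch[] {
            new TextSketch("hi"), new RemarkSketch("note")
        }));
    }
}
```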

Specified by:
accept in interface Node

toHTML

public final String toHTML()
Deprecated. Use toHtml() instead.


getParent

public Node getParent()
Get the parent of this node. This will always return null when parsing without scanners, i.e. if semantic parsing was not performed. The object returned from this method can be safely cast to a CompositeTag.
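
Since a null parent marks a node with no enclosing tag, callers that walk upwards need to stop at null. A minimal sketch of such a walk, using a hypothetical ParentSketch class rather than the library's Node:

```java
/** Hypothetical stand-in for a node with a settable parent. */
class ParentSketch {
    private ParentSketch parent; // stays null until setParent is called

    ParentSketch getParent() {
        return parent;
    }

    void setParent(ParentSketch node) {
        parent = node;
    }

    /** Counts the ancestors above this node by following parent links until null. */
    int depth() {
        int depth = 0;
        for (ParentSketch p = parent; p != null; p = p.getParent()) {
            depth++;
        }
        return depth;
    }
}

class GetParentSketch {
    public static void main(String[] args) {
        ParentSketch form = new ParentSketch();
        ParentSketch link = new ParentSketch();
        link.setParent(form);
        System.out.println(form.depth()); // no parent was set
        System.out.println(link.depth()); // one enclosing node
    }
}
```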

Specified by:
getParent in interface Node
Returns:
The parent of this node, if it's been set, null otherwise.

setParent

public void setParent(Node node)
Sets the parent of this node.

Specified by:
setParent in interface Node
Parameters:
node - The node that contains this node. Must be a CompositeTag.

getChildren

public NodeList getChildren()
Get the children of this node.

Specified by:
getChildren in interface Node
Returns:
The list of children contained by this node, if it's been set, null otherwise.

setChildren

public void setChildren(NodeList children)
Set the children of this node.

Specified by:
setChildren in interface Node
Parameters:
children - The new list of children this node contains.

getText

public String getText()
Returns the text of the node.

Specified by:
getText in interface Node

setText

public void setText(String text)
Sets the string contents of the node.

Specified by:
setText in interface Node
Parameters:
text - The new text for the node.

doSemanticAction

public void doSemanticAction()
                      throws ParserException
Perform the meaning of this tag. The default action is to do nothing.
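
A do-nothing default of this kind is usually meant to be overridden where a node has some effect of its own. The sketch below shows that pattern under stated assumptions: ActionNodeSketch, BaseUrlSketch and SketchParserException are hypothetical names, and the base-URL behaviour is invented purely for illustration, not taken from the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for ParserException. */
class SketchParserException extends Exception {
    SketchParserException(String message) {
        super(message);
    }
}

/** Hypothetical base node whose semantic action does nothing by default. */
class ActionNodeSketch {
    void doSemanticAction() throws SketchParserException {
        // Default: no action.
    }
}

/** Hypothetical tag that records a URL when its action runs. */
class BaseUrlSketch extends ActionNodeSketch {
    private final String href;
    private final List<String> sink;

    BaseUrlSketch(String href, List<String> sink) {
        this.href = href;
        this.sink = sink;
    }

    @Override
    void doSemanticAction() throws SketchParserException {
        if (href.isEmpty()) {
            throw new SketchParserException("empty href");
        }
        sink.add(href);
    }
}

class SemanticActionSketch {
    /** Runs the action on a plain node and on an overriding node. */
    static List<String> demo() {
        List<String> recorded = new ArrayList<>();
        List<ActionNodeSketch> nodes = new ArrayList<>();
        nodes.add(new ActionNodeSketch());
        nodes.add(new BaseUrlSketch("http://example.com/", recorded));
        try {
            for (ActionNodeSketch node : nodes) {
                node.doSemanticAction();
            }
        } catch (SketchParserException e) {
            throw new IllegalStateException(e);
        }
        return recorded;
    }

    public static void main(String[] args) {
        // Only the overriding node contributes an entry.
        System.out.println(demo());
    }
}
```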

Specified by:
doSemanticAction in interface Node
Throws:
ParserException

© 2004 Somik Raha
Mar 14, 2004

HTML Parser is an open source library released under LGPL.
SourceForge.net