
org.htmlparser
Class Parser

java.lang.Object
  extended by org.htmlparser.Parser
All Implemented Interfaces:
Serializable

public class Parser
extends Object
implements Serializable

This is the class most users will work with, either to obtain an iterator over the HTML page or to parse the page directly and print the results.
Typical usage of the parser is as follows:
[1] Create a parser object, passing the URL and a feedback object to the parser.
[2] Enumerate through the elements from the parser object.
Note that parsing occurs on demand, as you enumerate. This approach is thread-safe, and control returns to you only after a particular element has been parsed and returned; that element could be the entire body.
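
The on-demand behaviour described above can be pictured with a small, self-contained sketch. It does not use or reproduce the library's lexer: LazyChunkIterator is a hypothetical stand-in that splits a string into tag and text chunks, doing the scanning only when the caller asks for the next chunk.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for on-demand parsing; not part of org.htmlparser. */
final class LazyChunkIterator implements Iterator<String> {
    private final String html;
    private int pos = 0;       // index of the next unread character
    private int produced = 0;  // number of chunks handed out so far

    LazyChunkIterator(String html) {
        this.html = html;
    }

    @Override
    public boolean hasNext() {
        return pos < html.length();
    }

    /** Scans just far enough to return one chunk: a "<...>" tag or a run of text. */
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end;
        if (html.charAt(pos) == '<') {
            int close = html.indexOf('>', pos);
            end = (close < 0) ? html.length() : close + 1;
        } else {
            int open = html.indexOf('<', pos);
            end = (open < 0) ? html.length() : open;
        }
        String chunk = html.substring(pos, end);
        pos = end;
        produced++;
        return chunk;
    }

    /** Shows that no work happens until next() is called. */
    int producedSoFar() {
        return produced;
    }

    public static void main(String[] args) {
        LazyChunkIterator it = new LazyChunkIterator("<p>Hello</p><br>");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Constructing the iterator does no scanning; each call to next() advances only as far as one chunk, which is the sense in which control returns to the caller after each element.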

See Also:
elements(), Serialized Form

Field Summary
protected static String lineSeparator
          Variable to store lineSeparator.
protected static Map mDefaultRequestProperties
          Default Request header fields.
protected  ParserFeedback mFeedback
          Feedback object.
protected  Lexer mLexer
          The html lexer associated with this parser.
static ParserFeedback noFeedback
          A quiet message sink.
static ParserFeedback stdout
          A verbose message sink.
static String VERSION_DATE
          The date of the version.
static double VERSION_NUMBER
          The floating point version number.
static String VERSION_STRING
          The display version.
static String VERSION_TYPE
          The type of version.
 
Constructor Summary
Parser()
          Zero argument constructor.
Parser(Lexer lexer)
          This constructor is present to enable users to plug in their own lexers.
Parser(Lexer lexer, ParserFeedback fb)
          This constructor enables the construction of test cases, with readers associated with test string buffers.
Parser(String resourceLocn)
          Creates a Parser object with the location of the resource (URL or file).
Parser(String resourceLocn, ParserFeedback feedback)
          Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
Parser(URLConnection connection)
          Constructor for non-standard access.
Parser(URLConnection connection, ParserFeedback fb)
          Constructor for custom HTTP access.
 
Method Summary
static Parser createParser(String inputHTML)
          Creates the parser on an input string.
 NodeIterator elements()
          Returns an iterator (enumeration) to the html nodes.
 Node[] extractAllNodesThatAre(Class nodeType)
          Convenience method to extract all nodes of a given class type.
 NodeList extractAllNodesThatMatch(NodeFilter filter)
          Extract all nodes matching the given filter.
 URLConnection getConnection()
          Return the current connection.
static Map getDefaultRequestProperties()
          Get the current default request header properties.
 String getEncoding()
          Get the encoding for the page this parser is reading from.
 ParserFeedback getFeedback()
          Returns the feedback.
 Lexer getLexer()
          Returns the lexer associated with the parser.
static String getLineSeparator()
           
 NodeFactory getNodeFactory()
          Get the current node factory.
 String getURL()
          Return the current URL being parsed.
static String getVersion()
          Return the version string of this parser.
static double getVersionNumber()
          Return the version number of this parser.
static void main(String[] args)
          The main program, which can be executed from the command line
static URLConnection openConnection(String string, ParserFeedback feedback)
          Opens a connection based on a given string.
static URLConnection openConnection(URL url, ParserFeedback feedback)
          Opens a connection using the given url.
 void parse(NodeFilter filter)
          Parse the given resource, using the filter provided.
 void reset()
          Reset the parser to start from the beginning again.
 void setConnection(URLConnection connection)
          Set the connection for this parser.
static void setDefaultRequestProperties(Map properties)
          Set the default request header properties.
 void setEncoding(String encoding)
          Set the encoding for the page this parser is reading from.
 void setFeedback(ParserFeedback fb)
          Sets the feedback object used in scanning.
 void setInputHTML(String inputHTML)
          Initializes the parser with the given input HTML String.
 void setLexer(Lexer lexer)
          Set the lexer for this parser.
static void setLineSeparator(String lineSeparatorString)
           
 void setNodeFactory(NodeFactory factory)
          Set the current node factory.
 void setURL(String url)
          Set the URL for this parser.
 void visitAllNodesWith(NodeVisitor visitor)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION_NUMBER

public static final double VERSION_NUMBER
The floating point version number.

See Also:
Constant Field Values

VERSION_TYPE

public static final String VERSION_TYPE
The type of version.

See Also:
Constant Field Values

VERSION_DATE

public static final String VERSION_DATE
The date of the version.

See Also:
Constant Field Values

VERSION_STRING

public static final String VERSION_STRING
The display version.

See Also:
Constant Field Values

mDefaultRequestProperties

protected static Map mDefaultRequestProperties
Default Request header fields. So far this is just "User-Agent".


mFeedback

protected ParserFeedback mFeedback
Feedback object.


mLexer

protected Lexer mLexer
The html lexer associated with this parser.


lineSeparator

protected static String lineSeparator
Variable to store lineSeparator. This is set up to read line.separator from the system properties. However, it can also be changed using the mutator methods. It is used in the toHTML() methods of all the subclasses of Node.
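
As a rough illustration of the pattern just described, the sketch below keeps a static separator that defaults to the platform value and can be overridden. LineSeparatorHolder and its join helper are hypothetical names for illustration only; they are not the library's code.

```java
/** Hypothetical holder mirroring the field description; not the library's class. */
final class LineSeparatorHolder {
    // Initialised from the line.separator system property, with "\n" as an assumed fallback.
    private static String lineSeparator = System.getProperty("line.separator", "\n");

    static String getLineSeparator() {
        return lineSeparator;
    }

    static void setLineSeparator(String lineSeparatorString) {
        lineSeparator = lineSeparatorString;
    }

    /** Joins lines with the current separator, as a toHTML()-style method might. */
    static String join(String... lines) {
        return String.join(lineSeparator, lines);
    }
}
```

Because the field is static, a change made through the setter affects every later caller in the same JVM.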


noFeedback

public static ParserFeedback noFeedback
A quiet message sink. Use this for no feedback.


stdout

public static ParserFeedback stdout
A verbose message sink. Use this for output on System.out.

Constructor Detail

Parser

public Parser()
Zero argument constructor. The parser is in a safe but useless state. Set the lexer or connection using setLexer() or setConnection().

See Also:
setLexer(Lexer), setConnection(URLConnection)

Parser

public Parser(Lexer lexer,
              ParserFeedback fb)
This constructor enables the construction of test cases, with readers associated with test string buffers. It can also be used with readers of the user's choice streaming data into the parser.

Important: If you are using this constructor, and you would like to use the parser to parse multiple times (multiple calls to parser.elements()), you must ensure the following:

Parameters:
lexer - The lexer to draw characters from.
fb - The object to use when information, warning and error messages are produced. If null no feedback is provided.

Parser

public Parser(URLConnection connection,
              ParserFeedback fb)
       throws ParserException
Constructor for custom HTTP access.

Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
fb - The object to use for message communication.

Parser

public Parser(String resourceLocn,
              ParserFeedback feedback)
       throws ParserException
Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.

Parameters:
resourceLocn - Either the URL or the filename (autodetects). A standard HTTP GET is performed to read the content of the URL.
feedback - The HTMLParserFeedback object to use when information, warning and error messages are produced. If null no feedback is provided.
See Also:
Parser(URLConnection,ParserFeedback)

Parser

public Parser(String resourceLocn)
       throws ParserException
Creates a Parser object with the location of the resource (URL or file). A DefaultHTMLParserFeedback object is used for feedback.

Parameters:
resourceLocn - Either the URL or the filename (autodetects).

Parser

public Parser(Lexer lexer)
This constructor is present to enable users to plug in their own lexers. A DefaultHTMLParserFeedback object is used for feedback. It can also be used with readers of the user's choice streaming data into the parser.

Important: If you are using this constructor, and you would like to use the parser to parse multiple times (multiple calls to parser.elements()), you must ensure the following:


Parser

public Parser(URLConnection connection)
       throws ParserException
Constructor for non-standard access. A DefaultHTMLParserFeedback object is used for feedback.

Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
See Also:
Parser(URLConnection,ParserFeedback)
Method Detail

setLineSeparator

public static void setLineSeparator(String lineSeparatorString)
Parameters:
lineSeparatorString - New Line separator to be used

getVersion

public static String getVersion()
Return the version string of this parser.

Returns:
A string of the form:
 "[floating point number] ([build-type] [build-date])"
 

getVersionNumber

public static double getVersionNumber()
Return the version number of this parser.

Returns:
A floating point number. The whole number part is the major version, and the fractional part is the minor version.
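
The string layout documented for getVersion() and the major/minor reading of getVersionNumber() can be sketched as follows. Only the layout comes from this page; the constant values and the helper names (VersionSketch, major, minorDigits) are assumptions made for illustration.

```java
import java.math.BigDecimal;

/** Hypothetical values and helpers; only the string layout comes from this page. */
final class VersionSketch {
    static final double VERSION_NUMBER = 1.4;                // assumed value
    static final String VERSION_TYPE = "Integration Build";  // assumed value
    static final String VERSION_DATE = "Mar 14, 2004";       // assumed value

    /** Builds "[floating point number] ([build-type] [build-date])". */
    static String versionString() {
        return VERSION_NUMBER + " (" + VERSION_TYPE + " " + VERSION_DATE + ")";
    }

    /** Whole-number part of the version, read as the major version. */
    static int major(double version) {
        return (int) version;
    }

    /**
     * Digits after the decimal point, read as the minor version.
     * Works on the decimal text to avoid binary floating point noise.
     */
    static String minorDigits(double version) {
        String text = BigDecimal.valueOf(version).toPlainString();
        int dot = text.indexOf('.');
        return dot < 0 ? "0" : text.substring(dot + 1);
    }
}
```

Subtracting the whole part from a double directly can yield values such as 0.3999..., which is why the sketch reads the minor digits from the printed form instead.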

getDefaultRequestProperties

public static Map getDefaultRequestProperties()
Get the current default request header properties. A String-to-String map of header keys and values. These fields are set by the parser when creating a connection.


setDefaultRequestProperties

public static void setDefaultRequestProperties(Map properties)
Set the default request header properties. A String-to-String map of header keys and values. These fields are set by the parser when creating a connection. Some of these can be set directly on a URLConnection, e.g. If-Modified-Since is set with setIfModifiedSince(long), but since the parser transparently opens the connection on behalf of the developer, these properties are not available before the connection is fetched. Setting these request header fields affects all subsequent connections opened by the parser. For more direct control, create a URLConnection and set it on the parser.

From RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1:

 5.3 Request Header Fields
 
    The request-header fields allow the client to pass additional
    information about the request, and about the client itself, to the
    server. These fields act as request modifiers, with semantics
    equivalent to the parameters on a programming language method
    invocation.
 
        request-header = Accept                   ; Section 14.1
                       | Accept-Charset           ; Section 14.2
                       | Accept-Encoding          ; Section 14.3
                       | Accept-Language          ; Section 14.4
                       | Authorization            ; Section 14.8
                       | Expect                   ; Section 14.20
                       | From                     ; Section 14.22
                       | Host                     ; Section 14.23
                       | If-Match                 ; Section 14.24
                       | If-Modified-Since        ; Section 14.25
                       | If-None-Match            ; Section 14.26
                       | If-Range                 ; Section 14.27
                       | If-Unmodified-Since      ; Section 14.28
                       | Max-Forwards             ; Section 14.31
                       | Proxy-Authorization      ; Section 14.34
                       | Range                    ; Section 14.35
                       | Referer                  ; Section 14.36
                       | TE                       ; Section 14.39
                       | User-Agent               ; Section 14.43
 
    Request-header field names can be extended reliably only in
    combination with a change in the protocol version. However, new or
    experimental header fields MAY be given the semantics of request-
    header fields if all parties in the communication recognize them to
    be request-header fields. Unrecognized header fields are treated as
    entity-header fields.
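
For the "more direct control" route mentioned above, a minimal sketch using only java.net is shown here. It applies a String-to-String header map to a URLConnection that has not yet been connected. The class name RequestHeaderSketch, the helper condition, and the header values are hypothetical; handing the result to the parser is left to the constructors and setConnection method listed on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Map;

/** Hypothetical helper; uses only JDK classes, not org.htmlparser. */
final class RequestHeaderSketch {
    /**
     * Opens (but does not connect) a connection and applies each header to it.
     * No network traffic happens until connect() or a read is attempted.
     */
    static URLConnection condition(String url, Map<String, String> headers, long ifModifiedSince) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            for (Map.Entry<String, String> header : headers.entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            // If-Modified-Since has its own setter on URLConnection (milliseconds since the epoch).
            connection.setIfModifiedSince(ifModifiedSince);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A connection prepared this way is what the parameter descriptions on this page call "fully conditioned"; since the parser calls connect() itself, the caller would presumably pass it to new Parser(connection) or setConnection(connection) without connecting first.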
 


setConnection

public void setConnection(URLConnection connection)
                   throws ParserException
Set the connection for this parser. This method creates a new Lexer reading from the connection. Trying to set the connection to null is a no-op.

Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
Throws:
ParserException - if the character set specified in the HTTP header is not supported, or an i/o exception occurs creating the lexer.
See Also:
setLexer(org.htmlparser.lexer.Lexer)

getConnection

public URLConnection getConnection()
Return the current connection.

Returns:
The connection either created by the parser or passed into this parser via setConnection.
See Also:
setConnection(URLConnection)

setURL

public void setURL(String url)
            throws ParserException
Set the URL for this parser. This method creates a new Lexer reading from the given URL. Trying to set the URL to null or an empty string is a no-op.

Throws:
ParserException
See Also:
setConnection(URLConnection)

getURL

public String getURL()
Return the current URL being parsed.

Returns:
The URL passed into the constructor, or the file name passed to the constructor converted to a URL.

setEncoding

public void setEncoding(String encoding)
                 throws ParserException
Set the encoding for the page this parser is reading from.

Parameters:
encoding - The new character set to use.
Throws:
ParserException

getEncoding

public String getEncoding()
Get the encoding for the page this parser is reading from. This item is set from the HTTP header but may be overridden by meta tags in the head, so this may change after the head has been parsed.
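
The precedence described above (HTTP header first, possibly overridden by a meta tag) can be sketched with JDK classes only. CharsetSketch and its methods are hypothetical helpers, and the ISO-8859-1 fallback is an assumption taken from the HTTP/1.1 default for text media types, not from this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Locale;

/** Hypothetical helpers for reasoning about page encodings; not the library's code. */
final class CharsetSketch {
    static final String DEFAULT = "ISO-8859-1"; // assumed fallback

    /** Pulls the charset parameter out of a Content-Type value, or returns the fallback. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT;
    }

    /** True if this JVM can decode the named charset; malformed names count as unsupported. */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    /** Header value first; a meta tag content value overrides it when it names a charset. */
    static String effective(String httpContentType, String metaContent) {
        String fromHeader = charsetOf(httpContentType);
        if (metaContent == null
                || !metaContent.toLowerCase(Locale.ROOT).contains("charset=")) {
            return fromHeader;
        }
        return charsetOf(metaContent);
    }
}
```

The isSupported check corresponds to the failure case documented for setConnection, where an unsupported character set in the HTTP header results in a ParserException.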


setLexer

public void setLexer(Lexer lexer)
Set the lexer for this parser. The current NodeFactory is set on the given lexer, since the lexer contains the node factory object. It does not adjust the feedback object. Trying to set the lexer to null is a no-op.

Parameters:
lexer - The lexer object to use.

getLexer

public Lexer getLexer()
Returns the lexer associated with the parser.

Returns:
The current lexer.

getNodeFactory

public NodeFactory getNodeFactory()
Get the current node factory.

Returns:
The parser's node factory.

setNodeFactory

public void setNodeFactory(NodeFactory factory)
Set the current node factory.

Parameters:
factory - The new node factory for the parser.

setFeedback

public void setFeedback(ParserFeedback fb)
Sets the feedback object used in scanning.

Parameters:
fb - The new feedback object to use.

getFeedback

public ParserFeedback getFeedback()
Returns the feedback.

Returns:
HTMLParserFeedback

reset

public void reset()
Reset the parser to start from the beginning again.


elements

public NodeIterator elements()
                      throws ParserException
Returns an iterator (enumeration) over the HTML nodes. Each node can be a tag, end tag, string, link, or image.
This is perhaps the most important method of this class. In typical situations, you will use the parser like this:
 Parser parser = new Parser("http://www.yahoo.com");
 // NodeIterator exposes hasMoreNodes() and nextNode()
 for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    if (node instanceof StringNode) {
      // Downcasting to StringNode
      StringNode stringNode = (StringNode)node;
      // Do whatever processing you want with the string node
      System.out.println(stringNode.getText());
    }
    // Check for the node or tag that you want
    if (node instanceof ...) {
      // Downcast, and process
      // recursively (nodes within nodes)
    }
 }
 

Throws:
ParserException

parse

public void parse(NodeFilter filter)
           throws ParserException
Parse the given resource, using the filter provided.

Parameters:
filter - The filter to apply to the parsed nodes.
Throws:
ParserException

openConnection

public static URLConnection openConnection(URL url,
                                           ParserFeedback feedback)
                                    throws ParserException
Opens a connection using the given url.

Parameters:
url - The url to open.
feedback - The object to use for messages or null.
Throws:
ParserException - if an i/o exception occurs accessing the url.

openConnection

public static URLConnection openConnection(String string,
                                           ParserFeedback feedback)
                                    throws ParserException
Opens a connection based on a given string. The string is either a file, in which case file://localhost is prepended to a canonical path derived from the string, or a URL that begins with one of the known protocol strings, e.g. http://. Embedded spaces are silently converted to %20 sequences.

Parameters:
string - The name of a file or a url.
feedback - The object to use for messages or null for no feedback.
Throws:
ParserException - if the string is not a valid url or file.
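
The resolution rule described for this method can be approximated with the sketch below. It is not the library's implementation: ResourceLocationSketch, toUrlString, and the list of protocol prefixes are assumptions (the page names only http://), and no connection is opened.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;

/** Hypothetical approximation of the file-or-URL rule; not the library's code. */
final class ResourceLocationSketch {
    // Assumed set of "known protocol strings"; the page itself names only http://.
    private static final String[] PROTOCOLS = { "http://", "https://", "ftp://", "file://" };

    /** Returns a URL string for a resource that is either a URL already or a file name. */
    static String toUrlString(String resource) {
        String result = null;
        String lower = resource.toLowerCase(Locale.ROOT);
        for (String protocol : PROTOCOLS) {
            if (lower.startsWith(protocol)) {
                result = resource; // already a URL, keep as given
                break;
            }
        }
        if (result == null) {
            try {
                // Treat as a file: prepend file://localhost to the canonical path.
                String path = new File(resource).getCanonicalPath()
                        .replace(File.separatorChar, '/');
                if (!path.startsWith("/")) {
                    path = "/" + path; // e.g. paths that begin with a drive letter
                }
                result = "file://localhost" + path;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        // Embedded spaces become %20 sequences.
        return result.replace(" ", "%20");
    }
}
```

For example, toUrlString("http://example.com/a b.html") yields "http://example.com/a%20b.html", while a bare file name is resolved against the current working directory.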

main

public static void main(String[] args)
The main program, which can be executed from the command line


visitAllNodesWith

public void visitAllNodesWith(NodeVisitor visitor)
                       throws ParserException
Throws:
ParserException

setInputHTML

public void setInputHTML(String inputHTML)
                  throws ParserException
Initializes the parser with the given input HTML String.

Parameters:
inputHTML - the input HTML that is to be parsed.
Throws:
ParserException

extractAllNodesThatMatch

public NodeList extractAllNodesThatMatch(NodeFilter filter)
                                  throws ParserException
Extract all nodes matching the given filter.

Throws:
ParserException
See Also:
Node.collectInto(NodeList, NodeFilter)

extractAllNodesThatAre

public Node[] extractAllNodesThatAre(Class nodeType)
                              throws ParserException
Convenience method to extract all nodes of a given class type.

Throws:
ParserException
See Also:
Node.collectInto(NodeList, NodeFilter)
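
The relationship between the two extract methods can be pictured with a self-contained sketch in which a class-type check is just one kind of filter. ExtractSketch is hypothetical and works on a flat list of plain objects with java.util.function.Predicate standing in for NodeFilter; the library's own traversal, which the See Also entry suggests goes through Node.collectInto, is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for filter-based extraction; not the library's code. */
final class ExtractSketch {
    /** Collects every node the filter accepts, preserving the input order. */
    static List<Object> extractAllThatMatch(List<Object> nodes, Predicate<Object> filter) {
        List<Object> matches = new ArrayList<>();
        for (Object node : nodes) {
            if (filter.test(node)) {
                matches.add(node);
            }
        }
        return matches;
    }

    /** Convenience form: a class-type check expressed as a filter. */
    static Object[] extractAllThatAre(List<Object> nodes, Class<?> nodeType) {
        return extractAllThatMatch(nodes, nodeType::isInstance).toArray();
    }
}
```

The sketch does not descend into child nodes, so it only illustrates the matching step, not a full document walk.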

createParser

public static Parser createParser(String inputHTML)
Creates the parser on an input string.

Parameters:
inputHTML - The input HTML that is to be parsed.
Returns:
Parser

getLineSeparator

public static String getLineSeparator()
Returns:
String lineSeparator that will be used in toHTML()

© 2004 Somik Raha
Mar 14, 2004

HTML Parser is an open source library released under LGPL.