HTML Parser Home Page
java.lang.Object
  org.htmlparser.Parser
This is the class that users work with, either to get an iterator over the nodes of an HTML page
or to parse the page directly and print the results.
Typical usage of the parser is as follows:
[1] Create a parser object, passing the URL and a feedback object to the parser.
[2] Enumerate through the elements from the parser object.
Note that parsing occurs ON DEMAND, as you enumerate.
This approach is thread-safe, and control returns to you only after a
particular element has been parsed and returned, which could be the entire body.
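The on-demand behaviour described above can be pictured with a toy iterator. This is only a sketch using the JDK: the class LazyTokenIterator and its rule for splitting tags from text are hypothetical and are not the library's implementation. It shows that no token exists until the caller asks for the next one.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy illustration (hypothetical): tokens are produced only when requested. */
final class LazyTokenIterator implements Iterator<String> {
    private final String html;
    private int pos = 0;          // index of the next unread character
    private int tokensParsed = 0; // number of tokens produced so far

    LazyTokenIterator(String html) {
        this.html = html;
    }

    @Override
    public boolean hasNext() {
        return pos < html.length();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end;
        if (html.charAt(pos) == '<') {
            // A tag runs up to and including the next '>' (or to end of input).
            int close = html.indexOf('>', pos);
            end = (close < 0) ? html.length() : close + 1;
        } else {
            // Text runs up to the next '<' (or to end of input).
            int open = html.indexOf('<', pos);
            end = (open < 0) ? html.length() : open;
        }
        String token = html.substring(pos, end);
        pos = end;
        tokensParsed++;
        return token;
    }

    /** Reports how much work has been done so far. */
    int tokensParsed() {
        return tokensParsed;
    }

    public static void main(String[] args) {
        LazyTokenIterator it = new LazyTokenIterator("<p>Hello</p>");
        List<String> seen = new ArrayList<>();
        while (it.hasNext()) {
            seen.add(it.next()); // each call scans just far enough for one token
        }
        System.out.println(seen);
    }
}
```

Before the first call to next() the counter is zero, and it grows by one per call, which mirrors the way control returns to the caller after each element is parsed.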
See Also: elements(), Serialized Form
Field Summary
protected static String | lineSeparator | Variable to store lineSeparator.
protected static Map | mDefaultRequestProperties | Default Request header fields.
protected ParserFeedback | mFeedback | Feedback object.
protected Lexer | mLexer | The html lexer associated with this parser.
static ParserFeedback | noFeedback | A quiet message sink.
static ParserFeedback | stdout | A verbose message sink.
static String | VERSION_DATE | The date of the version.
static double | VERSION_NUMBER | The floating point version number.
static String | VERSION_STRING | The display version.
static String | VERSION_TYPE | The type of version.
Constructor Summary
Parser() | Zero argument constructor.
Parser(Lexer lexer) | This constructor is present to enable users to plug in their own lexers.
Parser(Lexer lexer, ParserFeedback fb) | This constructor enables the construction of test cases, with readers associated with test string buffers.
Parser(String resourceLocn) | Creates a Parser object with the location of the resource (URL or file).
Parser(String resourceLocn, ParserFeedback feedback) | Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
Parser(URLConnection connection) | Constructor for non-standard access.
Parser(URLConnection connection, ParserFeedback fb) | Constructor for custom HTTP access.
Method Summary
static Parser | createParser(String inputHTML) | Creates the parser on an input string.
NodeIterator | elements() | Returns an iterator (enumeration) to the html nodes.
Node[] | extractAllNodesThatAre(Class nodeType) | Convenience method to extract all nodes of a given class type.
NodeList | extractAllNodesThatMatch(NodeFilter filter) | Extract all nodes matching the given filter.
URLConnection | getConnection() | Return the current connection.
static Map | getDefaultRequestProperties() | Get the current default request header properties.
String | getEncoding() | Get the encoding for the page this parser is reading from.
ParserFeedback | getFeedback() | Returns the feedback.
Lexer | getLexer() | Returns the lexer associated with the parser.
static String | getLineSeparator() |
NodeFactory | getNodeFactory() | Get the current node factory.
String | getURL() | Return the current URL being parsed.
static String | getVersion() | Return the version string of this parser.
static double | getVersionNumber() | Return the version number of this parser.
static void | main(String[] args) | The main program, which can be executed from the command line.
static URLConnection | openConnection(String string, ParserFeedback feedback) | Opens a connection based on a given string.
static URLConnection | openConnection(URL url, ParserFeedback feedback) | Opens a connection using the given url.
void | parse(NodeFilter filter) | Parse the given resource, using the filter provided.
void | reset() | Reset the parser to start from the beginning again.
void | setConnection(URLConnection connection) | Set the connection for this parser.
static void | setDefaultRequestProperties(Map properties) | Set the default request header properties.
void | setEncoding(String encoding) | Set the encoding for the page this parser is reading from.
void | setFeedback(ParserFeedback fb) | Sets the feedback object used in scanning.
void | setInputHTML(String inputHTML) | Initializes the parser with the given input HTML String.
void | setLexer(Lexer lexer) | Set the lexer for this parser.
static void | setLineSeparator(String lineSeparatorString) |
void | setNodeFactory(NodeFactory factory) | Set the current node factory.
void | setURL(String url) | Set the URL for this parser.
void | visitAllNodesWith(NodeVisitor visitor) |
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final double VERSION_NUMBER
public static final String VERSION_TYPE
public static final String VERSION_DATE
public static final String VERSION_STRING
protected static Map mDefaultRequestProperties
protected ParserFeedback mFeedback
protected Lexer mLexer
protected static String lineSeparator
Variable to store lineSeparator, initialised from the line.separator System property.
It can also be changed using the mutator methods.
It is used in the toHTML() methods of all subclasses of Node.
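As a rough illustration of the field described above, the following sketch keeps a static separator that defaults to the platform value and can be overridden. The class name LineSeparatorHolder is hypothetical; only the accessor names follow the method summary.

```java
/** Hypothetical holder that mirrors the field description above. */
final class LineSeparatorHolder {
    // Default taken from the line.separator system property ("\n" if it is unset).
    private static String lineSeparator = System.getProperty("line.separator", "\n");

    static String getLineSeparator() {
        return lineSeparator;
    }

    static void setLineSeparator(String lineSeparatorString) {
        lineSeparator = lineSeparatorString;
    }

    public static void main(String[] args) {
        // Force CRLF regardless of platform, then join two lines with it.
        setLineSeparator("\r\n");
        String joined = "<html>" + getLineSeparator() + "</html>";
        System.out.println(joined.length());
    }
}
```

Because the field is static, a change made through the mutator affects every caller in the same JVM.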
public static ParserFeedback noFeedback
public static ParserFeedback stdout
A verbose message sink that writes to System.out.
Constructor Detail
public Parser()
See Also: setLexer(Lexer), setConnection(URLConnection)
public Parser(Lexer lexer,
ParserFeedback fb)
parser.getReader().reset();
Parameters:
lexer - The lexer to draw characters from.
fb - The object to use when information, warning and error messages are produced. If null no feedback is provided.
public Parser(URLConnection connection,
ParserFeedback fb)
throws ParserException
Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
fb - The object to use for message communication.
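What "fully conditioned" might look like in practice can be sketched with the standard java.net API alone. The class name ConditionedConnection, the User-Agent value and the timeout values below are illustrative assumptions. The connection is configured but deliberately left unconnected, since the parameter description says connect() is called later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper: prepares a URLConnection without connecting it. */
final class ConditionedConnection {
    static URLConnection condition(String location, long ifModifiedSinceMillis) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            URLConnection connection = URI.create(location).toURL().openConnection();
            connection.setRequestProperty("User-Agent", "ExampleAgent/1.0"); // assumed value
            connection.setIfModifiedSince(ifModifiedSinceMillis);
            connection.setConnectTimeout(5_000); // milliseconds, assumed value
            connection.setReadTimeout(10_000);   // milliseconds, assumed value
            return connection;                   // connect() is intentionally not called
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = condition("http://example.com/", 0L);
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

A connection prepared this way is the kind of object the connection parameter above expects; all request settings have to be made before the connection is opened.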
public Parser(String resourceLocn,
ParserFeedback feedback)
throws ParserException
Parameters:
resourceLocn - Either the URL or the filename (autodetects). A standard HTTP GET is performed to read the content of the URL.
feedback - The ParserFeedback object to use when information, warning and error messages are produced. If null no feedback is provided.
See Also: Parser(URLConnection,ParserFeedback)
public Parser(String resourceLocn)
throws ParserException
Parameters:
resourceLocn - Either the URL or the filename (autodetects).
public Parser(Lexer lexer)
parser.getReader().reset();
Parameters:
lexer - The source for HTML to be parsed.
public Parser(URLConnection connection)
throws ParserException
Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
See Also: Parser(URLConnection,ParserFeedback)
Method Detail
public static void setLineSeparator(String lineSeparatorString)
Parameters:
lineSeparatorString - New line separator to be used.
public static String getVersion()
Return the version string of this parser, in the form:
"[floating point number] ([build-type] [build-date])"
public static double getVersionNumber()
public static Map getDefaultRequestProperties()
public static void setDefaultRequestProperties(Map properties)
Set the default request header properties.
Some of these fields can be set directly on a URLConnection,
e.g. If-Modified-Since is set with setIfModifiedSince(long),
but since the parser transparently opens the connection on behalf
of the developer, these properties are not available before the
connection is fetched. Setting these request header fields affects all
subsequent connections opened by the parser. For more direct control
create a URLConnection and set it on the parser.
From RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1:
5.3 Request Header Fields
The request-header fields allow the client to pass additional
information about the request, and about the client itself, to the
server. These fields act as request modifiers, with semantics
equivalent to the parameters on a programming language method
invocation.
request-header = Accept ; Section 14.1
| Accept-Charset ; Section 14.2
| Accept-Encoding ; Section 14.3
| Accept-Language ; Section 14.4
| Authorization ; Section 14.8
| Expect ; Section 14.20
| From ; Section 14.22
| Host ; Section 14.23
| If-Match ; Section 14.24
| If-Modified-Since ; Section 14.25
| If-None-Match ; Section 14.26
| If-Range ; Section 14.27
| If-Unmodified-Since ; Section 14.28
| Max-Forwards ; Section 14.31
| Proxy-Authorization ; Section 14.34
| Range ; Section 14.35
| Referer ; Section 14.36
| TE ; Section 14.39
| User-Agent ; Section 14.43
Request-header field names can be extended reliably only in
combination with a change in the protocol version. However, new or
experimental header fields MAY be given the semantics of request-
header fields if all parties in the communication recognize them to
be request-header fields. Unrecognized header fields are treated as
entity-header fields.
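The idea of default request header fields can be sketched with plain java.net classes. The helper below is hypothetical and is not how the parser itself is implemented: it takes a String-to-String map of header names and values and applies every entry to a freshly opened, not yet connected, URLConnection. The header values are placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: applies a map of default request headers to new connections. */
final class DefaultRequestHeaders {
    static URLConnection open(String location, Map<String, String> defaults) {
        try {
            URLConnection connection = URI.create(location).toURL().openConnection();
            // Copy every default header onto the connection before it is connected.
            for (Map.Entry<String, String> header : defaults.entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("User-Agent", "ExampleAgent/1.0"); // placeholder value
        defaults.put("Accept-Language", "en");          // placeholder value
        URLConnection connection = open("http://example.com/", defaults);
        System.out.println(connection.getRequestProperty("Accept-Language"));
    }
}
```

Because the same map is reused for each call, every connection opened through the helper carries the same headers, which is the effect the description above attributes to the default properties.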
public void setConnection(URLConnection connection)
throws ParserException
Set the connection for this parser. This creates a new Lexer reading from the connection.
Trying to set the connection to null is a no-op.
Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
Throws:
ParserException - if the character set specified in the HTTP header is not supported, or an I/O exception occurs creating the lexer.
See Also: setLexer(org.htmlparser.lexer.Lexer)
public URLConnection getConnection()
Return the current connection.
See Also: setConnection(URLConnection)
public void setURL(String url)
throws ParserException
Throws:
ParserException
See Also: setConnection(URLConnection)
public String getURL()
public void setEncoding(String encoding)
throws ParserException
Parameters:
encoding - The new character set to use.
Throws:
ParserException
public String getEncoding()
public void setLexer(Lexer lexer)
feedback object.
Trying to set the lexer to null is a no-op.
Parameters:
lexer - The lexer object to use.
public Lexer getLexer()
public NodeFactory getNodeFactory()
public void setNodeFactory(NodeFactory factory)
public void setFeedback(ParserFeedback fb)
Parameters:
fb - The new feedback object to use.
public ParserFeedback getFeedback()
public void reset()
public NodeIterator elements()
throws ParserException
Parser parser = new Parser("http://www.yahoo.com");
// Nodes are parsed one at a time, as the loop advances
for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
    Node node = i.nextNode();
    if (node instanceof StringNode) {
        // Downcasting to StringNode
        StringNode stringNode = (StringNode) node;
        // Do whatever processing you want with the string node
        System.out.println(stringNode.getText());
    }
    // Check for the node or tag that you want
    if (node instanceof ...) {
        // Downcast, and process
        // recursively (nodes within nodes)
    }
}
Throws:
ParserException
public void parse(NodeFilter filter)
throws ParserException
Parameters:
filter - The filter to apply to the parsed nodes.
Throws:
ParserException
public static URLConnection openConnection(URL url,
ParserFeedback feedback)
throws ParserException
Parameters:
url - The url to open.
feedback - The object to use for messages or null.
Throws:
ParserException - if an I/O exception occurs accessing the url.
public static URLConnection openConnection(String string,
ParserFeedback feedback)
throws ParserException
Opens a connection based on a given string. The string is either a file, in which case file://localhost
is prepended to a canonical path derived from the string, or a url that
begins with one of the known protocol strings, e.g. http://.
Embedded spaces are silently converted to %20 sequences.
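The resolution rule just described can be sketched as follows. This is an approximation under stated assumptions, not the library's code: the class LocationResolver and the list of known protocol prefixes are hypothetical, and the leading-slash adjustment for drive-letter paths is a guess at reasonable behaviour.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;

/** Hypothetical helper: turns a file name or URL string into a URL string. */
final class LocationResolver {
    // Assumed list; the documentation only names http:// as an example.
    private static final String[] KNOWN_PROTOCOLS = {"http://", "https://", "ftp://", "file://"};

    static String resolve(String location) {
        String lower = location.toLowerCase(Locale.ROOT);
        String resolved = null;
        for (String protocol : KNOWN_PROTOCOLS) {
            if (lower.startsWith(protocol)) {
                resolved = location; // already a URL, use it unchanged
                break;
            }
        }
        if (resolved == null) {
            try {
                // Treat the string as a file name and derive its canonical path.
                String path = new File(location).getCanonicalPath()
                        .replace(File.separatorChar, '/');
                if (!path.startsWith("/")) {
                    path = "/" + path; // e.g. paths that start with a drive letter
                }
                resolved = "file://localhost" + path;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        // Embedded spaces become %20 sequences.
        return resolved.replace(" ", "%20");
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/my page.html"));
        System.out.println(resolve("my page.html"));
    }
}
```

The second call prints a file://localhost URL whose path depends on the current working directory.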
Parameters:
string - The name of a file or a url.
feedback - The object to use for messages or null for no feedback.
Throws:
ParserException - if the string is not a valid url or file.
public static void main(String[] args)
public void visitAllNodesWith(NodeVisitor visitor)
throws ParserException
Throws:
ParserException
public void setInputHTML(String inputHTML)
throws ParserException
Parameters:
inputHTML - the input HTML that is to be parsed.
Throws:
ParserException
public NodeList extractAllNodesThatMatch(NodeFilter filter)
throws ParserException
Throws:
ParserException
See Also: Node.collectInto(NodeList, NodeFilter)
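Filter-based extraction over a node tree can be illustrated with a small stand-in structure. ToyNode below is hypothetical and uses java.util.function.Predicate in place of the library's NodeFilter. It borrows only the collectInto name from the cross-reference above and makes no claim about how the library implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a parsed node with children. */
final class ToyNode {
    final String name;
    final List<ToyNode> children;

    ToyNode(String name, ToyNode... children) {
        this.name = name;
        this.children = new ArrayList<>(Arrays.asList(children));
    }

    /** Adds this node, then any descendants, to matches when the filter accepts them. */
    void collectInto(List<ToyNode> matches, Predicate<ToyNode> filter) {
        if (filter.test(this)) {
            matches.add(this);
        }
        for (ToyNode child : children) {
            child.collectInto(matches, filter); // depth-first, document order
        }
    }

    /** Convenience wrapper that returns a fresh list of matches. */
    static List<ToyNode> extractAllThatMatch(ToyNode root, Predicate<ToyNode> filter) {
        List<ToyNode> matches = new ArrayList<>();
        root.collectInto(matches, filter);
        return matches;
    }

    public static void main(String[] args) {
        ToyNode root = new ToyNode("html",
                new ToyNode("body",
                        new ToyNode("a"),
                        new ToyNode("p", new ToyNode("a"))));
        List<ToyNode> links = extractAllThatMatch(root, node -> node.name.equals("a"));
        System.out.println(links.size());
    }
}
```

The recursion visits nested nodes as well as top-level ones, so a match inside another element is still collected.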
public Node[] extractAllNodesThatAre(Class nodeType)
throws ParserException
Throws:
ParserException
See Also: Node.collectInto(NodeList, NodeFilter)
public static Parser createParser(String inputHTML)
Parameters:
inputHTML -
public static String getLineSeparator()
© 2004 Somik Raha Mar 14, 2004