org.htmlparser.lexer
Class Page

java.lang.Object
  extended by org.htmlparser.lexer.Page
All Implemented Interfaces:
Serializable

public class Page
extends Object
implements Serializable

Represents the contents of an HTML page. Contains the source of characters and an index of positions of line separators (actually the first character position on the next line).

See Also:
Serialized Form

Field Summary
static String DEFAULT_CHARSET
          The default charset.
static String DEFAULT_CONTENT_TYPE
          The default content type.
protected  URLConnection mConnection
          The connection this page is coming from or null.
protected  PageIndex mIndex
          Character positions of the first character in each line.
protected  LinkProcessor mProcessor
          The processor of relative links on this page.
protected  Source mSource
          The source of characters.
protected  String mUrl
          The URL this page is coming from.
 
Constructor Summary
Page()
          Construct an empty page.
Page(InputStream stream, String charset)
          Construct a page from a stream encoded with the given charset.
Page(String text)
          Construct a page from the given string.
Page(URLConnection connection)
          Construct a page reading from a URL connection.
 
Method Summary
 int column(Cursor cursor)
          Get the column number for a cursor.
 int column(int position)
          Get the column number for a character position.
 String findCharset(String name, String _default)
          Look up a character set name.
 char getCharacter(Cursor cursor)
          Read the character at the cursor position.
 String getCharset(String content)
          Get a CharacterSet name corresponding to a charset parameter.
 URLConnection getConnection()
          Get the connection, if any.
 String getContentType()
          Try to extract the content type from the HTTP header.
 String getEncoding()
          Get the current encoding being used.
 String getLine(Cursor cursor)
          Get the text line the position of the cursor lies on.
 String getLine(int position)
          Get the text line the given position lies on.
 LinkProcessor getLinkProcessor()
          Get the link processor associated with this page.
 Source getSource()
          Get the source this page is reading from.
 String getText()
          Get all text read so far from the source.
 String getText(int start, int end)
          Get the text identified by the given limits.
 void getText(StringBuffer buffer)
          Put all text read so far from the source into the given buffer.
 void getText(StringBuffer buffer, int start, int end)
          Put the text identified by the given limits into the given buffer.
 String getUrl()
          Get the URL for this page.
 void reset()
          Reset the page by resetting the source of characters.
 int row(Cursor cursor)
          Get the line number for a cursor.
 int row(int position)
          Get the line number for a character position.
 void setConnection(URLConnection connection)
          Set the URLConnection to be used by this page.
 void setEncoding(String character_set)
          Begins reading from the source with the given character set.
 void setLinkProcessor(LinkProcessor processor)
          Set the link processor associated with this page.
 void setUrl(String url)
          Set the URL for this page.
 String toString()
          Display some of this page as a string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_CHARSET

public static final String DEFAULT_CHARSET
The default charset. This should be ISO-8859-1; see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616), section 3.7.1. Another alias is "8859_1".

See Also:
Constant Field Values

DEFAULT_CONTENT_TYPE

public static final String DEFAULT_CONTENT_TYPE
The default content type. In the absence of alternate information, assume html content.

See Also:
Constant Field Values

mUrl

protected String mUrl
The URL this page is coming from. Cached value of getConnection().getURL().toExternalForm() or of the value passed to setUrl().


mSource

protected Source mSource
The source of characters.


mIndex

protected PageIndex mIndex
Character positions of the first character in each line.


mConnection

protected transient URLConnection mConnection
The connection this page is coming from or null.


mProcessor

protected LinkProcessor mProcessor
The processor of relative links on this page. Holds any overridden base HREF.

Constructor Detail

Page

public Page()
Construct an empty page.


Page

public Page(URLConnection connection)
     throws ParserException
Construct a page reading from a URL connection.

Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
Throws:
ParserException - An exception object wrapping a number of possible error conditions, some of which are outlined below.
  • IOException - If an I/O exception occurs creating the source.
  • UnsupportedEncodingException - If the character set specified in the HTTP header is not supported.

    Page

    public Page(InputStream stream,
                String charset)
         throws UnsupportedEncodingException
    Construct a page from a stream encoded with the given charset.

    Parameters:
    stream - The source of bytes.
    charset - The encoding used. If null, defaults to the DEFAULT_CHARSET.
    Throws:
    UnsupportedEncodingException - If the given charset is not supported.

    Page

    public Page(String text)
    Construct a page from the given string.

    Method Detail

    reset

    public void reset()
    Reset the page by resetting the source of characters.


    getConnection

    public URLConnection getConnection()
    Get the connection, if any.

    Returns:
    The connection object for this page, or null if this page is built from a stream or a string.

    setConnection

    public void setConnection(URLConnection connection)
                       throws ParserException
    Set the URLConnection to be used by this page. Starts reading from the given connection. This also resets the current url.

    Parameters:
    connection - The connection to use. It will be connected by this method.
    Throws:
    ParserException - If the connect() method fails, or an I/O error occurs opening the input stream or the character set designated in the HTTP header is unsupported.

    getUrl

    public String getUrl()
    Get the URL for this page. This is only available if the page has a connection (getConnection() returns non-null), or the document base has been set via a call to setUrl().

    Returns:
    The URL for the connection, or null if there is no connection or the document base has not been set.

    setUrl

    public void setUrl(String url)
    Set the URL for this page. This doesn't affect the contents of the page, just the interpretation of relative links from this point forward.

    Parameters:
    url - The new URL.
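
The effect of changing the base URL on relative links can be illustrated with a small standalone sketch. It uses java.net.URI rather than the library's LinkProcessor, and the class BaseUrlSketch with its resolve() method is hypothetical, introduced here only for illustration.

```java
import java.net.URI;

/** Hypothetical holder for a document base; not the library's LinkProcessor. */
final class BaseUrlSketch {
    private URI base; // null until a base URL has been set

    /** Change the base; page contents are untouched, only later resolution changes. */
    void setUrl(String url) {
        base = URI.create(url);
    }

    /** Resolve a link against the current base, or return it unchanged if no base is set. */
    String resolve(String link) {
        return base == null ? link : base.resolve(link).toString();
    }

    public static void main(String[] args) {
        BaseUrlSketch page = new BaseUrlSketch();
        page.setUrl("http://example.com/a/b.html");
        System.out.println(page.resolve("c.html"));
        page.setUrl("http://example.com/x/");
        System.out.println(page.resolve("c.html"));
    }
}
```

The same relative link resolves differently before and after the second setUrl() call, which mirrors the "from this point forward" wording above.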

    getSource

    public Source getSource()
    Get the source this page is reading from.


    getContentType

    public String getContentType()
    Try to extract the content type from the HTTP header.

    Returns:
    The content type.

    getCharacter

    public char getCharacter(Cursor cursor)
                      throws ParserException
    Read the character at the cursor position. The cursor position can be behind or equal to the current source position. Returns end of lines (EOL) as \n, by converting \r and \r\n to \n, and updates the end-of-line index accordingly. Advances the cursor position by one (or two in the \r\n case).

    Parameters:
    cursor - The position to read at.
    Returns:
    The character at that position, and modifies the cursor to prepare for the next read. If the source is exhausted a zero is returned.
    Throws:
    ParserException - If an IOException on the underlying source occurs, or an attempt is made to read characters in the future (the cursor position is ahead of the underlying stream).
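
The end-of-line handling described for getCharacter can be sketched as follows. This is a minimal standalone illustration, not the library's code: EolReaderSketch and its integer position are hypothetical stand-ins for Page and Cursor, the input is a plain String instead of a Source, and the update of the line index is omitted.

```java
/** Hypothetical reader that folds \r and \r\n into \n; not the library's Page. */
final class EolReaderSketch {
    private final String text;
    private int position; // plays the role of the cursor position

    EolReaderSketch(String text) {
        this.text = text;
    }

    int position() {
        return position;
    }

    /** Return the next character, reporting every end of line as '\n', or 0 when exhausted. */
    char next() {
        if (position >= text.length()) {
            return (char) 0; // source exhausted
        }
        char ch = text.charAt(position++);
        if (ch == '\r') {
            // Swallow the '\n' of a "\r\n" pair so the position advances by two.
            if (position < text.length() && text.charAt(position) == '\n') {
                position++;
            }
            ch = '\n';
        }
        return ch;
    }

    public static void main(String[] args) {
        EolReaderSketch reader = new EolReaderSketch("a\r\nb\rc");
        for (char ch = reader.next(); ch != 0; ch = reader.next()) {
            System.out.println((int) ch + " -> position " + reader.position());
        }
    }
}
```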

    getCharset

    public String getCharset(String content)
    Get a CharacterSet name corresponding to a charset parameter.

    Parameters:
    content - A text line of the form:
     text/html; charset=Shift_JIS
     
    which is applicable both to the HTTP header field Content-Type and the meta tag http-equiv="Content-Type". Note this method also handles non-compliant quoted charset directives such as:
     text/html; charset="UTF-8"
     
    and
     text/html; charset='UTF-8'
     
    Returns:
    The character set name to use when reading the input stream. For JDKs that have the Charset class this is qualified by passing the name to findCharset() to render it into canonical form. If the charset parameter is not found in the given string, the default character set is returned.
    See Also:
    findCharset(java.lang.String, java.lang.String), DEFAULT_CHARSET
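
One way to extract the charset parameter from such a line, including the quoted variants, is sketched below. The class CharsetParamSketch, the method charsetOf() and the regular expression are assumptions made for illustration; the library's own parsing may differ, and the canonicalization step through findCharset() is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the charset parameter of a Content-Type value. */
final class CharsetParamSketch {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Matches charset=value, charset="value" and charset='value', ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^;\\s\"']+))",
            Pattern.CASE_INSENSITIVE);

    /** Return the charset named in the content line, or the default if none is present. */
    static String charsetOf(String content) {
        if (content == null) {
            return DEFAULT_CHARSET;
        }
        Matcher matcher = CHARSET.matcher(content);
        if (!matcher.find()) {
            return DEFAULT_CHARSET;
        }
        // Exactly one of the three alternatives captured the value.
        for (int group = 1; group <= 3; group++) {
            String value = matcher.group(group);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return DEFAULT_CHARSET; // e.g. charset="" carries no usable name
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=Shift_JIS"));
        System.out.println(charsetOf("text/html; charset=\"UTF-8\""));
        System.out.println(charsetOf("text/html"));
    }
}
```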

    findCharset

    public String findCharset(String name,
                              String _default)
    Look up a character set name. Vacuous for JVMs without java.nio.charset. This uses reflection so the code will still run under prior JDKs, but in that case the default is always returned.

    Parameters:
    name - The name to look up. One of the aliases for a character set.
    _default - The name to return if the lookup fails.
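
On a JVM that does have java.nio.charset, the lookup amounts to something like the sketch below. It calls Charset.forName() directly instead of going through reflection as the description above says the library does, and the class name FindCharsetSketch is hypothetical.

```java
import java.nio.charset.Charset;

/** Hypothetical direct (non-reflective) version of the alias lookup. */
final class FindCharsetSketch {
    /** Return the canonical name for an alias, or the fallback if the lookup fails. */
    static String findCharset(String name, String fallback) {
        try {
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers null, illegal and unsupported names; the two charset
            // exceptions thrown by forName are subclasses of this type.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(findCharset("latin1", "fallback"));
        System.out.println(findCharset("no-such-charset", "fallback"));
    }
}
```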

    getEncoding

    public String getEncoding()
    Get the current encoding being used.

    Returns:
    The encoding used to convert characters.

    setEncoding

    public void setEncoding(String character_set)
                     throws ParserException
    Begins reading from the source with the given character set. If the current encoding is the same as the requested encoding, this method is a no-op. Otherwise any subsequent characters read from this page will have been decoded using the given character set.

    Some magic happens here to obtain this result if characters have already been consumed from this page. Since a Reader cannot be dynamically altered to use a different character set, the underlying stream is reset, a new Source is constructed and a comparison made of the characters read so far with the newly read characters up to the current position. If a difference is encountered, or some other problem occurs, an exception is thrown.

    Parameters:
    character_set - The character set to use to convert bytes into characters.
    Throws:
    ParserException - If a character mismatch occurs between characters already provided and those that would have been returned had the new character set been in effect from the beginning. An exception is also thrown if the underlying stream cannot be reset and re-read.
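
The comparison step described for setEncoding can be sketched in isolation. This is an assumption-laden illustration, not the library's implementation: ReencodeCheckSketch and redecode() are hypothetical, the whole input is held in a byte array instead of a resettable stream, and IllegalStateException stands in for ParserException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical check that switching charsets does not change characters already consumed. */
final class ReencodeCheckSketch {
    /**
     * Decode the bytes again with the new charset and verify that the first
     * `consumed` characters equal what the old charset produced.
     */
    static String redecode(byte[] bytes, Charset oldCharset, Charset newCharset, int consumed) {
        String before = new String(bytes, oldCharset);
        String after = new String(bytes, newCharset);
        // regionMatches is false on any differing character or if either string is too short.
        if (!before.regionMatches(0, after, 0, consumed)) {
            throw new IllegalStateException("character mismatch after changing encoding");
        }
        return after;
    }

    public static void main(String[] args) {
        byte[] bytes = "<html>caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Only the six ASCII characters of "<html>" were consumed, so the switch is safe.
        System.out.println(redecode(bytes, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, 6));
    }
}
```

If the consumed region had already included the non-ASCII character, the two decodings would disagree and the switch would be rejected, which is the mismatch case described above.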

    getLinkProcessor

    public LinkProcessor getLinkProcessor()
    Get the link processor associated with this page.

    Returns:
    The link processor that has the base HREF.

    setLinkProcessor

    public void setLinkProcessor(LinkProcessor processor)
    Set the link processor associated with this page.

    Parameters:
    processor - The new link processor for this page.

    row

    public int row(Cursor cursor)
    Get the line number for a cursor.

    Parameters:
    cursor - The character offset into the page.
    Returns:
    The line number the character is in.

    row

    public int row(int position)
    Get the line number for a character position.

    Parameters:
    position - The character offset into the page.
    Returns:
    The line number the character is in.

    column

    public int column(Cursor cursor)
    Get the column number for a cursor.

    Parameters:
    cursor - The character offset into the page.
    Returns:
    The character offset into the line this cursor is on.

    column

    public int column(int position)
    Get the column number for a character position.

    Parameters:
    position - The character offset into the page.
    Returns:
    The character offset into the line this cursor is on.
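
The row and column lookups follow naturally from an index of line-start positions. The sketch below is a hypothetical stand-in for that idea, not the library's PageIndex: it builds the index from a complete String, assumes only \n line endings, and assumes zero-based rows and columns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical line index: sorted positions of the first character of each line after the first. */
final class LineIndexSketch {
    private final List<Integer> lineStarts = new ArrayList<>();

    /** Scan the text once and record where each new line begins. */
    LineIndexSketch(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lineStarts.add(i + 1);
            }
        }
    }

    /** Zero-based line number containing the given character offset. */
    int row(int position) {
        int found = Collections.binarySearch(lineStarts, position);
        // A hit at index k means the position starts line k + 1; a miss encodes
        // the number of line starts below the position as -(count) - 1.
        return found >= 0 ? found + 1 : -found - 1;
    }

    /** Zero-based offset of the position within its line. */
    int column(int position) {
        int row = row(position);
        int start = row == 0 ? 0 : lineStarts.get(row - 1);
        return position - start;
    }

    public static void main(String[] args) {
        LineIndexSketch index = new LineIndexSketch("ab\ncd\n\nef");
        System.out.println(index.row(4) + ":" + index.column(4));
    }
}
```

Keeping the line starts sorted makes each lookup a binary search rather than a scan of the text.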

    getText

    public String getText(int start,
                          int end)
    Get the text identified by the given limits.

    Parameters:
    start - The starting position, zero based.
    end - The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
    Returns:
    The text from start to end.
    Throws:
    IllegalArgumentException - If an attempt is made to get characters ahead of the current source offset (character position).
    See Also:
    getText(StringBuffer, int, int)

    getText

    public void getText(StringBuffer buffer,
                        int start,
                        int end)
    Put the text identified by the given limits into the given buffer.

    Parameters:
    buffer - The accumulator for the characters.
    start - The starting position, zero based.
    end - The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
    Throws:
    IllegalArgumentException - If an attempt is made to get characters ahead of the current source offset (character position).

    getText

    public String getText()
    Get all text read so far from the source.

    Returns:
    The text from the source.
    See Also:
    getText(StringBuffer)

    getText

    public void getText(StringBuffer buffer)
    Put all text read so far from the source into the given buffer.

    Parameters:
    buffer - The accumulator for the characters.
    See Also:
    getText(StringBuffer,int,int)

    getLine

    public String getLine(Cursor cursor)
    Get the text line the position of the cursor lies on.

    Parameters:
    cursor - The position to calculate for.
    Returns:
    The contents of the URL or file corresponding to the line number containing the cursor position.

    getLine

    public String getLine(int position)
    Get the text line the given position lies on.

    Parameters:
    position - The position to calculate for.
    Returns:
    The contents of the URL or file corresponding to the line number containing the given position.

    toString

    public String toString()
    Display some of this page as a string.

    Returns:
    The last few characters the source read in.

    © 2004 Somik Raha
    Mar 14, 2004

    HTML Parser is an open source library released under LGPL.