Class Parser
- All Implemented Interfaces:
Serializable, ConnectionMonitor
Constructors accept a String, a URLConnection, or a Lexer. In the case of a String,
a check is made to see if the first non-whitespace character is a <, in
which case it is assumed to be HTML. Otherwise an
attempt is made to open it as a URL, and if that fails it is assumed to be a
local disk file. If you want to parse a String after using the
no-args constructor, use setInputHTML(), or you can use
createParser(java.lang.String, java.lang.String).
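The detection order described above can be sketched in plain Java. This is only an illustration of the heuristic, not the library's code: the class ResourceKindSketch and its classify method are hypothetical names, and the URL step merely checks whether the string parses as an absolute URL instead of actually opening a connection as the parser does.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper illustrating the documented detection order; not part of the library.
final class ResourceKindSketch {
    enum Kind { HTML, URL, FILE }

    static Kind classify(String resource) {
        // Step 1: a leading '<' (ignoring whitespace) means the string is HTML.
        String trimmed = resource.trim();
        if (trimmed.startsWith("<")) {
            return Kind.HTML;
        }
        // Step 2: assumption of this sketch - "parses as an absolute URL"
        // stands in for "can be opened as a URL".
        try {
            new URI(trimmed).toURL();
            return Kind.URL;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Step 3: anything else is assumed to be a local file name.
            return Kind.FILE;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("  <html><body>hi</body></html>"));
        System.out.println(classify("http://example.com/index.html"));
        System.out.println(classify("pages/index.html"));
    }
}
```

Because the real parser tries to open the URL, a well-formed but unreachable URL may behave differently there than in this sketch.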
The Parser provides access to the contents of the
page, via a NodeIterator, a NodeList or a NodeVisitor.
Typical usage of the parser is:
Parser parser = new Parser ("http://whatever");
NodeList list = parser.parse (null);
// do something with your list of nodes.
What types of nodes there are and what can be done with them is dependent on the
setup, but in general a node can be converted back to HTML, and its
children (enclosed nodes) and parent can be obtained, because nodes are
nested. See the Node interface.
For example, if the URL contains a page whose markup is wrapped in a single
<html> element and the example code above is used, the list contains only one
element, the <html> node. This node is a tag, which is an object of class Html
if the default NodeFactory (a PrototypicalNodeFactory) is used.
To get at further content, the children of the top
level nodes must be examined. When digging through a node list one must be
conscious of the possibility of whitespace between nodes, e.g. in the example
above:
Node node = list.elementAt (0);
NodeList sublist = node.getChildren ();
System.out.println (sublist.size ());
would print out 5, not 2, because there are newlines after <html>,
</head> and </body> that are children of the HTML node
besides the <head> and <body> nodes.
Because processing nodes is so common, two interfaces are provided to
ease this task, filters and visitors.
-
Field Summary
Fields (Modifier and Type, Field, Description):
static final ParserFeedback DEVNULL - A quiet message sink.
protected ParserFeedback mFeedback - Feedback object.
protected Lexer mLexer - The html lexer associated with this parser.
static final ParserFeedback STDOUT - A verbose message sink.
static final String VERSION_DATE - The date of the version ("Jun 10, 2006").
static final double VERSION_NUMBER - The floating point version number (1.6).
static final String VERSION_STRING - The display version ("1.6 (Release Build Jun 10, 2006)").
static final String VERSION_TYPE - The type of version ("Release Build").
-
Constructor Summary
Constructors (Constructor, Description):
Parser() - Zero argument constructor.
Parser(String resource) - Creates a Parser object with the location of the resource (URL or file).
Parser(String resource, ParserFeedback feedback) - Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
Parser(URLConnection connection) - Construct a parser using the provided URLConnection.
Parser(URLConnection connection, ParserFeedback fb) - Constructor for custom HTTP access.
Parser(Lexer lexer) - Construct a parser using the provided lexer.
Parser(Lexer lexer, ParserFeedback fb) - Construct a parser using the provided lexer and feedback object.
-
Method Summary
Methods (Modifier and Type, Method, Description):
static Parser createParser(String html, String charset) - Creates the parser on an input string.
NodeIterator elements() - Returns an iterator (enumeration) over the html nodes.
NodeList extractAllNodesThatMatch(NodeFilter filter) - Extract all nodes matching the given filter.
URLConnection getConnection() - Return the current connection.
static ConnectionManager getConnectionManager() - Get the connection manager all Parsers use.
String getEncoding() - Get the encoding for the page this parser is reading from.
ParserFeedback getFeedback() - Returns the current feedback object.
Lexer getLexer() - Returns the lexer associated with the parser.
NodeFactory getNodeFactory() - Get the current node factory.
String getURL() - Return the current URL being parsed.
static String getVersion() - Return the version string of this parser.
static double getVersionNumber() - Return the version number of this parser.
static void main(String[] args) - The main program, which can be executed from the command line.
NodeList parse(NodeFilter filter) - Parse the given resource, using the filter provided.
void postConnect(HttpURLConnection connection) - Called just after calling connect.
void preConnect(HttpURLConnection connection) - Called just prior to calling connect.
void reset() - Reset the parser to start from the beginning again.
void setConnection(URLConnection connection) - Set the connection for this parser.
static void setConnectionManager(ConnectionManager manager) - Set the connection manager all Parsers use.
void setEncoding(String encoding) - Set the encoding for the page this parser is reading from.
void setFeedback(ParserFeedback fb) - Sets the feedback object used in scanning.
void setInputHTML(String inputHTML) - Initializes the parser with the given input HTML String.
void setLexer(Lexer lexer) - Set the lexer for this parser.
void setNodeFactory(NodeFactory factory) - Set the current node factory.
void setResource(String resource) - Set the html, a url, or a file.
void setURL(String url) - Set the URL for this parser.
void visitAllNodesWith(NodeVisitor visitor) - Apply the given visitor to the current page.
-
Field Details
-
VERSION_NUMBER
public static final double VERSION_NUMBER
The floating point version number (1.6).
-
VERSION_TYPE
The type of version ("Release Build").
-
VERSION_DATE
The date of the version ("Jun 10, 2006").
-
VERSION_STRING
The display version ("1.6 (Release Build Jun 10, 2006)").
-
mFeedback
Feedback object.
-
mLexer
The html lexer associated with this parser.
-
DEVNULL
A quiet message sink. Use this for no feedback.
-
STDOUT
A verbose message sink. Use this for output on System.out.
-
-
Constructor Details
-
Parser
public Parser()
Zero argument constructor. The parser is in a safe but useless state parsing an empty string. Set the lexer or connection using setLexer(org.htmlparser.lexer.Lexer) or setConnection(java.net.URLConnection).
-
Parser
public Parser(Lexer lexer, ParserFeedback fb)
Construct a parser using the provided lexer and feedback object. This would be used to create a parser for special cases where the normal creation of a lexer on a URLConnection needs to be customized.
- Parameters:
lexer - The lexer to draw characters from.
fb - The object to use when information, warning and error messages are produced. If null no feedback is provided.
-
Parser
public Parser(URLConnection connection, ParserFeedback fb) throws ParserException
Constructor for custom HTTP access. This would be used to create a parser for a URLConnection that needs a special setup or negotiation conditioning beyond what is available from the ConnectionManager.
- Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
fb - The object to use for message communication.
- Throws:
ParserException - If the creation of the underlying Lexer cannot be performed.
-
Parser
public Parser(String resource, ParserFeedback feedback) throws ParserException
Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
- Parameters:
resource - Either a URL, a filename or a string of HTML. The string is considered HTML if the first non-whitespace character is a <. The use of a URL or file is autodetected by first attempting to open the resource as a URL; if that fails it is assumed to be a file name. A standard HTTP GET is performed to read the content of the URL.
feedback - The HTMLParserFeedback object to use when information, warning and error messages are produced. If null no feedback is provided.
- Throws:
ParserException - If the URL is invalid.
-
Parser
public Parser(String resource) throws ParserException
Creates a Parser object with the location of the resource (URL or file). A DefaultHTMLParserFeedback object is used for feedback.
- Parameters:
resource - Either HTML, a URL or a filename (autodetects).
- Throws:
ParserException - If the resource argument does not resolve to a valid page or file.
-
Parser
public Parser(Lexer lexer)
Construct a parser using the provided lexer. A feedback object printing to System.out is used. This would be used to create a parser for special cases where the normal creation of a lexer on a URLConnection needs to be customized.
- Parameters:
lexer - The lexer to draw characters from.
-
Parser
public Parser(URLConnection connection) throws ParserException
Construct a parser using the provided URLConnection. This would be used to create a parser for a URLConnection that needs a special setup or negotiation conditioning beyond what is available from the ConnectionManager. A feedback object printing to System.out is used.
- Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
- Throws:
ParserException - If the creation of the underlying Lexer cannot be performed.
-
-
Method Details
-
getVersion
public static String getVersion()
Return the version string of this parser.
- Returns:
- A string of the form:
"[floating point number] ([build-type] [build-date])"
-
getVersionNumber
public static double getVersionNumber()
Return the version number of this parser.
- Returns:
- A floating point number: the whole number part is the major version, and the fractional part is the minor version.
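As a small illustration of the string format returned by getVersion(), the display string can be assembled from the three version values documented in the field details. The class below is a standalone sketch with its own copies of those values (1.6, "Release Build", "Jun 10, 2006"); the class and method names are hypothetical, and it does not claim to show how the library builds VERSION_STRING internally.

```java
// Hypothetical standalone sketch; not the library's implementation.
final class VersionStringSketch {
    // Local copies of the documented constant values.
    static final double VERSION_NUMBER = 1.6;
    static final String VERSION_TYPE = "Release Build";
    static final String VERSION_DATE = "Jun 10, 2006";

    // Builds "[floating point number] ([build-type] [build-date])".
    static String displayVersion() {
        return VERSION_NUMBER + " (" + VERSION_TYPE + " " + VERSION_DATE + ")";
    }

    public static void main(String[] args) {
        System.out.println(displayVersion());
    }
}
```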
-
getConnectionManager
public static ConnectionManager getConnectionManager()
Get the connection manager all Parsers use.
- Returns:
- The connection manager.
-
setConnectionManager
public static void setConnectionManager(ConnectionManager manager)
Set the connection manager all Parsers use.
- Parameters:
manager - The new connection manager.
-
createParser
public static Parser createParser(String html, String charset)
Creates the parser on an input string.
- Parameters:
html - The string containing HTML.
charset - Optional. The character set encoding that will be reported by getEncoding(). If charset is null the default character set is used.
- Returns:
- A parser with the html string as input.
- Throws:
IllegalArgumentException - if html is null.
-
setResource
public void setResource(String resource) throws ParserException
Set the HTML, a URL, or a file.
- Parameters:
resource - The resource to use.
- Throws:
IllegalArgumentException - if resource is null.
ParserException - if a problem occurs in connecting.
-
setConnection
public void setConnection(URLConnection connection) throws ParserException
Set the connection for this parser. This method creates a new Lexer reading from the connection.
- Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
- Throws:
ParserException - if the character set specified in the HTTP header is not supported, or an i/o exception occurs creating the lexer.
IllegalArgumentException - if connection is null.
ParserException - if a problem occurs in connecting.
-
getConnection
public URLConnection getConnection()
Return the current connection.
- Returns:
- The connection either created by the parser or passed into this parser via setConnection(java.net.URLConnection).
-
setURL
public void setURL(String url) throws ParserException
Set the URL for this parser. This method creates a new Lexer reading from the given URL. Trying to set the url to null or an empty string is a no-op.
- Parameters:
url - The new URL for the parser.
- Throws:
ParserException - If the url is invalid or creation of the underlying Lexer cannot be performed.
ParserException - if a problem occurs in connecting.
-
getURL
public String getURL()
Return the current URL being parsed.
- Returns:
- The current url. This is the URL for the current page. A string passed into the constructor or set via setURL may be altered, for example, a file name may be modified to be a URL.
-
setEncoding
public void setEncoding(String encoding) throws ParserException
Set the encoding for the page this parser is reading from.
- Parameters:
encoding - The new character set to use.
- Throws:
ParserException - If the encoding change causes characters that have already been consumed to differ from the characters that would have been seen had the new encoding been in force.
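The failure condition can be pictured as a comparison of the already-consumed input under the old and the new character set. The following standalone sketch illustrates that idea only; EncodingChangeSketch and isSafeToSwitch are hypothetical names, and the real check inside the library may well be implemented differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the documented condition; not the library's code.
final class EncodingChangeSketch {
    // True if the bytes consumed so far decode to the same characters under both charsets.
    static boolean isSafeToSwitch(byte[] consumed, Charset current, Charset replacement) {
        String before = new String(consumed, current);
        String after = new String(consumed, replacement);
        return before.equals(after);
    }

    public static void main(String[] args) {
        // Plain ASCII reads the same in ISO-8859-1 and UTF-8.
        byte[] ascii = "<html><head>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isSafeToSwitch(ascii, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Byte 0xE9 is an accented e in ISO-8859-1 but not a valid UTF-8 sequence on its own.
        byte[] accented = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isSafeToSwitch(accented, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

In this model a switch is harmless while only ASCII has been read, which is typically the case when a meta tag in the head announces the encoding.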
-
getEncoding
public String getEncoding()
Get the encoding for the page this parser is reading from. This item is set from the HTTP header but may be overridden by meta tags in the head, so this may change after the head has been parsed.
- Returns:
- The encoding currently in force.
-
setLexer
public void setLexer(Lexer lexer)
Set the lexer for this parser. The current NodeFactory is transferred to (set on) the given lexer, since the lexer owns the node factory object. It does not adjust the feedback object.
- Parameters:
lexer - The lexer object to use.
- Throws:
IllegalArgumentException - if lexer is null.
-
getLexer
public Lexer getLexer()
Returns the lexer associated with the parser.
- Returns:
- The current lexer.
-
getNodeFactory
public NodeFactory getNodeFactory()
Get the current node factory.
- Returns:
- The current lexer's node factory.
-
setNodeFactory
public void setNodeFactory(NodeFactory factory)
Set the current node factory.
- Parameters:
factory - The new node factory for the current lexer.
- Throws:
IllegalArgumentException - if factory is null.
-
setFeedback
public void setFeedback(ParserFeedback fb)
Sets the feedback object used in scanning.
- Parameters:
fb - The new feedback object to use. If this is null a silent feedback object is used.
-
getFeedback
public ParserFeedback getFeedback()
Returns the current feedback object.
- Returns:
- The feedback object currently being used.
-
reset
public void reset()
Reset the parser to start from the beginning again. This assumes support for a reset from the underlying Source object.
This is cheaper (in terms of time) than resetting the URL, i.e.
parser.setURL (parser.getURL ());
because the page is not refetched from the internet. Note: the nodes returned on the second parse are new nodes and not the same nodes returned on the first parse. If you want the same nodes for re-use, collect them in a NodeList with parse(null) and operate on the NodeList.
elements
public NodeIterator elements() throws ParserException
Returns an iterator (enumeration) over the html nodes.
Nodes can be of three main types: TagNode, TextNode and RemarkNode.
In general, when parsing with an iterator or processing a NodeList, you will need to use recursion. For example:
void processMyNodes (Node node)
{
    if (node instanceof TextNode)
    {
        // downcast to TextNode
        TextNode text = (TextNode)node;
        // do whatever processing you want with the text
        System.out.println (text.getText ());
    }
    else if (node instanceof RemarkNode)
    {
        // downcast to RemarkNode
        RemarkNode remark = (RemarkNode)node;
        // do whatever processing you want with the comment
    }
    else if (node instanceof TagNode)
    {
        // downcast to TagNode
        TagNode tag = (TagNode)node;
        // do whatever processing you want with the tag itself
        // ...
        // process recursively (nodes within nodes) via getChildren()
        NodeList nl = tag.getChildren ();
        if (null != nl)
            for (NodeIterator i = nl.elements (); i.hasMoreElements (); )
                processMyNodes (i.nextNode ());
    }
}

Parser parser = new Parser ("http://www.yahoo.com");
for (NodeIterator i = parser.elements (); i.hasMoreElements (); )
    processMyNodes (i.nextNode ());
- Returns:
- An iterator over the top level nodes (usually <html>).
- Throws:
ParserException - If a parsing error occurs.
-
parse
public NodeList parse(NodeFilter filter) throws ParserException
Parse the given resource, using the filter provided. This can be used to extract information from specific nodes. When used with a null filter it returns an entire page which can then be modified and converted back to HTML. (Note: the synthesis use-case is not handled very well; the parser is more often used to extract information from a web page.)
For example, to replace the entire contents of the HEAD with a single TITLE tag you could do this:
NodeList nl = parser.parse (null); // here is your two node list
NodeList heads = nl.extractAllNodesThatMatch (new TagNameFilter ("HEAD"));
if (heads.size () > 0) // there may not be a HEAD tag
{
    Head head = heads.elementAt (0); // there should be only one
    head.removeAll (); // clean out the contents
    Tag title = new TitleTag ();
    title.setTagName ("title");
    title.setChildren (new NodeList (new TextNode ("The New Title")));
    Tag title_end = new TitleTag ();
    title_end.setTagName ("/title");
    title.setEndTag (title_end);
    head.add (title);
}
System.out.println (nl.toHtml ()); // output the modified HTML
- Parameters:
filter - The filter to apply to the parsed nodes, or null to retrieve all the top level nodes.
- Returns:
- The list of matching nodes (for a null filter this is all the top level nodes).
- Throws:
ParserException - If a parsing error occurs.
-
visitAllNodesWith
public void visitAllNodesWith(NodeVisitor visitor) throws ParserException
Apply the given visitor to the current page. The visitor is passed to the accept() method of each node in the page in a depth first traversal. The visitor's beginParsing() method is called prior to processing the page and finishedParsing() is called after the processing.
- Parameters:
visitor - The visitor to visit all nodes with.
- Throws:
ParserException - If a parse error occurs while traversing the page with the visitor.
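The call order described for visitAllNodesWith (beginParsing, then a depth first walk, then finishedParsing) can be modelled without the library. The sketch below uses a toy node type and a toy visitor interface; ToyNode, ToyVisitor, visitAll and trace are hypothetical stand-ins and deliberately simpler than the real Node and NodeVisitor types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the traversal order; not the library's classes.
final class VisitorOrderSketch {
    // Toy stand-in for a parsed node: a name plus child nodes.
    static final class ToyNode {
        final String name;
        final List<ToyNode> children = new ArrayList<>();
        ToyNode(String name, ToyNode... kids) {
            this.name = name;
            for (ToyNode k : kids) children.add(k);
        }
        // Depth first: the node itself, then each child subtree in order.
        void accept(ToyVisitor v) {
            v.visit(this);
            for (ToyNode c : children) c.accept(v);
        }
    }

    interface ToyVisitor {
        void beginParsing();
        void visit(ToyNode node);
        void finishedParsing();
    }

    // Brackets the traversal of all top level nodes with the two callbacks.
    static void visitAll(List<ToyNode> topLevel, ToyVisitor v) {
        v.beginParsing();
        for (ToyNode n : topLevel) n.accept(v);
        v.finishedParsing();
    }

    // Records the order of callbacks for a small html/head/title/body tree.
    static List<String> trace() {
        final List<String> log = new ArrayList<>();
        ToyNode html = new ToyNode("html",
                new ToyNode("head", new ToyNode("title")),
                new ToyNode("body"));
        List<ToyNode> page = new ArrayList<>();
        page.add(html);
        visitAll(page, new ToyVisitor() {
            public void beginParsing() { log.add("begin"); }
            public void visit(ToyNode node) { log.add(node.name); }
            public void finishedParsing() { log.add("finished"); }
        });
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```

The title node is logged before body because the walk finishes the whole head subtree before moving to the next sibling.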
-
setInputHTML
public void setInputHTML(String inputHTML) throws ParserException
Initializes the parser with the given input HTML String.
- Parameters:
inputHTML - The input HTML that is to be parsed.
- Throws:
ParserException - If an error occurs in setting up the underlying Lexer.
IllegalArgumentException - if inputHTML is null.
-
extractAllNodesThatMatch
public NodeList extractAllNodesThatMatch(NodeFilter filter) throws ParserException
Extract all nodes matching the given filter.
- Parameters:
filter - The filter to be applied to the nodes.
- Returns:
- A list of nodes matching the filter criteria, i.e. for which the filter's accept method returned true.
- Throws:
ParserException - If a parse error occurs.
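The idea of keeping only the nodes for which a filter's accept method returns true can be sketched with a toy tree. Everything here is hypothetical: ToyNode and collectInto are illustrative names, java.util.function.Predicate stands in for the filter, and searching below the top level nodes is an assumption of this sketch, not something the text above states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of filter-based extraction; not the library's classes.
final class FilterSketch {
    // Toy stand-in for a parsed node.
    static final class ToyNode {
        final String name;
        final List<ToyNode> children = new ArrayList<>();
        ToyNode(String name, ToyNode... kids) {
            this.name = name;
            for (ToyNode k : kids) children.add(k);
        }
    }

    // Adds the node if the filter accepts it, then searches its children.
    static void collectInto(ToyNode node, Predicate<ToyNode> filter, List<ToyNode> out) {
        if (filter.test(node)) out.add(node);
        for (ToyNode c : node.children) collectInto(c, filter, out);
    }

    // Counts the nodes named "a" in a small sample tree.
    static int countLinks() {
        ToyNode html = new ToyNode("html",
                new ToyNode("head", new ToyNode("title")),
                new ToyNode("body", new ToyNode("a"), new ToyNode("p", new ToyNode("a"))));
        List<ToyNode> matches = new ArrayList<>();
        collectInto(html, n -> n.name.equals("a"), matches);
        return matches.size();
    }

    public static void main(String[] args) {
        System.out.println(countLinks());
    }
}
```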
-
preConnect
public void preConnect(HttpURLConnection connection) throws ParserException
Called just prior to calling connect. Part of the ConnectionMonitor interface, this implementation just sends the request header to the feedback object if any.
- Specified by:
preConnect in interface ConnectionMonitor
- Parameters:
connection - The connection which is about to be connected.
- Throws:
ParserException - Not used.
-
postConnect
public void postConnect(HttpURLConnection connection) throws ParserException
Called just after calling connect. Part of the ConnectionMonitor interface, this implementation just sends the response header to the feedback object if any.
- Specified by:
postConnect in interface ConnectionMonitor
- Parameters:
connection - The connection that was just connected.
- Throws:
ParserException - Not used.
-
main
public static void main(String[] args)
The main program, which can be executed from the command line.
- Parameters:
args - A URL or file name to parse, and an optional tag name to be used as a filter.
-